Exécutez n'importe quel Skill dans Manus
en un clic

Exécutez n'importe quel Skill dans Manus en un clic

$pwd:

debug

Name: Debug
Author: fcakyon

// Evidence-before-action diagnosis of failing ML experiments. Probes the system before guessing causes, process list, dmesg, GPU stats, log scrollback, checkpoint state, then states a hypothesis as a hypothesis and runs a smoke before claiming a root cause. Use when the user asks why a run is failing, diverging, OOMing, hanging, slow, producing weird metrics, has crashed, or asks to debug, diagnose, troubleshoot, or investigate a training issue.

Exécuter dans Manus

$ git log --oneline --stat

stars:250

forks:24

updated:30 avril 2026 à 11:03

SKILL.md

readonly

name

debug

description

Evidence-before-action diagnosis of failing ML experiments. Probes the system before guessing causes, process list, dmesg, GPU stats, log scrollback, checkpoint state, then states a hypothesis as a hypothesis and runs a smoke before claiming a root cause. Use when the user asks why a run is failing, diverging, OOMing, hanging, slow, producing weird metrics, has crashed, or asks to debug, diagnose, troubleshoot, or investigate a training issue.

Debug: evidence-before-action investigation

The most expensive class of mistake in ML debugging is asserting a cause based on plausibility, then attempting a "fix" that masks the real problem. This skill enforces the discipline of probe → hypothesis → smoke → controls → claim, in that order.

The agentic Stop hook routes here from reason when an assistant claims a cause without backing tool output.

When to run

The user just said any of:

"why is X failing / diverging / NaN / OOM / hung / slow / crashed"
"the loss is going up", "metrics look weird", "GPU util is 0"
"debug this", "diagnose", "troubleshoot", "investigate this run"
pasted a log excerpt asking what's wrong

Five-step protocol

Step 1: cheap probes

Before forming any hypothesis, gather the cheap evidence. None of these cost more than a few seconds:

Process state:

ps aux | grep -E '(python|train|torchrun|accelerate)' | grep -v grep

Is the process still running? Zombie? Defunct? Multiple instances?

Kernel / system events:

dmesg | tail -100 # OOM kills, hardware errors, NFS errors
journalctl -xe --since "1 hour ago" | tail -50

GPU state:

nvidia-smi
nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu --format=csv

Is the GPU even being used? Idle GPU during "training" means the process is blocked on data loading or has died.

Disk / filesystem:

df -h /path/to/run-dir
du -sh /path/to/run-dir/*

Out of disk? Checkpoints not being written?

Log scrollback: Read the last few hundred lines of the training log. Don't trust the user's summary, they may have skimmed. Look for:

exception tracebacks
repeated "loss=NaN" or "grad_norm=Inf"
early-stop announcements (the run may have completed normally)
the last successful epoch / step (where did progress stop)

Checkpoint state:

ls -la /path/to/run-dir/checkpoints/

When was the last checkpoint written? What does its size suggest? An empty .pt is different from a 2GB one cut short.

Step 2: hypothesis (labeled as hypothesis)

After the probe, state what might be happening, explicitly framed as a hypothesis:

"Hypothesis: the run is OOMing because dmesg shows oom-kill 3 minutes ago and the process is gone. Alternative hypotheses I haven't ruled out: (a) NFS write timeout, (b) explicit kill from a sibling process."

Never skip to "the cause is X." The hypothesis labels what you don't yet know.

Step 3: smoke run

The cheapest way to confirm or refute a hypothesis is to reproduce the failure shape under a controlled condition:

OOM hypothesis: rerun with batch_size=1 for 1 step. If it survives, OOM is confirmed; if it fails the same way, OOM is wrong.
Data hypothesis: rerun with a synthetic in-memory dataset. If it works, the data path is implicated.
Model hypothesis: forward pass only on a single batch with eval() mode. Loss finite? Outputs sane?
Optimizer hypothesis: rerun with lr=0. If the loss still explodes, the loss itself is broken (not the optimizer).
Distributed hypothesis: rerun on 1 GPU. If it works, DDP / NCCL is implicated.

A 30-second smoke beats a 30-minute restart-and-pray.

Step 4: controls

If the smoke is ambiguous, run a control: change exactly one variable from the failing config and rerun the smoke. The differences narrow what mechanism is responsible.

Common control axes (change one at a time):

single-source vs multi-source data
default workers vs adjusted workers
mixed-precision on vs off
gradient checkpointing on vs off
torch.compile on vs off

Step 5: claim cause

Only after evidence stacks up, probe, smoke, control, do you assert a cause. The claim should cite the specific tool output that proves it:

"Root cause: NFS write timeout. Evidence: dmesg shows nfs server X not responding at 14:23 (the same minute the last checkpoint was written), and the smoke with batch=1 reproduces the timeout. Recommended fix: bind-mount a local scratch dir for checkpoints and rsync to NFS at end of epoch."

If the evidence isn't stacking up, do not promote a hypothesis to a cause. Say "I don't yet know" and propose the next probe.

What to avoid

"It's probably X, let me try Y" → no. Probe first.
Restarting the run with a small change as the diagnostic. Smoke first, then restart deliberately.
Citing only the user's narrative as evidence: re-read the actual log.
Stopping at the first plausible cause when artifacts contradict it.

Output

A concise diagnostic report: (1) what the probes showed, (2) the hypothesis, (3) the smoke outcome, (4) the cause-or-uncertain verdict, (5) the recommended next action. Each claim cites the tool output that backs it.

related-skills.json

même dépôt

compare.md

from "fcakyon/phd-skills"

Same-epoch comparison of training runs across wandb, neptune, tensorboard, or mlflow. Aligns runs at the student's current step (never current-vs-final-of-baseline) and separates proxy metrics from downstream targets. Use when the user asks to compare runs, check if a run is improving, track lag against a baseline, rank experiments, or evaluate run-vs-run performance.

2026-04-30250

launch.md

from "fcakyon/phd-skills"

Pre-flight checklist for long-running ML training jobs covering config diff, run naming, path verification, monitoring setup, and restart-cleanup. Use when the user asks to launch, kick off, start, restart, or kill a training run, or mentions launching a multi-hour or multi-day GPU job (python train, accelerate launch, torchrun, deepspeed, sbatch, tmux training).

2026-04-30250

reproduce.md

from "fcakyon/phd-skills"

End-to-end paper reproduction from arxiv URL through smoke runs to replication experiments. Handles missing or partial official code, missing training scripts, missing hyperparameters, and private datasets via similar-public-dataset substitution. Use when the user asks to reproduce, implement, replicate, or re-run a paper from scratch, or pastes an arxiv URL with reproduction intent.

2026-04-30250

dataset-curation.md

from "fcakyon/phd-skills"

Use when the user wants to analyze dataset bias, create stratified samples, evaluate fairness, or plan dataset collection. Triggers on phrases like "dataset bias", "stratified sample", "class imbalance", "data distribution", "fairness analysis", or "ethical review".

2026-03-12250

experiment-design.md

from "fcakyon/phd-skills"

Use when the user wants to design experiments, plan ablation studies, structure baselines, or create incremental evaluation strategies. Triggers on phrases like "design ablation", "plan experiment", "what experiments should I run", "baseline comparison", or "experiment matrix".

2026-03-12250

latex-setup.md

from "fcakyon/phd-skills"

Use when the user wants to set up or troubleshoot a LaTeX environment, choose between biber and bibtex, install packages for a specific venue template, or configure compilation. Triggers on phrases like "setup latex", "biber vs bibtex", "latex compilation error", "install latex packages", "venue template", or "texlive setup".

2026-03-12250

package.json

"author": "fcakyon"

"repository": "fcakyon/phd-skills"

Ouvrir le dépôt GitHub Voir les dépôts du créateur

$ install --global

$ download --local

Exécuter dans Manus

$ useful --forSOC

Scientifiques des donnéesProfessions informatiques et mathématiques15-2051L4

name

debug

description

Debug: evidence-before-action investigation

The agentic Stop hook routes here from reason when an assistant claims a cause without backing tool output.

When to run

The user just said any of:

"why is X failing / diverging / NaN / OOM / hung / slow / crashed"
"the loss is going up", "metrics look weird", "GPU util is 0"
"debug this", "diagnose", "troubleshoot", "investigate this run"
pasted a log excerpt asking what's wrong

Five-step protocol

Step 1: cheap probes

Before forming any hypothesis, gather the cheap evidence. None of these cost more than a few seconds:

Process state:

ps aux | grep -E '(python|train|torchrun|accelerate)' | grep -v grep

Is the process still running? Zombie? Defunct? Multiple instances?

Kernel / system events:

dmesg | tail -100 # OOM kills, hardware errors, NFS errors
journalctl -xe --since "1 hour ago" | tail -50

GPU state:

nvidia-smi
nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu --format=csv

Is the GPU even being used? Idle GPU during "training" means the process is blocked on data loading or has died.

Disk / filesystem:

df -h /path/to/run-dir
du -sh /path/to/run-dir/*

Out of disk? Checkpoints not being written?

Log scrollback: Read the last few hundred lines of the training log. Don't trust the user's summary, they may have skimmed. Look for:

exception tracebacks
repeated "loss=NaN" or "grad_norm=Inf"
early-stop announcements (the run may have completed normally)
the last successful epoch / step (where did progress stop)

Checkpoint state:

ls -la /path/to/run-dir/checkpoints/

When was the last checkpoint written? What does its size suggest? An empty .pt is different from a 2GB one cut short.

Step 2: hypothesis (labeled as hypothesis)

After the probe, state what might be happening, explicitly framed as a hypothesis:

"Hypothesis: the run is OOMing because dmesg shows oom-kill 3 minutes ago and the process is gone. Alternative hypotheses I haven't ruled out: (a) NFS write timeout, (b) explicit kill from a sibling process."

Never skip to "the cause is X." The hypothesis labels what you don't yet know.

Step 3: smoke run

The cheapest way to confirm or refute a hypothesis is to reproduce the failure shape under a controlled condition:

OOM hypothesis: rerun with batch_size=1 for 1 step. If it survives, OOM is confirmed; if it fails the same way, OOM is wrong.
Data hypothesis: rerun with a synthetic in-memory dataset. If it works, the data path is implicated.
Model hypothesis: forward pass only on a single batch with eval() mode. Loss finite? Outputs sane?
Optimizer hypothesis: rerun with lr=0. If the loss still explodes, the loss itself is broken (not the optimizer).
Distributed hypothesis: rerun on 1 GPU. If it works, DDP / NCCL is implicated.

A 30-second smoke beats a 30-minute restart-and-pray.

Step 4: controls

If the smoke is ambiguous, run a control: change exactly one variable from the failing config and rerun the smoke. The differences narrow what mechanism is responsible.

Common control axes (change one at a time):

single-source vs multi-source data
default workers vs adjusted workers
mixed-precision on vs off
gradient checkpointing on vs off
torch.compile on vs off

Step 5: claim cause

Only after evidence stacks up, probe, smoke, control, do you assert a cause. The claim should cite the specific tool output that proves it:

"Root cause: NFS write timeout. Evidence: dmesg shows nfs server X not responding at 14:23 (the same minute the last checkpoint was written), and the smoke with batch=1 reproduces the timeout. Recommended fix: bind-mount a local scratch dir for checkpoints and rsync to NFS at end of epoch."

If the evidence isn't stacking up, do not promote a hypothesis to a cause. Say "I don't yet know" and propose the next probe.

What to avoid

"It's probably X, let me try Y" → no. Probe first.
Restarting the run with a small change as the diagnostic. Smoke first, then restart deliberately.
Citing only the user's narrative as evidence: re-read the actual log.
Stopping at the first plausible cause when artifacts contradict it.

debug

Debug: evidence-before-action investigation

When to run

Five-step protocol

Step 1: cheap probes

Step 2: hypothesis (labeled as hypothesis)

Step 3: smoke run

Step 4: controls

Step 5: claim cause

What to avoid

Output

Plus depuis ce dépôt

Plus depuis ce dépôt

Debug: evidence-before-action investigation

When to run

Five-step protocol

Step 1: cheap probes

Step 2: hypothesis (labeled as hypothesis)

Step 3: smoke run

Step 4: controls

Step 5: claim cause

What to avoid

Output