debug-training-logs
| name | debug-training-logs |
| description | Debug distributed training failures (NeMo, Megatron, PyTorch) from worker stderr logs and optional AIStore daemon logs. Finds root cause across NCCL timeouts, data loading errors, and storage failures. |
| disable-model-invocation | true |
| allowed-tools | Bash Read Grep Glob Agent |
| argument-hint | <path-to-logs-dir> [ais-logs-dir] |
You are debugging a distributed training job failure. The user will provide one or two directories:
- $ARGUMENTS[0] (required): directory of worker stderr logs
- $ARGUMENTS[1] (optional): directory of AIStore daemon logs

Your goal: find the root cause, not just the symptom. Failures often cascade: one root cause triggers many downstream errors. Trace backwards from the final crash to the original trigger.
You MUST double-check every conclusion before presenting it to the user. Distributed training failures have cascading effects that make it easy to mistake a symptom for the root cause. Follow these rules:
- Run sort -u on the full output to find outliers. A single outlier rank can be the entire root cause.
- Compare last enqueued work and last completed work for all ranks and check for ANY rank that differs. A rank with enqueued == completed (no pending ops) is fundamentally different from ranks with enqueued > completed (ops pending): it means that rank never entered the stuck collective.

If the user has not provided worker logs, ask them to download and provide the SLURM error logs. The logs are typically on the compute nodes or in a shared filesystem. Suggest:
# From the user's machine, SCP the SLURM error logs:
scp <cluster>:/path/to/slurm/logs/error-<JOBID>-*.out ./training_logs/
# Or if the logs are on a shared filesystem accessible from a login node:
mkdir -p ./training_logs
cp /path/to/slurm/error-<JOBID>-*.out ./training_logs/
The user needs to provide ALL per-node error files (e.g., error-JOBID-0.out through error-JOBID-N.out) — not just one node. Without all files you cannot identify which specific rank caused the failure.
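A quick completeness check can catch a missing node file before any analysis starts. The following is a minimal sketch, assuming the files follow the error-JOBID-INDEX.out naming shown above with indices running from 0 to a known last index; the function name missing_node_logs is illustrative only.

```shell
# Print the name of every expected per-node error file that is absent.
# ASSUMPTION: files are named error-<JOBID>-<index>.out, indices 0..LAST.
missing_node_logs() (
  dir=$1
  jobid=$2
  last=$3
  i=0
  while [ "$i" -le "$last" ]; do
    [ -f "$dir/error-$jobid-$i.out" ] || echo "error-$jobid-$i.out"
    i=$((i + 1))
  done
)

# Example (hypothetical job id and node count):
#   missing_node_logs ./training_logs 123456 7
```

Empty output means every expected file is present. Any printed name is a file to request from the user before drawing conclusions.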
Read the first 80 lines of a few log files to determine:
Search ALL log files in parallel for these patterns (priority order):
Tier 1 — Process killers:
NCCL.*timeout|Watchdog caught collective operation timeout
taking the entire process down
SIGTERM|SIGKILL|SIGABRT
CUDA error|CUDA out of memory|OOM
Tier 2 — Training loop crashes:
RuntimeError|Exception.*Error
AISBatchLoaderError|StopIteration
Traceback \(most recent call last\)
Tier 3 — Data loading / IO:
Connection reset|Connection broken|Connection refused
retrying [0-9]+/[0-9]+
timed out|deadline exceeded
broken pipe
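To get a first overview, the three tiers can be counted across all files in one pass. This is a rough sketch that reuses the patterns listed above; the helper names count_lines and scan_tiers are made up for illustration, and the error-*.out glob assumes the SLURM naming used earlier.

```shell
# Print the total number of lines matching an extended regex across files.
# $1 = pattern, remaining arguments = files.
count_lines() (
  pattern=$1
  shift
  # grep -c exits non-zero on zero matches but still prints 0.
  cat "$@" | grep -cE "$pattern" || true
)

# Print one "tierN <count>" line per tier for all error-*.out files in a directory.
scan_tiers() (
  dir=$1
  t1='NCCL.*timeout|Watchdog caught collective operation timeout|taking the entire process down|SIGTERM|SIGKILL|SIGABRT|CUDA error|CUDA out of memory|OOM'
  t2='RuntimeError|Exception.*Error|AISBatchLoaderError|StopIteration|Traceback \(most recent call last\)'
  t3='Connection reset|Connection broken|Connection refused|retrying [0-9]+/[0-9]+|timed out|deadline exceeded|broken pipe'
  printf 'tier1 %s\n' "$(count_lines "$t1" "$dir"/error-*.out)"
  printf 'tier2 %s\n' "$(count_lines "$t2" "$dir"/error-*.out)"
  printf 'tier3 %s\n' "$(count_lines "$t3" "$dir"/error-*.out)"
)

# Example: scan_tiers ./training_logs
```

The counts only show where to look first. A line can match more than one tier, and a high Tier 3 count is not by itself evidence of a root cause.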
For NCCL timeouts, you MUST check ALL ranks — do not sample. Extract the full NCCL state for every rank:
# Get watchdog state for all ranks that fired their own watchdog:
grep "failure detected by watchdog" error-*.out | grep -o "Rank [0-9]*.*last enqueued work: [0-9]*, last completed work: [0-9]*" | sort -u
# Get ranks that were NOTIFIED (did not fire their own watchdog):
grep "Observed flight recorder dump signal from another rank" error-*.out | grep -o "Rank [0-9]*"
# Count total unique ranks found vs expected:
grep "failure detected by watchdog" error-*.out | grep -o "Rank [0-9]*" | sort -t' ' -k2 -n -u | wc -l
The rank that "Observed" instead of "detected" is likely the straggler — it had no pending NCCL operations because it never entered the collective. Check its Last enqueued NCCL work — if it equals last completed NCCL work, the rank was stuck OUTSIDE NCCL (in the training loop, data loading, etc.), not inside a collective.
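The set difference between "observed" ranks and "detected" ranks can be computed directly. A minimal sketch, assuming each relevant log line contains the rank as "Rank N" (as the grep commands above do); notified_only_ranks is an illustrative name.

```shell
# Print ranks that logged the "Observed flight recorder dump signal" message
# but never logged "failure detected by watchdog".
# ASSUMPTION: both kinds of line contain the rank as "Rank <number>".
notified_only_ranks() (
  work=$(mktemp -d)
  cat "$@" | grep "failure detected by watchdog" |
    grep -oE "Rank [0-9]+" | sort -u > "$work/detected"
  cat "$@" | grep "Observed flight recorder dump signal from another rank" |
    grep -oE "Rank [0-9]+" | sort -u > "$work/observed"
  # comm -13 keeps lines that appear only in the second (observed) list.
  comm -13 "$work/detected" "$work/observed"
  rm -rf "$work"
)

# Example: notified_only_ranks error-*.out
```

Any rank printed here is a straggler candidate and deserves a manual look at its surrounding log lines.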
For NCCL BROADCAST/ALLREDUCE timeouts, calculate when the collective started:
start_time = timeout_time - timeout_ms (usually 1800000ms = 30min)
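The subtraction above can be done with date arithmetic. This sketch assumes GNU date and treats the input timestamp as UTC, which may not hold for a given log (verify the timezone first); collective_start is an illustrative name.

```shell
# Print the estimated start time of the stuck collective.
# $1 = timeout timestamp "YYYY-MM-DD HH:MM:SS" (ASSUMED to be UTC)
# $2 = timeout in milliseconds (defaults to 1800000, i.e. 30 minutes)
# Requires GNU date (-d option).
collective_start() (
  timeout_time=$1
  timeout_ms=${2:-1800000}
  end_epoch=$(date -u -d "$timeout_time" +%s)
  date -u -d "@$((end_epoch - timeout_ms / 1000))" '+%Y-%m-%d %H:%M:%S'
)

# Example:
#   collective_start "2024-05-01 12:30:00"   # prints 2024-05-01 12:00:00
```

Whatever was happening on the straggler rank at the printed time is where the investigation should focus.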
- Connection reset across all files (total and per-file)
- AISBatchLoaderError or similar batch loader errors: these indicate AIStore returned fewer objects than requested

Determine the causal chain: which error happened first, and which are consequences. The pattern is usually:
Data loading error (root cause)
-> Some ranks crash out of training loop
-> Crashed ranks can't participate in NCCL collective
-> NCCL collective hangs for timeout period (usually 30min)
-> Watchdog kills all remaining ranks
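Within a single log file, line order is a usable proxy for event order. The sketch below prints the first line number at which each stage of the chain appears; the stage labels and the function name first_hits are illustrative, and the patterns are a small subset of the tiers above. Ordering events across files still requires comparing timestamps.

```shell
# Print "<line_number> <stage>" for the first occurrence of each stage in one
# log file, sorted by line number. Stages with no match are omitted.
first_hits() (
  file=$1
  {
    grep -n -m1 -E 'AISBatchLoaderError|StopIteration' "$file" | sed 's/:.*/ data-loading/'
    grep -n -m1 -E 'Traceback \(most recent call last\)|RuntimeError' "$file" | sed 's/:.*/ training-crash/'
    grep -n -m1 -E 'Watchdog caught collective operation timeout' "$file" | sed 's/:.*/ nccl-timeout/'
    grep -n -m1 -E 'taking the entire process down' "$file" | sed 's/:.*/ process-killed/'
  } | sort -n
)

# Example: first_hits error-123456-0.out
```

If data-loading appears before nccl-timeout in a rank's file, that matches the usual pattern; if the order is reversed, the data loading error may itself be a consequence.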
NeMo has per-step collective operations that can cause rank desynchronization:
- PreemptionCallback (nemo/utils/callbacks/preemption.py): Calls torch.distributed.broadcast(interrupted, 0) at the end of EVERY training step via on_train_batch_end. If one rank's training step takes >30 min longer than others, this broadcast times out.
- Checkpoint callback (nemo/utils/callbacks/nemo_model_checkpoint.py): Multiple trainer.strategy.broadcast() calls during checkpoint save/load.
- broadcast_buffers (DDP default=True): Broadcasts model buffers (e.g., batch norm stats) from rank 0 at each forward pass.

If rank 0 is ahead of other ranks in NCCL SeqNum, check whether the gap matches the number of per-step collectives (PreemptionCallback broadcast + DDP allreduce + broadcast_buffers = ~3 ops per step).
If the analysis points to storage I/O issues (connection resets, data loading errors, timeouts) and the user did NOT provide AIS daemon logs, suggest downloading them. First check if the ais CLI is available (which ais). If it is, guide the user through:
Set the cluster endpoint (ask the user for the endpoint URL):
export AIS_ENDPOINT=https://<ais-cluster-endpoint>:<port>
Set the auth token (ask the user for the token value):
export AIS_AUTHN_TOKEN=<token>
Handle TLS — skip cert verification or point to the CA cert:
# Option A: skip verification
ais config cli set cluster.skip_verify_crt=true
# Option B: set CA cert
export AIS_SERVER_CRT=/path/to/ca.crt
Download all cluster logs to the current log directory:
ais log get cluster <path-to-worker-logs-dir>/ais_logs
This downloads TAR.GZ archives from all proxy and target nodes.
Extract and analyze per Phase 2 below.
If the ais CLI is not installed, it can be built from the AIStore repository (cd cmd/cli && go install .) or downloaded as a binary. Alternatively, ask the user to download the logs manually from the AIS cluster.
If tarballs (.tar.gz), extract them:
mkdir -p extracted && cd extracted
for f in ../*.tar.gz; do
name=$(basename "$f" .tar.gz)
mkdir -p "$name" && tar xzf "$f" -C "$name"
done
AIS log file naming and time ranges:
AIS daemon logs follow the naming convention:
aistarget.ais-target-N.INFO.MMDD-HHMMSS.1 # target logs
aisproxy.ais-proxy-N.INFO.MMDD-HHMMSS.1 # proxy logs
A single daemon (target or proxy) may have multiple log files — each file covers a specific time range:
- MMDD-HHMMSS in the filename is the start time of that file

To find the correct file for a failure window:
Always check ALL files for a daemon, not just the latest. The latest file may only cover minutes after a restart, while the failure evidence is in an earlier file.
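Since the filename carries the start time, the file covering a given moment is the one with the latest start stamp that is not after that moment. A minimal sketch, assuming the naming convention above, a failure stamp already converted to the same timezone, no year rollover between files, and paths without whitespace; pick_log_file is an illustrative name.

```shell
# Print the log file whose MMDD-HHMMSS start stamp is the latest one
# that is not after the failure stamp. Prints nothing if no file qualifies.
# $1 = failure stamp as MMDD-HHMMSS, remaining arguments = log file paths.
pick_log_file() (
  failure=$1
  shift
  for f in "$@"; do
    stamp=$(basename "$f" | sed -n 's/.*\.INFO\.\([0-9]\{4\}-[0-9]\{6\}\)\..*/\1/p')
    [ -n "$stamp" ] && printf '%s %s\n' "$stamp" "$f"
  done | sort |
    awk -v t="$failure" '($1 "") <= (t "") { best = $2 } END { if (best != "") print best }'
)

# Example (hypothetical failure time of May 1st, 13:05:00):
#   pick_log_file 0501-130500 extracted/*/aistarget.ais-target-3.INFO.*
```

This only picks the most likely starting point for one daemon. The neighbouring files still need checking, because the lead-up to a failure can sit in the previous file.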
Timezone verification: AIS daemon logs and worker (SLURM/torchrun) logs may be on different machines in different timezones. NEVER assume they match. To verify:
- AIS logs contain marker lines of the form common:NNN DD Mon YY HH:MM UTC =============, which explicitly state the timezone (typically UTC).
- Worker-side logs commonly use YYYY-MM-DD HH:MM:SS, and SLURM uses YYYY-MM-DDThh:mm:ss, but neither includes a timezone by default.
- Find the CANCELLED timestamp in worker logs, then find the exact moment get.n stops incrementing in AIS stats. If they align, the timezones match. If they're offset by a round number of hours, one system is in a different timezone.

AIStore targets emit periodic stats lines (every ~3 min). Extract key counters around the failure time:
Critical counters to track:
- `err.get.n`: GET errors (should be stable; spikes indicate problems)
- `err.getbatch.n`: batch GET errors
- `err.http.write.n`: broken HTTP responses to clients
- `err.put.n`, `err.head.n`, `err.lst.n`: other operation errors
- `get.n` vs `err.get.n`: compute error rate
- `getbatch.n`, `getbatch.obj.n`: batch operation counts

Compare values between consecutive stats lines to find the delta (new errors in that interval).
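The per-interval delta can be computed mechanically once the counter values are extracted. The sketch below assumes each stats line contains the counter as its name (optionally quoted), followed by ':' or '=', followed by an integer. That layout is an assumption about the log format, so check it against a real stats line first. counter_deltas is an illustrative name.

```shell
# Print "<line_number> <value> <delta>" for one counter across a stats log.
# ASSUMPTION: the counter appears as name, optional quotes, ':' or '=', integer,
# at most once per line.
# $1 = counter name (e.g. err.get.n), $2 = log file.
counter_deltas() (
  # Escape dots so they match literally in the regex.
  name=$(printf '%s' "$1" | sed 's/\./\\./g')
  file=$2
  # The leading group prevents "get.n" from matching inside "err.get.n".
  grep -n -oE "(^|[^A-Za-z0-9_.])\"?$name\"?[:=] *[0-9]+" "$file" |
    sed -E 's/^([0-9]+):.*[^0-9]([0-9]+)$/\1 \2/' |
    awk '{ print $1, $2, (NR > 1 ? $2 - prev : 0); prev = $2 }'
)

# Example: counter_deltas err.get.n aistarget.ais-target-3.INFO.0501-120000.1
```

A long run of zero deltas followed by a sudden jump marks the interval in which the errors started; that interval is what to line up against the worker-side timeline.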
grep "^E " <logfile> # Error messages
grep "^W " <logfile> # Warning messages
Key error patterns in AIStore:
- `x-get-batch.*out-of-bounds index`: batch GET lost objects during inter-target streaming
- `shared-dm.*terminated.*broken pipe`: inter-target data mover stream failure
- `shared-dm: xid.*not found, dropping recv`: objects dropped because batch job already aborted
- `resource pressure: load=critical`: target under disk/memory/CPU pressure
- `lcache.*hk.*dsk=critical`: disk at critical level, housekeeping skipped
- `gc:.*free mem` / `oom:`: memory pressure / forced GC

The proxy orchestrates batch GETs. Check proxy stats for:
- `err.get.n`: should be near zero; high = proxy-level routing failures
- `err.http.write.n`: proxy dropping client connections

If proxy errors are stable but target errors are spiking, the problem is target-side (disk, memory, inter-target networking).
x-get-batch failure mechanism

The typical AIStore batch GET failure chain:
1. Target under resource pressure (dsk=critical, mem=low)
2. Inter-target shared-dm streams break (broken pipe)
3. x-get-batch gets out-of-bounds index (recv'd len=0)
4. Batch job aborted, subsequent objects dropped (xid not found)
5. Client receives fewer objects than requested
6. Lhotse/client batch loader raises error (iterator exhausted prematurely)
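The first four stages of this chain each leave a recognizable trace in target logs, so their presence can be checked in one pass. A minimal sketch using the error patterns listed earlier; the stage labels and the function name chain_evidence are illustrative.

```shell
# Print "<stage> <matching_line_count>" for each storage-side stage of the
# batch GET failure chain, across the given AIS target log files.
chain_evidence() (
  report() {
    # $1 = stage label, $2 = extended regex, remaining arguments = log files
    label=$1
    pattern=$2
    shift 2
    n=$(cat "$@" | grep -cE "$pattern" || true)
    printf '%s %s\n' "$label" "$n"
  }
  report resource-pressure 'resource pressure: load=critical|lcache.*hk.*dsk=critical' "$@"
  report stream-broken 'shared-dm.*terminated.*broken pipe' "$@"
  report out-of-bounds 'x-get-batch.*out-of-bounds index' "$@"
  report dropped-recv 'shared-dm: xid.*not found, dropping recv' "$@"
)

# Example: chain_evidence extracted/*/aistarget.*
```

Non-zero counts for every stage are consistent with the chain above but do not prove it. The matching lines still need to be placed in time order and aligned with the client-side errors.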
Structure your output as:
Job Details — framework, scale, start time, data source
Timeline Table — chronological events with timestamps, source file:line
Root Cause Chain — numbered chain from trigger to final crash, with arrows
Key Files — which log files contain the critical evidence
Recommendations — actionable fixes (storage-side, client-side, training config)
Common root cause categories:
| Counter | Meaning |
|---|---|
| err.get.n | Individual object GET failures on this target |
| err.getbatch.n | Batch GET operation failures |
| err.http.write.n | Failed HTTP writes to client (connection dropped) |
| err.append.n | Append operation failures |
| err.put.n | PUT operation failures |
| err.head.n | HEAD (metadata) operation failures |
| err.lst.n | List operation failures |
| err.kalive.n | Keep-alive failures |
| getbatch.n | Total batch GET operations |
| getbatch.obj.n | Total objects served via batch GET |
| getbatch.throttle.ns | Time spent throttling batch GETs |
- `Watchdog caught collective operation timeout`: the NCCL watchdog detected a stuck collective
- `SeqNum=N, OpType=BROADCAST/ALLREDUCE`: which collective and sequence number
- `last enqueued work: N, last completed work: M`: work M completed, work M+1 is stuck
- `Timeout(ms)=1800000`: 30-minute timeout (default)
- `First PG on this rank to signal dumping`: THIS rank initiated the cascade
- `Observed flight recorder dump signal from another rank`: this rank is reacting to another's timeout
- `To avoid data inconsistency, we are taking the entire process down`: watchdog kills the process

The last enqueued work vs last completed work state is critical for determining whether the hang is on the CPU side (data loading, training loop) or the GPU side (NCCL communication):
- enqueued == completed (no pending ops): This rank has NO NCCL work in-flight. It is stuck in the CPU-side training loop (data loading, audio decoding, batch prep) and never entered the collective. This is the straggler, the rank that caused the hang.
- enqueued == completed + 1: The rank has submitted exactly one operation that hasn't completed. It entered the collective but can't complete because the straggler rank hasn't joined.
- enqueued > completed + 1: Multiple operations are queued. The CPU has progressed past the stuck point and submitted additional operations asynchronously (e.g., DDP gradient allreduce via hooks). Still waiting for the straggler.
- Rank 0 with higher enqueued/completed than others: Rank 0 (often the broadcast root) completed its side of a collective (send) but receivers can't complete (receive) because the straggler hasn't joined.

The key pattern for a data loading stall:
- One rank: enqueued == completed, active collectives: 0, "Observed flight recorder dump signal from another rank". This is the straggler.
- Other ranks: enqueued > completed, "failure detected by watchdog". These are waiting for the straggler.

The key pattern for a GPU fabric issue:
- All ranks: enqueued > completed (all entered the collective), and all show "First PG on this rank to signal dumping". There is no straggler; the collective itself is broken.

Compare enqueued counts across ALL ranks. Even ONE outlier changes the entire diagnosis.
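The per-rank states described above can be tabulated so that an outlier stands out. This is a sketch only: it assumes each line carries the rank as "Rank N" and the counters in the "last enqueued work: X, last completed work: Y" form (with an optional "NCCL" before "work"), which may not match every PyTorch version. classify_ranks and the state labels are illustrative.

```shell
# Print one line per distinct (rank, enqueued, completed) triple with a state label.
# ASSUMPTION: lines look like
#   "... Rank N ... last enqueued work: X, last completed work: Y ..."
# optionally with "Last" capitalized and "NCCL work" instead of "work".
classify_ranks() (
  cat "$@" |
    sed -n -E 's/.*Rank ([0-9]+).*[Ll]ast enqueued (NCCL )?work: ([0-9]+), [Ll]ast completed (NCCL )?work: ([0-9]+).*/\1 \3 \5/p' |
    sort -u | sort -n |
    awk '{
      pending = $2 - $3
      if (pending == 0) state = "idle-possible-straggler"
      else if (pending == 1) state = "one-op-pending"
      else state = "multiple-ops-pending"
      print "rank", $1, "enqueued", $2, "completed", $3, state
    }'
)

# Example: classify_ranks error-*.out
```

A single idle-possible-straggler row among many pending rows points to a CPU-side stall on that rank. If every row is pending, the fabric pattern is the more likely explanation.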
Common data loading issues that cause single-rank stalls:
- AudioSource._prepare_for_reading() calls f.read() with no Lhotse-level timeout. A stalled download blocks the DataLoader worker indefinitely.
- fault_tolerant=True silently drops failed audio: Failed audio files are skipped, reducing effective batch size per rank. Different ranks may have different failure rates depending on their shard assignments.
- .m4a files lose extension in BytesIO: When audio is downloaded from URLs and wrapped in BytesIO, the file extension is lost. Lhotse's CompositeAudioBackend cannot use the fast-path for m4a (TorchaudioFFMPEGBackend) and falls through to expensive cascading backend trials.
- Idle connection resets: AIStore closes idle HTTP connections after a server-side idle timeout (DfltMaxIdleTimeout). The Python SDK's urllib3 pool doesn't match this timeout, causing Connection reset by peer on stale pooled connections. These are caught and retried (always succeed on 1st retry). They are noise, NOT a root cause.
- Uneven shard assignment: shards are split across ranks as src[rank::world_size]. One rank may get shards with more corrupt files, larger audio, or slower storage targets.