nccl-performance-diagnosis
Analyze NCCL bandwidth results, scope intra-rack vs inter-rack failures, and use bisection algorithm to isolate bad nodes. GPU vs network root cause analysis.
| name | nccl-performance-diagnosis |
| description | Analyze NCCL bandwidth results, scope intra-rack vs inter-rack failures, and use bisection algorithm to isolate bad nodes. GPU vs network root cause analysis. |
How to analyze NCCL bandwidth results, identify what type of failure is occurring, and isolate the bad node(s).
Scripts: This skill references test scripts from the Azure/ai-infrastructure-on-azure repo. Clone it and run from the repo root.
When NCCL bandwidth is below the expected baseline, work through these levels:
If a per-rack NCCL test (using all nodes in one MNNVL domain) shows low bandwidth:
If cross-rack NCCL tests show low bandwidth:
`ib_link_validation` skill).

If all inter-node tests are fine but a single node shows issues:
Bisection isolates the bad node(s) from a failing group by repeatedly splitting and testing.
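The split-and-test loop can be sketched as follows. This is a minimal illustration, not the repo's implementation: `bisect_suspects`, `group_passes`, and `max_suspects` are hypothetical names, and `group_passes` stands in for submitting an NCCL test on a subset of nodes and comparing busbw against the baseline. The sketch assumes that a faulty node makes every group containing it fail.

```python
from typing import Callable, List, Sequence


def bisect_suspects(
    nodes: Sequence[str],
    group_passes: Callable[[List[str]], bool],
    max_suspects: int = 3,
) -> List[str]:
    """Narrow a failing node group down to small suspect sets.

    group_passes is a hypothetical callback: it should run an NCCL test on
    the given nodes and return True when busbw meets the baseline.
    """
    if max_suspects < 1:
        raise ValueError("max_suspects must be at least 1")
    group = list(nodes)
    # Small enough to test each node individually against a known-good node.
    if len(group) <= max_suspects:
        return group
    mid = len(group) // 2
    halves = [group[:mid], group[mid:]]
    failing = [half for half in halves if not group_passes(half)]
    if not failing:
        # Each half passes alone, so the fault only shows when the halves are
        # combined. Bisection cannot narrow this further.
        return group
    suspects: List[str] = []
    for half in failing:
        suspects.extend(bisect_suspects(half, group_passes, max_suspects))
    return suspects


# Simulated run: 16 nodes, one of which (n5) fails every test it joins.
nodes = [f"n{i}" for i in range(16)]
bad = {"n5"}
print(bisect_suspects(nodes, lambda group: not bad.intersection(group)))
```

Stopping at `max_suspects` instead of recursing down to a single node leaves the last step to individual tests against known-good nodes, which can run in parallel.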
When testing 2–3 suspects individually, pair each with a different known-good node and run all pairs as separate jobs simultaneously. This avoids serializing the final isolation step.
Example with 3 suspects (S1, S2, S3) and known-good nodes (G1, G2, G3):
Test 1: [S1, G1] → FAIL → S1 is bad
Test 2: [S2, G2] → PASS → S2 is good
Test 3: [S3, G3] → FAIL → S3 is bad
Important: Use a different good node for each pair to avoid the good node being a bottleneck or correlating failures.
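The pairing rule above can be sketched as a small helper. The function names and the shape of the results dictionary are hypothetical; submitting the pair jobs to Slurm is left out because it depends on the local test script.

```python
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def pair_suspects(suspects: Sequence[str], known_good: Sequence[str]) -> List[Pair]:
    """Assign each suspect its own known-good partner (no partner is reused)."""
    suspect_set = set(suspects)
    # Deduplicate while keeping order, and never use a suspect as a "good" node.
    good = [n for n in dict.fromkeys(known_good) if n not in suspect_set]
    if len(good) < len(suspects):
        raise ValueError("need one distinct known-good node per suspect")
    return list(zip(suspects, good))


def bad_nodes(pairs: Sequence[Pair], passed: Dict[Pair, bool]) -> List[str]:
    """A failing pair implicates its suspect, since the partner is known good."""
    return [suspect for suspect, partner in pairs if not passed[(suspect, partner)]]


pairs = pair_suspects(["S1", "S2", "S3"], ["G1", "G2", "G3"])
# Results as in the example above: pairs 1 and 3 fail, pair 2 passes.
results = {pairs[0]: False, pairs[1]: True, pairs[2]: False}
print(pairs)
print(bad_nodes(pairs, results))
```

Because every pair has its own partner, the three jobs share no nodes and can be submitted at the same time.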
Once the bad node is identified, determine whether the issue is GPU or network:
- `nvidia-smi nvlink -s` shows inactive or degraded NVLink connections.
- `dmesg` shows XID errors.
- `dcgmi diag -r 1` fails.
- `ibstat` shows a port down or in `Polling` state.
- `ib_link_validation` skill).
- `nvidia-smi -q` shows `ClusterUUID: 00000000-0000-0000-0000-000000000000` (NVLink fabric not initialized).
- `systemctl status nvidia-fabricmanager`.
- `nvidia-smi nvlink -e`.

| Pattern | Likely Cause |
|---|---|
| busbw ~50 % of expected | One bad node in a 2-node test |
| busbw ~0 | NCCL cannot communicate — IB link down or pkey issue |
| busbw normal at small sizes, drops at large sizes | Congestion or IB bandwidth limit |
| busbw varies across runs (±20 %) | Transient issue — noisy neighbor, thermal throttle, or IB congestion |
| All racks fail | Cluster-wide issue: check the switches or the subnet manager (SM) |
| One rack fails, others pass | Rack-level issue — NVSwitch, TOR switch, or power |
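The busbw rows of this table can be turned into a rough first-pass classifier. The sketch below is an assumption-laden illustration: the function name and every numeric cutoff other than the ±20 % run-to-run variation are hypothetical choices, and the "~50 %" row only applies to a 2-node test, which the caller has to know.

```python
import statistics
from typing import Optional, Sequence

# Hypothetical cutoffs; only the 20 % variation figure comes from the table.
NEAR_ZERO = 0.05          # below 5 % of expected counts as "busbw ~0"
HALF_BAND = (0.40, 0.60)  # 40-60 % of expected counts as "~50 %"
RUN_VARIATION = 0.20      # a run deviating more than 20 % from the mean
NORMAL = 0.90             # at or above 90 % of expected counts as normal

NO_COMM = "NCCL cannot communicate: IB link down or pkey issue"
TRANSIENT = "transient issue: noisy neighbor, thermal throttle, or IB congestion"
HALF = "one bad node in a 2-node test"
CONGESTION = "congestion or IB bandwidth limit"
OK = "within baseline"
UNKNOWN = "below baseline, no table pattern matched"


def classify_busbw(
    large_runs: Sequence[float],
    expected: float,
    small_sizes_normal: Optional[bool] = None,
) -> str:
    """Map busbw (GB/s) from repeated large-message runs to a table row."""
    if expected <= 0 or not large_runs:
        raise ValueError("need a positive baseline and at least one run")
    mean = statistics.mean(large_runs)
    ratio = mean / expected
    if ratio < NEAR_ZERO:
        return NO_COMM
    worst_deviation = max(abs(x - mean) for x in large_runs) / mean
    if worst_deviation > RUN_VARIATION:
        return TRANSIENT
    if HALF_BAND[0] <= ratio <= HALF_BAND[1]:
        return HALF
    if ratio >= NORMAL:
        return OK
    if small_sizes_normal:
        return CONGESTION
    return UNKNOWN


print(classify_busbw([200.0, 205.0, 198.0], expected=400.0))
```

The result is a hint for where to look next, not a diagnosis; the rack-level rows still require comparing results across racks.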
| Scenario | Test Approach |
|---|---|
| Initial validation of a new cluster | Full sweep (1K–16G) on full rack |
| Routine daily check | Quick check (16G, 10 iters) per rack |
| After node replacement | Quick check on affected rack |
| Investigating a user-reported slow job | Quick check on the job's nodelist |
| Bad rack found | Bisect within that rack |
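As an illustration of the two sweep shapes in this table, the sketch below builds `all_reduce_perf` argument lists. The `-b`, `-e`, `-f`, and `-n` options are standard nccl-tests flags (minimum size, maximum size, step factor, iterations). The scenario keys, the step factor of 2, and the idea of building the command in Python are assumptions; how `nccl_test.sh` accepts or forwards these arguments is not shown here.

```python
from typing import Dict, List

# Message-size arguments for the two test approaches named in the table.
SWEEPS: Dict[str, List[str]] = {
    # Full sweep from 1K to 16G; doubling the size each step is an assumption.
    "full": ["-b", "1K", "-e", "16G", "-f", "2"],
    # Quick check: a single 16G message size, 10 iterations.
    "quick": ["-b", "16G", "-e", "16G", "-n", "10"],
}

# Hypothetical scenario keys mirroring the table rows that name a sweep.
SCENARIO_TO_SWEEP: Dict[str, str] = {
    "new_cluster": "full",
    "daily_check": "quick",
    "node_replacement": "quick",
    "slow_job": "quick",
}


def all_reduce_perf_command(scenario: str, binary: str = "all_reduce_perf") -> List[str]:
    """Return the argv list for the sweep that matches a scenario."""
    if scenario not in SCENARIO_TO_SWEEP:
        raise ValueError(f"no sweep defined for scenario {scenario!r}")
    return [binary, *SWEEPS[SCENARIO_TO_SWEEP[scenario]]]


print(" ".join(all_reduce_perf_command("daily_check")))
```

The "Bad rack found" row is left out on purpose: bisection is a procedure over node subsets rather than a sweep shape.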
- `infrastructure_validations/slurm/NCCL/nccl_test.sh`
- `infrastructure_validations/slurm/NCCL/configs/`
- `infrastructure_validations/slurm/gpu_test/gpu_test.slurm`
- `ib_link_validation` skill
- `sku_performance_baseline` skill