nccl-allreduce-test
Run NCCL all_reduce_perf bandwidth tests via Slurm, configure per-SKU environment variables (MNNVL, SHARP, GDR), and interpret busbw results.
| name | nccl-allreduce-test |
| description | Run NCCL all_reduce_perf bandwidth tests via Slurm, configure per-SKU environment variables (MNNVL, SHARP, GDR), and interpret busbw results. |
How to run NCCL all_reduce_perf bandwidth tests, configure environment variables per SKU, and interpret results.
Scripts: This skill references test scripts from the Azure/ai-infrastructure-on-azure repo. Clone it and run from the repo root.
The test binary is /opt/nccl-tests/build/all_reduce_perf, the standard all_reduce_perf build from nccl-tests. It measures collective bandwidth across GPUs and nodes.
The launcher script is at infrastructure_validations/slurm/NCCL/nccl_test.sh. It loads per-SKU configs and handles sbatch submission.

    cd infrastructure_validations/slurm/NCCL

    # Full sweep: GB300, 4 nodes
    ./nccl_test.sh --sku graceblackwell -N 4

    # Full sweep: H100, 8 nodes
    ./nccl_test.sh --sku hopper -N 8 -w ccw-gpu-[1-8]

    # Quick bandwidth check: large messages only, 10 iterations
    ./nccl_test.sh --sku graceblackwell --begin-size 16G --end-size 16G --iters 10 -N 18

    # Auto-detect SKU from nodelist
    ./nccl_test.sh -N 4 -w ccw-gpu-[1-4]

| Option | Default | Description |
|---|---|---|
| --sku NAME | auto-detect | Config name: graceblackwell or hopper |
| --begin-size SIZE | 1K | Start message size |
| --end-size SIZE | 16G | End message size |
| --iters N | nccl default | Iterations per message size |
| --check | off | Enable data correctness validation |
All other arguments pass through to sbatch (e.g., -N 4, -w nodelist).
Config file: configs/graceblackwell.conf

Key settings:

- Multi-node NVLink and NVLS enabled (NCCL_MNNVL_ENABLE=1, NCCL_NVLS_ENABLE=1)
- DMA-BUF enabled (NCCL_DMABUF_ENABLE=1)
- Shared-memory transport disabled (NCCL_SHM_DISABLE=1): NVLink is faster
- InfiniBand service level 1 (NCCL_IB_SL=1): required for Azure NDR fabric
- GPUDirect RDMA over C2C enabled (NCCL_NET_GDR_C2C=1)

Config file: configs/hopper.conf

Key settings:

- Topology file (NCCL_TOPO_FILE=/opt/microsoft/ndv5-topo.xml)
- PXN disabled (NCCL_PXN_DISABLE=1)
- Minimum of 32 channels (NCCL_MIN_NCHANNELS=32)
- CollNet enabled (NCCL_COLLNET_ENABLE=1)
- UCX restricted to the RC transport (UCX_TLS=rc)

Example all_reduce_perf output (time in microseconds, algbw and busbw in GB/s):

    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
               0             0     float     sum      -1     0.02    0.00    0.00      0     0.01    0.00    0.00      0
            1024           256     float     sum      -1    17.94    0.06    0.11      0    17.94    0.06    0.11      0
    ...
     17179869184    4294967296     float     sum      -1  18285.0  939.58  936.93      0  18292.6  939.19  936.54      0
    # Out of bounds values            : 0 OK
    # Avg bus bandwidth    : 487.265

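When many job logs need checking, it can help to reduce each one to a few summary numbers. The sketch below assumes the log follows the layout in the sample output above: 13 whitespace-separated columns per data row and a trailing `# Avg bus bandwidth` line. The function name `summarize_nccl_log` is hypothetical and is not part of the repo.

```shell
# Hypothetical helper (not part of nccl_test.sh): summarize one all_reduce_perf log.
# Prints the reported average busbw, the out-of-place busbw of the last data row
# (the largest message size in a sweep), and the total #wrong count.
summarize_nccl_log() {
  awk '
    # Summary line, e.g. "# Avg bus bandwidth    : 487.265"
    /^# Avg bus bandwidth/ { avg = $NF }
    # Data rows: 13 fields, the first one is the message size in bytes
    $1 ~ /^[0-9]+$/ && NF == 13 {
      last = $8            # out-of-place busbw of this row
      wrong += $9 + $13    # a non-numeric value such as N/A counts as 0
    }
    END { printf "avg_busbw=%s last_busbw=%s wrong=%d\n", avg, last, wrong }
  ' "$1"
}
```

Calling `summarize_nccl_log job.log` on a log like the sample (the file name is a placeholder) prints one line per job, which is easy to collect across a fleet and compare.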
| Mode | Begin | End | Iters | Duration | Purpose |
|---|---|---|---|---|---|
| Quick check | 16G | 16G | 10 | ~2 min | Validate peak bandwidth |
| Full sweep | 1K | 16G | default | ~15-30 min | Profile across all sizes, detect small-message regressions |
| Bisection test | 8G | 16G | 20 | ~5 min | Balance speed and confidence during fault isolation |
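The three modes in the table differ only in the size and iteration flags passed to nccl_test.sh, so they can be captured in a small lookup. This is a minimal sketch: the function name `nccl_mode_flags` and the mode keywords are hypothetical, while the flag values are taken from the table.

```shell
# Hypothetical helper: map a test mode from the table above to nccl_test.sh flags.
nccl_mode_flags() {
  case "$1" in
    quick)     echo "--begin-size 16G --end-size 16G --iters 10" ;;
    full)      echo "--begin-size 1K --end-size 16G" ;;   # iterations left at the default
    bisection) echo "--begin-size 8G --end-size 16G --iters 20" ;;
    *)         echo "unknown mode: $1" >&2; return 1 ;;
  esac
}
```

A bisection run on two nodes might then look like `./nccl_test.sh --sku hopper $(nccl_mode_flags bisection) -N 2 -w ccw-gpu-[1-2]`, where the node list is only an example. The command substitution is left unquoted on purpose so the flags split into separate arguments.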
See sku_performance_baseline skill for per-SKU busbw targets.
| Observation | What It Means |
|---|---|
| busbw near zero | NCCL could not establish communication — check IB links, pkeys |
| busbw < 50% of expected | Likely a bad node dragging down the collective |
| #wrong > 0 | Data corruption, indicating a hardware fault; file an Azure Guest Health Report (GHR) immediately |
| Job hangs (no output growth) | NCCL initialization stuck — likely a downed IB link or pkey mismatch |
| "NCCL WARN" in output about IB | IB fabric issue — check ibstat on all nodes |
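The numeric rows of the table can be turned into a first-pass triage check. The sketch below is an illustration under stated assumptions: the function name `classify_busbw` and its labels are hypothetical, and the "near zero" cutoff of 1% of the expected value is an assumed number, not one given by this skill. The expected busbw should come from the sku_performance_baseline skill.

```shell
# Hypothetical triage helper based on the table above.
# Usage: classify_busbw MEASURED_GBPS EXPECTED_GBPS WRONG_COUNT
# The 1% "near zero" cutoff is an assumption chosen for illustration.
classify_busbw() {
  awk -v m="$1" -v e="$2" -v w="$3" 'BEGIN {
    if (w > 0)             print "data-corruption"    # any #wrong > 0: file a GHR
    else if (m < 0.01 * e) print "no-communication"   # check IB links and pkeys
    else if (m < 0.50 * e) print "degraded"           # likely a bad node
    else                   print "ok"
  }'
}
```

For example, `classify_busbw 400 900 0` prints `degraded`; the 900 here is a placeholder, not a published target. Pass the large-message busbw, not the sweep average, because small message sizes pull the average down. Hangs and NCCL WARN messages still need to be checked in the job output, since they do not show up in these numbers.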