rack-topology
MNNVL domain discovery on Azure GB300 clusters. ClusterUUID lookup via nvidia-smi, expected rack sizes per SKU, FabricManager troubleshooting.
| Field | Value |
|---|---|
| name | rack-topology |
| description | MNNVL domain discovery on Azure GB300 clusters. ClusterUUID lookup via nvidia-smi, expected rack sizes per SKU, FabricManager troubleshooting. |
How MNNVL domains work on Azure GB300 clusters, how to discover rack membership, and expected rack structure per SKU.
Scripts: This skill references test scripts from the Azure/ai-infrastructure-on-azure repo. Clone it and run from the repo root.
On GB300 (NDv6) clusters, nodes within a physical rack are connected via NVSwitch/NVLink in an MNNVL (Multi-Node NVLink) domain. This gives intra-rack allreduce bandwidth above 900 GB/s, far higher than the ~200 GB/s available over InfiniBand between racks.
Each MNNVL domain has a unique ClusterUUID reported by nvidia-smi. All nodes sharing the same ClusterUUID are in the same physical rack and can use NVLink for communication.
The ClusterUUID is a UUID string (for example, a1b2c3d4-e5f6-7890-abcd-ef1234567890). To read it on a single node:
nvidia-smi -q | grep ClusterUUID
Output:
ClusterUUID : a1b2c3d4-e5f6-7890-abcd-ef1234567890
# From the scheduler node
parallel-ssh -H "ccw-gpu-1 ccw-gpu-2 ccw-gpu-3 ..." -t 15 -i \
"nvidia-smi -q 2>/dev/null | grep 'ClusterUUID' | head -1 | awk -F': ' '{print \$2}'"
Output:
[1] 14:23:45 [SUCCESS] ccw-gpu-1
a1b2c3d4-e5f6-7890-abcd-ef1234567890
[2] 14:23:45 [SUCCESS] ccw-gpu-2
a1b2c3d4-e5f6-7890-abcd-ef1234567890
[3] 14:23:46 [SUCCESS] ccw-gpu-19
b2c3d4e5-f6a7-8901-bcde-f12345678901
Group nodes by UUID to get rack membership.
Using Slurm hostlist expansion and parallel SSH:
# Get all nodes in the GPU partition
NODES=$(sinfo -p gpu -h -N -o '%N' | sort -u | tr '\n' ' ')
# Query ClusterUUID from all nodes
parallel-ssh -H "$NODES" -t 15 -i \
"nvidia-smi -q 2>/dev/null | grep 'ClusterUUID' | head -1 | awk -F': ' '{print \$2}'"
… nvlink_down.
After discovery, verify each rack has the expected number of nodes:
| SKU | Expected Rack Size |
|---|---|
| GB300 (NDv6) | 18 nodes |
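The size check can be scripted. This is a sketch only, assuming the discovery results have been reduced to one `node UUID` pair per line (an input format chosen here for illustration). `check_rack_sizes` is a hypothetical name, and the default of 18 comes from the table above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "node UUID" pairs on stdin and report every
# rack (UUID) whose node count differs from the expected size.
check_rack_sizes() {
  local expected="${1:-18}"   # expected nodes per rack; 18 for GB300 (NDv6)
  awk -v want="$expected" '
    NF >= 2 { count[$2]++ }   # tally nodes per ClusterUUID
    END {
      for (u in count)
        if (count[u] != want)
          printf "%s has %d of %d nodes\n", u, count[u], want
    }
  ' | sort
}

# Usage (pairs.txt is an assumed file of "node UUID" lines):
#   check_rack_sizes 18 < pairs.txt
```

An empty result means every discovered rack has the expected size. A rack that returned no UUIDs at all will not appear here, so it is worth comparing the number of distinct UUIDs with the number of racks you expect.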
If a rack has fewer than expected nodes:
- If nodes are in the idle or allocated state but didn't return a ClusterUUID, investigate those nodes.

Test each rack independently to validate intra-rack NVLink bandwidth:
# For each rack, run NCCL test on its nodes
./nccl_test.sh --sku graceblackwell -N 18 -w ccw-gpu-[1-18]
Expected busbw: ~937 GB/s at a 16 GB message size.
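With many racks, the per-rack invocations can be generated instead of typed by hand. The sketch below assumes a list of `node UUID` pairs, one per line, on stdin (an illustrative input format) and only prints the commands, so they can be reviewed before anything runs. `rack_test_commands` is a hypothetical name. The `nccl_test.sh` flags are the ones used in the example above, with a comma-separated node list passed to `-w`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "node UUID" pairs on stdin and print one
# nccl_test.sh command line per rack (dry run; nothing is executed).
rack_test_commands() {
  awk '
    NF >= 2 {
      n[$2]++                                              # nodes seen in this rack
      nodes[$2] = (n[$2] > 1) ? nodes[$2] "," $1 : $1      # comma-separated list
    }
    END {
      for (u in n)
        printf "./nccl_test.sh --sku graceblackwell -N %d -w %s\n", n[u], nodes[u]
    }
  ' | sort
}

# Usage (pairs.txt is an assumed file of "node UUID" lines):
#   rack_test_commands < pairs.txt
```

Printing rather than executing keeps the helper safe to run from the scheduler node. Pipe the output to `bash` only after checking that each rack's node count looks right.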
Pick one node from each rack and test across racks:
# One node per rack, testing IB fabric
./nccl_test.sh --sku graceblackwell -N 4 -w ccw-gpu-1,ccw-gpu-19,ccw-gpu-37,ccw-gpu-55
Use IB-only NCCL settings (disable MNNVL) for pure IB measurement.
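Choosing the representative nodes can also be automated. This is a minimal sketch, assuming `node UUID` pairs on stdin (an illustrative format). It keeps the first node seen for each ClusterUUID and joins them with commas for use with `-w`. `one_node_per_rack` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "node UUID" pairs on stdin and print a
# comma-separated list containing the first node seen in each rack.
one_node_per_rack() {
  awk 'NF >= 2 && !($2 in seen) { seen[$2] = 1; print $1 }' | paste -sd, -
}

# Usage (pairs.txt is an assumed file of "node UUID" lines):
#   one_node_per_rack < pairs.txt
```

Taking the first node per rack is arbitrary. If that node is suspect, reorder the input or filter it out first so the inter-rack result reflects the IB fabric and not one bad host.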
For training jobs, prefer allocating full racks (or multiples of racks) to maximize MNNVL utilization. Incomplete rack allocation wastes NVLink bandwidth and forces more traffic over IB.
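The arithmetic behind a rack-aligned request is simple enough to sketch. The helpers below assume the 18-node rack size from the table above. The names `racks_needed` and `rack_aligned_nodes` are hypothetical.

```bash
#!/usr/bin/env bash
# Number of racks needed to hold a job of the given node count (rounds up).
racks_needed() {
  local nodes="$1" rack_size="${2:-18}"
  echo $(( (nodes + rack_size - 1) / rack_size ))
}

# Node count rounded up to a whole number of racks.
rack_aligned_nodes() {
  local nodes="$1" rack_size="${2:-18}"
  echo $(( $(racks_needed "$nodes" "$rack_size") * rack_size ))
}

# Usage:
#   racks_needed 40         # racks a 40-node job would span
#   rack_aligned_nodes 40   # node count if the request is padded to full racks
```

This only does the counting. Placing the job onto specific racks still depends on how the scheduler is configured.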
NVLink/MNNVL requires NVIDIA FabricManager to be running:
systemctl status nvidia-fabricmanager
Healthy output includes Active: active (running).
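For scripted checks, the same test can be expressed as a small filter. This is a sketch that assumes the `systemctl status` output is piped in on stdin. `fm_is_healthy` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) only if the systemctl status text
# on stdin contains the healthy "Active: active (running)" marker.
fm_is_healthy() {
  grep -qF 'Active: active (running)'
}

# Usage:
#   systemctl status nvidia-fabricmanager | fm_is_healthy \
#     || echo "FabricManager is not running on $(hostname)"
```

On a single node, `systemctl is-active --quiet nvidia-fabricmanager` gives the same yes/no answer without parsing text. The filter form is mainly useful when the status output has already been collected from many nodes.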
Common FabricManager issues:
- … `nvlink_down`.
- Run `sudo systemctl restart nvidia-fabricmanager`. If it won't start, file a Guest Health Report (GHR).
- Run `dcgmi discovery -l | grep -i nvswitch` to check NVSwitch visibility.