node-drain-and-replace
Slurm node lifecycle management: drain, undrain, reboot, and file for replacement. Decision tree for when to drain vs reboot vs GHR.
| name | node-drain-and-replace |
| description | Slurm node lifecycle management — drain, undrain, reboot, and file for replacement. Decision tree for when to drain vs reboot vs GHR. |
Slurm node lifecycle management: when and how to drain, undrain, reboot, and file for replacement.
| State | Meaning |
|---|---|
| idle | Available for jobs |
| allocated | Running a job |
| mixed | Some CPUs/GPUs allocated, some free |
| drained | Administratively removed from scheduling; no jobs running and no new jobs accepted |
| draining | Marked for drain but still running existing job(s); no new jobs accepted |
| down | Node is unavailable (unreachable or failed healthcheck) |
| down* | Node is down and not responding (the `*` suffix marks a non-responding node) |
sudo scontrol update NodeName=ccw-gpu-5 State=DRAIN Reason="IB_port_down_20250115"
Always include a dated reason. Format: <issue>_<YYYYMMDD>. This creates an audit trail so you know why nodes were drained and when.
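To keep reasons consistent, the string can be generated rather than typed by hand. The following is a minimal sketch, assuming a hypothetical helper named `drain_reason` (it is not part of Slurm); the date defaults to today when omitted.

```shell
# drain_reason is a hypothetical helper, not a Slurm command.
# It prints "<issue>_<YYYYMMDD>", replacing spaces in the issue with underscores.
drain_reason() {
    dr_issue=$(printf '%s' "$1" | tr ' ' '_')
    dr_day="${2:-$(date +%Y%m%d)}"
    printf '%s_%s\n' "$dr_issue" "$dr_day"
}

# Example (run on the scheduler):
#   sudo scontrol update NodeName=ccw-gpu-5 State=DRAIN Reason="$(drain_reason IB_port_down)"
```

Passing the date explicitly as a second argument is useful when backfilling a reason for an issue found on an earlier day.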
sudo scontrol update NodeName=ccw-gpu-[5-8] State=DRAIN Reason="NCCL_low_mnnvl_20250115"
sinfo -R
Return a drained node to service:
sudo scontrol update NodeName=ccw-gpu-5 State=RESUME Reason="fixed_after_reboot"
Critical: Actually run this command. Just saying the node is undrained doesn't make it so.
sinfo -N -n ccw-gpu-5 -o "%N %T"
Should show idle (or allocated if a job grabbed it immediately).
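The verification can also be scripted so that a failed undrain is caught automatically. This is a sketch under the assumption that the `sinfo` output keeps its header line and that `idle`, `allocated`, and `mixed` are the only states treated as "in service"; `node_in_service` is a hypothetical name.

```shell
# node_in_service is a hypothetical helper.
# stdin: output of `sinfo -N -n <node> -o "%N %T"` (header line plus one line per partition).
# Exit status 0 if any data line reports idle, allocated, or mixed; 1 otherwise.
node_in_service() {
    awk 'NR > 1 && ($2 == "idle" || $2 == "allocated" || $2 == "mixed") { ok = 1 }
         END { exit !ok }'
}

# Example:
#   sinfo -N -n ccw-gpu-5 -o "%N %T" | node_in_service || echo "ccw-gpu-5 is NOT back in service"
```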
Issue Detected
│
├─ FabricManager error / XID 79 / XID 95?
│ └─ YES → Drain → Collect metadata → File GHR (skip reboot)
│
├─ IB port down?
│ ├─ Try soft fix: sudo ip link set ibX up
│ ├─ If soft fix works → restart healthagent → verify → undrain
│ └─ If soft fix fails → reboot → check → if still down → Drain + GHR
│
├─ GPU performance degraded?
│ ├─ Re-test to confirm (not transient)
│ ├─ Check nvidia-smi -q for throttling, ECC errors
│ ├─ Run dcgmi diag -r 1 for quick validation
│ ├─ If persistent → reboot → re-test
│ └─ If still degraded after reboot → Drain + GHR
│
├─ NCCL bandwidth low (one rack)?
│ ├─ Bisect to find the bad node (see nccl_performance_diagnosis)
│ ├─ Drain the bad node
│ ├─ Investigate the bad node (GPU test, IB check, healthcheck)
│ └─ File GHR if issue persists after reboot
│
├─ Thermal test failure?
│ ├─ Reboot → re-test
│ └─ If still fails → Drain + GHR (category: gpu_throttling or dcgm_failure)
│
└─ Unknown / general issue?
  ├─ Run healthcheck: sudo /usr/bin/health
  ├─ Check dmesg for errors
  ├─ Reboot → re-check
  └─ If unresolved → Drain + GHR (category: HpcGenericFailure)
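The decision tree above can be encoded as a lookup so that runbooks or scripts print the same ordered steps every time. This is only a sketch: `triage_steps` and the issue keys (`xid79`, `ib-port-down`, and so on) are hypothetical labels chosen for illustration, and the step names merely summarize the branches of the tree.

```shell
# triage_steps is a hypothetical helper; the issue keys are illustrative labels,
# not identifiers used by Slurm, healthagent, or Azure.
# It prints the ordered steps from the decision tree for a given issue key.
triage_steps() {
    case "$1" in
        fabricmanager|xid79|xid95)
            # Reboot is skipped for these issues.
            echo "drain collect-metadata file-ghr" ;;
        ib-port-down)
            echo "soft-fix reboot-if-soft-fix-fails drain-and-ghr-if-still-down" ;;
        gpu-degraded)
            echo "retest check-nvidia-smi dcgmi-diag reboot-and-retest drain-and-ghr-if-still-degraded" ;;
        nccl-low-one-rack)
            echo "bisect drain-bad-node investigate ghr-if-persists-after-reboot" ;;
        thermal)
            echo "reboot-and-retest drain-and-ghr-if-still-failing" ;;
        *)
            # Unknown or general issue.
            echo "healthcheck check-dmesg reboot-and-recheck drain-and-ghr-if-unresolved" ;;
    esac
}

# Example:
#   triage_steps xid79
```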
# On the target node, save physical hostname and resource ID
# See azure_node_health_report skill for commands
Collect this metadata before rebooting. If you reboot first and the node doesn't come back, you won't have the data needed for a GHR.
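One way to enforce this ordering is a small guard that refuses to proceed until the metadata has been saved. The sketch below assumes a hypothetical file layout (one `physical_hostname=` line and one `resource_id=` line); neither the file format nor the `metadata_saved` name comes from the azure_node_health_report skill.

```shell
# metadata_saved is a hypothetical guard. It assumes the metadata was written to a
# file containing non-empty "physical_hostname=..." and "resource_id=..." lines.
# Exit status 0 only if the file exists, is non-empty, and has both values.
metadata_saved() {
    [ -s "$1" ] &&
        grep -q '^physical_hostname=.' "$1" &&
        grep -q '^resource_id=.' "$1"
}

# Example (the path is an assumption):
#   metadata_saved /tmp/ccw-gpu-5.meta && ssh ccw-gpu-5 'sudo reboot'
```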
# From scheduler, via SSH to the node
ssh ccw-gpu-5 'sudo reboot'
Poll until the node is reachable (typically 2–3 minutes):
# Simple poll loop
for i in $(seq 1 20); do
ssh -o ConnectTimeout=5 ccw-gpu-5 uptime 2>/dev/null && break
echo "Waiting... ($i)"
sleep 15
done
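The loop above exits silently if the node never comes back. A variant that returns a non-zero status on timeout makes that case explicit. This is a sketch; `wait_for` is a hypothetical helper that takes the attempt count, the delay in seconds, and the command to retry.

```shell
# wait_for is a hypothetical helper: wait_for <attempts> <delay_seconds> <command...>
# It runs the command until it succeeds, returning 0 on success or 1 once all
# attempts are used up. Progress messages go to stderr.
wait_for() {
    wf_attempts="$1"
    wf_delay="$2"
    shift 2
    wf_i=1
    while [ "$wf_i" -le "$wf_attempts" ]; do
        "$@" >/dev/null 2>&1 && return 0
        echo "Waiting... ($wf_i/$wf_attempts)" >&2
        sleep "$wf_delay"
        wf_i=$((wf_i + 1))
    done
    return 1
}

# Example: 20 attempts, 15 seconds apart, matching the loop above.
#   wait_for 20 15 ssh -o ConnectTimeout=5 ccw-gpu-5 uptime \
#       || echo "ccw-gpu-5 did not come back after reboot"
```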
# Check healthagent
ssh ccw-gpu-5 'sudo /usr/bin/health'
# Check IB interfaces directly (healthagent may have stale data)
ssh ccw-gpu-5 'for i in ib0 ib1 ib2 ib3; do echo "$i: $(cat /sys/class/net/$i/operstate 2>/dev/null || echo missing)"; done'
# Check GPUs
ssh ccw-gpu-5 'nvidia-smi -L'
# Check NVLink
ssh ccw-gpu-5 'nvidia-smi nvlink -s 2>&1 | head -20'
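The IB check prints one `name: state` line per interface, which is easy to filter for problems. A minimal sketch, assuming that output format; `ib_ports_not_up` is a hypothetical name.

```shell
# ib_ports_not_up is a hypothetical helper.
# stdin: lines of the form "ib0: up", as printed by the operstate loop above.
# stdout: the names of interfaces whose state is anything other than "up"
#         (for example "down" or "missing").
ib_ports_not_up() {
    awk -F': ' 'NF && $2 != "up" { print $1 }'
}

# Example:
#   ssh ccw-gpu-5 'for i in ib0 ib1 ib2 ib3; do echo "$i: $(cat /sys/class/net/$i/operstate 2>/dev/null || echo missing)"; done' \
#       | ib_ports_not_up
```

Empty output means every listed interface reported `up`.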
If the direct checks above show everything OK but healthagent still reports a failure, restart healthagent and re-run the healthcheck:
ssh ccw-gpu-5 'sudo systemctl restart healthagent && sleep 5 && sudo /usr/bin/health'
When Azure processes a GHR and replaces/repairs the physical hardware:
After bisection identifies a rack-level issue:
# Node list for the affected rack (nodes sharing a ClusterUUID); set by hand here
RACK_NODES="ccw-gpu-[1-18]"
sudo scontrol update NodeName="$RACK_NODES" State=DRAIN Reason="rack_nvswitch_failure_20250115"
sudo scontrol update NodeName=ccw-gpu-[1-18] State=RESUME Reason="validated_after_repair"
sinfo -t drain,drained -N -o "%N %T %E"
sinfo -p gpu -N -h -o "%T" | sort | uniq -c | sort -rn
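For a single-line fleet summary, per-node states can be reduced to in-service and unavailable counts. This sketch assumes one state per line (for example from `sinfo -N -h -o "%T"`) and counts only exact `idle`, `allocated`, and `mixed` as in service; any other state, including flagged ones such as `down*`, counts as unavailable. `fleet_summary` is a hypothetical name.

```shell
# fleet_summary is a hypothetical helper.
# stdin: one node state per line.
# stdout: "in_service=<n> unavailable=<n> total=<n>"
fleet_summary() {
    awk '
        NF { total++ }
        $1 == "idle" || $1 == "allocated" || $1 == "mixed" { ok++ }
        END { printf "in_service=%d unavailable=%d total=%d\n", ok, total - ok, total }
    '
}

# Example:
#   sinfo -p gpu -N -h -o "%T" | fleet_summary
```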