cluster-outlier-detection
// Statistical methods for identifying underperforming nodes from batch test results. Absolute thresholds, z-score, and MAD methods for fleet-wide GPU and NCCL analysis.
| name | cluster-outlier-detection |
| description | Statistical methods for identifying underperforming nodes from batch test results. Absolute thresholds, z-score, and MAD methods for fleet-wide GPU and NCCL analysis. |
Statistical methods for identifying underperforming nodes from batch test results.
After running fleet-wide tests (GPU GEMM, NCCL per-rack, thermal), you have a set of per-node or per-rack metrics. Outlier detection finds nodes that are degraded relative to their peers, even if their absolute values are technically within tolerance.
Compare each node's metric against a fixed threshold from the SKU baseline.
if metric < threshold:
    flag node
Pros: Simple, deterministic, directly actionable. Cons: Misses nodes that are degrading but not yet below the threshold. Does not adapt to fleet conditions.
Use the thresholds from sku_performance_baseline for pass/fail decisions.
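A minimal Python sketch of the fixed-threshold check described above. The node names, the busbw readings, and the 350.0 threshold are hypothetical placeholders for illustration; real thresholds would come from sku_performance_baseline.

```python
def flag_below_threshold(metrics, threshold):
    """Return the sorted names of nodes whose metric is below a fixed threshold.

    metrics: dict mapping node name -> measured value (higher is better).
    threshold: minimum acceptable value (hypothetical here, not a real SKU baseline).
    """
    return sorted(node for node, value in metrics.items() if value < threshold)


# Hypothetical per-node readings, for illustration only.
node_busbw = {
    "node-01": 372.0,
    "node-02": 368.5,
    "node-03": 341.2,
    "node-04": 370.1,
}

flagged = flag_below_threshold(node_busbw, threshold=350.0)
print(flagged)
```

Because the comparison is a strict `<`, a node sitting exactly on the threshold passes, matching the pseudocode above.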
Compute fleet mean and standard deviation, then flag nodes more than N standard deviations below the mean.
mean = average(all_node_metrics)
stdev = standard_deviation(all_node_metrics)
z_score = (node_metric - mean) / stdev
if z_score < -2.0:
    flag as outlier
| Z-score | Percentile | Action |
|---|---|---|
| < -1.5 | ~7th percentile | Monitor — performance is below peers |
| < -2.0 | ~2nd percentile | Investigate — likely degraded |
| < -3.0 | ~0.1th percentile | Drain — almost certainly hardware issue |
Pros: Adapts to actual fleet performance. Catches relative degradation. Cons: Requires enough data points (≥ 10 nodes). Sensitive to outliers in the dataset itself: one very bad node inflates stdev and pulls the mean down, which can hide other degraded nodes.
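The z-score method and the action table above could be sketched in Python as follows. This is an illustration under assumptions: the source does not say whether sample or population standard deviation is intended (the sketch uses the sample form via `statistics.stdev`), and the node names and values are made up.

```python
import statistics


def z_scores(metrics, min_nodes=10):
    """Return a z-score per node, computed against the fleet mean and stdev."""
    values = list(metrics.values())
    if len(values) < min_nodes:
        raise ValueError(f"need at least {min_nodes} nodes, got {len(values)}")
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample stdev (an assumption)
    if stdev == 0:
        # Every node reported the same value, so nothing stands out.
        return {node: 0.0 for node in metrics}
    return {node: (value - mean) / stdev for node, value in metrics.items()}


def classify(z):
    """Map a z-score to the action tiers from the table above."""
    if z < -3.0:
        return "drain"
    if z < -2.0:
        return "investigate"
    if z < -1.5:
        return "monitor"
    return "ok"


# Hypothetical fleet: eleven healthy nodes near 100 and one degraded node at 90.
readings = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 100, 90]
fleet = {f"node-{i:02d}": float(v) for i, v in enumerate(readings, start=1)}

scores = z_scores(fleet)
actions = {node: classify(z) for node, z in scores.items()}
print({node: action for node, action in actions.items() if action != "ok"})
```

In this sample the degraded node lands in the "investigate" tier rather than "drain", because its own reading inflates the standard deviation it is measured against.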
For small fleets or fleets with known bad nodes:
median = median(all_node_metrics)
MAD = median(|metric - median| for each node)
modified_z = 0.6745 * (node_metric - median) / MAD
if modified_z < -2.0:
    flag as outlier
MAD (Median Absolute Deviation) is less sensitive to extreme outliers than standard deviation.
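A minimal Python sketch of the modified z-score using MAD, following the formula above. The fleet data is hypothetical. The handling of MAD = 0 is also an assumption, since the source does not cover that case: MAD is zero when more than half of the nodes report the same value, and this sketch raises an error rather than divide by zero.

```python
import statistics


def modified_z_scores(metrics):
    """Return the MAD-based modified z-score for each node."""
    values = list(metrics.values())
    median = statistics.median(values)
    mad = statistics.median([abs(value - median) for value in values])
    if mad == 0:
        # Assumed behavior: refuse to score rather than divide by zero.
        raise ValueError("MAD is zero; modified z-score is undefined")
    return {
        node: 0.6745 * (value - median) / mad
        for node, value in metrics.items()
    }


# Hypothetical fleet: eleven healthy nodes near 100 and one degraded node at 90.
readings = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 100, 90]
fleet = {f"node-{i:02d}": float(v) for i, v in enumerate(readings, start=1)}

mz = modified_z_scores(fleet)
outliers = sorted(node for node, z in mz.items() if z < -2.0)
print(outliers)
```

Here the median is 100 and the MAD is 1, so the node at 90 scores 0.6745 × (−10) / 1 = −6.745. The bad reading does not shift the median or the MAD, so it stands out far more clearly than it would against a mean and standard deviation that it has itself distorted.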
Compare each node against the expected value for the SKU, expressed as percentage deviation.
deviation_pct = (expected - node_metric) / expected * 100
if deviation_pct > warn_pct:
    flag as warning (e.g., > 3.5%)
if deviation_pct > ghr_pct:
    flag for GHR (e.g., > 7%)
This is what the GPU GEMM analysis uses (see node_gpu_validation skill).
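The percentage-deviation check could look like the following Python sketch. The default `warn_pct` and `ghr_pct` values are taken from the examples above. The function name, the return labels, and the expected value of 1000.0 are hypothetical and do not come from the node_gpu_validation skill.

```python
def classify_deviation(node_metric, expected, warn_pct=3.5, ghr_pct=7.0):
    """Classify a node by how far it falls below the expected SKU value.

    Returns the most severe matching label: "ghr", "warn", or "ok".
    """
    deviation_pct = (expected - node_metric) / expected * 100
    if deviation_pct > ghr_pct:
        return "ghr"
    if deviation_pct > warn_pct:
        return "warn"
    return "ok"


# Hypothetical expected value and node readings, for illustration only.
expected_value = 1000.0
results = {
    "node-01": classify_deviation(980.0, expected_value),  # 2% below expected
    "node-02": classify_deviation(950.0, expected_value),  # 5% below expected
    "node-03": classify_deviation(900.0, expected_value),  # 10% below expected
}
print(results)
```

Unlike the pseudocode, which raises both flags for a node past the GHR limit, this sketch returns only the most severe label. A node that outperforms the expected value has a negative deviation and is reported as "ok".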
sku_performance_baseline.nccl_performance_diagnosis).

When presenting outlier results, include: