hyperpod-node-debugger

// Diagnose and remediate per-node issues on a HyperPod cluster (EKS or Slurm) — a specific node is unhealthy, unresponsive, stuck, or needs replacing. Covers on-node EFA, GPU / accelerator hardware (XID, ECC, NVLink, row-remap, DCGM), Slurm node down/drained, disk and memory pressure, per-node lifecycle-script failures, SSM agent, container runtime, kernel panics, pod networking. Read-only. Not for cluster-wide provisioning (→ hyperpod-cluster-debugger), NCCL (→ hyperpod-nccl), or MFU (→ hyperpod-mfu-debugger).

$ git log --oneline --stat
stars:765
forks:107
updated: May 16, 2026 at 23:28
File explorer
7 files
SKILL.md
readonly