Skip to main content
Ejecuta cualquier Skill en Manus
con un clic

hyperpod-node-debugger

// Diagnose and remediate per-node issues on a HyperPod cluster (EKS or Slurm) — a specific node is unhealthy, unresponsive, stuck, or needs replacing. Covers on-node EFA, GPU / accelerator hardware (XID, ECC, NVLink, row-remap, DCGM), Slurm node down/drained, disk and memory pressure, per-node lifecycle-script failures, SSM agent, container runtime, kernel panics, pod networking. Read-only. Not for cluster-wide provisioning (→ hyperpod-cluster-debugger), NCCL (→ hyperpod-nccl), or MFU (→ hyperpod-mfu-debugger).

$ git log --oneline --stat
stars:765
forks:107
updated:16 de mayo de 2026, 23:28
Explorador de archivos
7 archivos
SKILL.md
readonly