$pwd:

hyperpod-node-debugger

// Diagnose and remediate per-node issues on a HyperPod cluster (EKS or Slurm) — a specific node is unhealthy, unresponsive, stuck, or needs replacing. Covers on-node EFA, GPU / accelerator hardware (XID, ECC, NVLink, row-remap, DCGM), Slurm node down/drained, disk and memory pressure, per-node lifecycle-script failures, SSM agent, container runtime, kernel panics, pod networking. Read-only. Not for cluster-wide provisioning (→ hyperpod-cluster-debugger), NCCL (→ hyperpod-nccl), or MFU (→ hyperpod-mfu-debugger).

$ git log --oneline --stat
stars:765
forks:107
updated: May 16, 2026, 23:28
File explorer
7 files
SKILL.md
readonly