Skip to main content
تشغيل أي مهارة في Manus
بنقرة واحدة

hyperpod-nccl

// Diagnose NCCL failures and adjacent training-pod failures on HyperPod GPU clusters (EKS or Slurm) — training hangs, AllReduce / collective-op timeouts, EFA or libfabric errors, rendezvous failures, EFA TCP fallback, /dev/shm or memlock issues, NCCL version mismatch across pods, container OOM / exit-137 / OOMKilled, GPU OOM (CUDA out of memory), CrashLoopBackOff / Pending pods, MASTER_ADDR DNS, NetworkPolicy blocking. Not for single-node hardware faults (→ hyperpod-node-debugger § G) or cluster-creation EFA / SSM failures (→ hyperpod-cluster-debugger § A / § F).

$ git log --oneline --stat
stars:٧٦٥
forks:١٠٧
updated:١٦ مايو ٢٠٢٦ في ٢٣:٢٨
مستكشف الملفات
6 ملفات
SKILL.md
readonly