Name: Hyperpod Nccl
Author: awslabs

// Diagnose NCCL failures and adjacent training-pod failures on HyperPod GPU clusters (EKS or Slurm) — training hangs, AllReduce / collective-op timeouts, EFA or libfabric errors, rendezvous failures, EFA TCP fallback, /dev/shm or memlock issues, NCCL version mismatch across pods, container OOM / exit-137 / OOMKilled, GPU OOM (CUDA out of memory), CrashLoopBackOff / Pending pods, MASTER_ADDR DNS, NetworkPolicy blocking. Not for single-node hardware faults (→ hyperpod-node-debugger § G) or cluster-creation EFA / SSM failures (→ hyperpod-cluster-debugger § A / § F).

stars:765

forks:107

updated:May 16, 2026 at 23:28

related-skills.json

same repository

model-evaluation.md

from "awslabs/agent-plugins"

Generates python code that evaluates SageMaker models. Supports two evaluation types: LLM-as-Judge and Custom Scorer. Use when the user says "evaluate my model", "test model performance", "how did my model perform", "compare models", or other similar requests.

2026-05-26 (765 stars)

hyperpod-cluster-debugger.md

from "awslabs/agent-plugins"

Diagnose and remediate cluster-wide HyperPod (EKS or Slurm) problems — creation / deployment failures (CloudFormation, EFA health check, lifecycle scripts, capacity), EKS access, node replacement, CloudFormation nested-stack errors, post-maintenance rollback state, dangling nodes, autoscaler conflicts. Includes `--validate` pre-flight. Read-only.

2026-05-16 (765 stars)

hyperpod-node-debugger.md

from "awslabs/agent-plugins"

Diagnose and remediate per-node issues on a HyperPod cluster (EKS or Slurm) — a specific node is unhealthy, unresponsive, stuck, or needs replacing. Covers on-node EFA, GPU / accelerator hardware (XID, ECC, NVLink, row-remap, DCGM), Slurm node down/drained, disk and memory pressure, per-node lifecycle-script failures, SSM agent, container runtime, kernel panics, pod networking. Read-only. Not for cluster-wide provisioning (→ hyperpod-cluster-debugger), NCCL (→ hyperpod-nccl), or MFU (→ hyperpod-mfu-debugger).

2026-05-16 (765 stars)

hyperpod-performance-debugger.md

from "awslabs/agent-plugins"

Diagnose performance issues on Amazon SageMaker HyperPod clusters — uneven NCCL bandwidth across nodes and poor filesystem throughput. Read-only. Surfaces host-side signals (Xid, ECC, NVLink, EFA reachability, FSx saturation) and routes to the appropriate sibling skill (hyperpod-node-debugger, hyperpod-nccl, hyperpod-version-checker, hyperpod-issue-report) for any remediation. Triggers on uneven NCCL across nodes, straggler node, FSx slow, checkpoint slow, dataloader slow, filesystem bottleneck, FSx throughput, cross-AZ latency, topology mismatch.

2026-05-16 (765 stars)

hyperpod-slurm-debugger.md

from "awslabs/agent-plugins"

Diagnostic-only skill for Slurm scheduler and node-daemon issues on Amazon SageMaker HyperPod Slurm clusters. Scope mirrors the HyperPod troubleshooting guide. Invoke when the user reports a Slurm node stuck in down/drain, "Node unexpectedly rebooted" after auto-repair, slurmd not running, jobs stuck PENDING with REASON=Resources while sinfo shows idle nodes, jobs stuck COMPLETING after node replacement, GRES/GPU counts wrong, scontrol ping failing, slurmctld unresponsive, an Action:Reboot/Replace request that did not trigger HyperPod auto-recovery, or auto-resume not restarting a job. Also triggers on "drain before reboot", "diagnose a Slurm node", "investigate stuck jobs."

2026-05-16 (765 stars)

hyperpod-ssm.md

from "awslabs/agent-plugins"

Remote command execution and file transfer on SageMaker HyperPod cluster nodes via AWS Systems Manager (SSM). This is the primary interface for accessing HyperPod nodes — direct SSH is not available. Use when any skill, workflow, or user request needs to execute commands on cluster nodes, upload files to nodes, read/download files from nodes, run diagnostics, install packages, or perform any operation requiring shell access to HyperPod instances. Other HyperPod skills depend on this skill for all node-level operations.

2026-05-16 (765 stars)

package.json

"author": "awslabs"

"repository": "awslabs/agent-plugins"

name	hyperpod-nccl
description	Diagnose NCCL failures and adjacent training-pod failures on HyperPod GPU clusters (EKS or Slurm) — training hangs, AllReduce / collective-op timeouts, EFA or libfabric errors, rendezvous failures, EFA TCP fallback, /dev/shm or memlock issues, NCCL version mismatch across pods, container OOM / exit-137 / OOMKilled, GPU OOM (CUDA out of memory), CrashLoopBackOff / Pending pods, MASTER_ADDR DNS, NetworkPolicy blocking. Not for single-node hardware faults (→ hyperpod-node-debugger § G) or cluster-creation EFA / SSM failures (→ hyperpod-cluster-debugger § A / § F).
metadata	{"version":"0.0.1"}

HyperPod NCCL Debugger

Operating policy. Run read-only diagnostics yourself. Never run a command that changes cluster, node, or workload state; present each such command as a Suggested command (run this yourself) block and wait for the customer to run it. Escalate in order of increasing destructiveness: investigate → reboot → replace. Replace destroys the root and secondary volumes and is not supported on Slurm controller nodes. Never discard training state on speculation.

Diagnose NCCL failures on SageMaker HyperPod (EKS and Slurm). scripts/nccl-diagnose.sh reads state via AWS APIs, kubectl, and SSM, then prints each issue as [FAIL] ... → references/<file>.md § <section>. Read-only.

Signal sourcing: list-cluster-events carries infrastructure-level state only (lifecycle, bootstrap, EFA health check, capacity, replacement, reboot, AMI rollback). It does not carry NCCL timeouts, GPU XID/ECC, or per-pod training signals — those come from pod logs, CloudWatch training streams, on-node SSM probes, and NCCL env audit. "No events" on a training-time NCCL issue is expected, not a clean bill of health.

Workflow

1. Collect cluster name, region, namespace/job (EKS), exact NCCL error string.
2. Run the diagnostic (always — the output drives everything else).
3. For every [FAIL] line, Read the referenced section.
4. Present finding, root cause, and the Suggested-command block with concrete values (instance IDs, SG IDs, namespaces) filled in from the script output. Wait for customer approval.
5. Re-run the diagnostic to confirm.

If a finding has no matching section, report it as a bug — do not invent a fix.

Step 1: Authenticate kubectl (EKS)

EKS_ARN=$(aws sagemaker describe-cluster --cluster-name <HYPERPOD-NAME> --region <REGION> \
  --query 'Orchestrator.Eks.ClusterArn' --output text)
EKS_NAME=$(echo "$EKS_ARN" | awk -F'/' '{print $NF}')
aws eks update-kubeconfig --name "$EKS_NAME" --region <REGION>
kubectl get nodes
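The ARN-to-name step above can also be written with shell parameter expansion and a guard for the case where `--output text` prints `None` (which the AWS CLI does when the queried field is null, for example on a Slurm-orchestrated cluster). This is a minimal sketch; the function name `arn_resource_name` and the sample ARN are hypothetical and not part of the skill's scripts.

```shell
#!/usr/bin/env bash
# Sketch: extract the trailing resource name from an ARN.
# arn_resource_name is a hypothetical helper, not part of nccl-diagnose.sh.
arn_resource_name() {
  local arn="$1"
  # "None" is what the AWS CLI prints for a null value with --output text.
  if [ -z "$arn" ] || [ "$arn" = "None" ]; then
    echo "no EKS orchestrator ARN returned" >&2
    return 1
  fi
  # Strip everything up to and including the last '/'.
  printf '%s\n' "${arn##*/}"
}

# Illustrative ARN only (made-up account ID and cluster name).
arn_resource_name "arn:aws:eks:us-west-2:111122223333:cluster/my-eks-cluster"
# prints: my-eks-cluster
```

If the helper returns non-zero, skipping the kubeconfig step and letting the diagnostic auto-detect the orchestrator is probably the safer path.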

Step 2: Run the diagnostic

# Basic:
bash scripts/nccl-diagnose.sh --cluster <HYPERPOD-NAME> --region <REGION>

# Scope to an EKS job/namespace:
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --namespace <NS> --job <JOB>

# Force orchestrator:
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --orchestrator slurm

# Larger hardware sample (default 3):
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --sample-nodes 10

# Specific node only:
bash scripts/nccl-diagnose.sh --cluster <NAME> --region <REGION> --node i-0abc123def456

Tags: [PASS] · [FAIL] (counted in Issues Found, has reference pointer) · [WARN] · [INFO]. Priorities: P0 blocks training · P1 degraded · P2 informational.
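For manual triage it can help to pull only the reference pointers out of a saved run. The sketch below assumes each failing line ends with `→ references/<file>.md § <section>`, as described in the introduction; the sample lines and the helper names `fail_refs` and `fail_count` are made up for illustration and are not real script output.

```shell
#!/usr/bin/env bash
# Sketch: list the reference pointer for every [FAIL] line read from stdin.
# Assumes the "[FAIL] ... → references/<file>.md § <section>" shape.
fail_refs() {
  grep -F '[FAIL]' | sed -n 's|.*→ *references/\(.*\)$|\1|p'
}

# Count [FAIL] lines; "|| true" keeps a zero count from looking like an error.
fail_count() {
  grep -cF '[FAIL]' || true
}

# Hypothetical sample output, for illustration only.
sample='[PASS] EFA devices present
[FAIL] P0 SG missing self-reference → references/operations.md § 8
[WARN] CloudWatch log group empty
[FAIL] P1 NCCL version mismatch across pods → references/debugging-guide.md § 10'

printf '%s\n' "$sample" | fail_refs
# prints:
#   operations.md § 8
#   debugging-guide.md § 10
printf '%s\n' "$sample" | fail_count
# prints: 2
```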

Remediation index

Each [FAIL] line in the script already points directly at the right section. This table is a lookup for manual triage.

| Finding | Section |
| --- | --- |
| SG missing inbound/outbound self-reference | operations.md § 8 |
| Blocking NetworkPolicy / allow-all missing | operations.md § 8 |
| Slurm node DOWN / DRAINING / RemoveIPC | operations.md § 7 |
| GPU XID / SYSTEM_ERROR / hardware fault | hyperpod-node-debugger § F / § G |
| GPU row-remap / DCGM Fail / silent NaNs | hyperpod-node-debugger § G.1.a/b |
| NCCL timeout / rendezvous / straggler | debugging-guide.md § 1 |
| EFA configuration / not used | debugging-guide.md § 6 |
| EFA TCP fallback (`NET/OFI Using TCP`) | debugging-guide.md § 13 |
| NCCL version mismatch across pods | debugging-guide.md § 10 |
| Container OOM (pod killed, exit 137) | debugging-guide.md § 4 |
| GPU OOM (`CUDA out of memory`) | debugging-guide.md § 11 |
| RDMA memlock / `/dev/shm` too small | debugging-guide.md § 17 |
| MASTER_ADDR DNS / headless Service | debugging-guide.md § 12 |
| NVLS / PXN / topology tuning | debugging-guide.md § 19 |
| Any NCCL / EFA / rendezvous log pattern | error-patterns-quick-ref.md |
| Performance / nccl-tests / bandwidth | performance-testing.md |

Prerequisites

aws CLI v2.13+ authenticated (aws sts get-caller-identity)
jq, python3, bash 4.2+
unbuffer (from the expect package: yum install expect / apt install expect)
kubectl authenticated to the EKS cluster (K8s checks skipped if absent)
session-manager-plugin for on-node hardware checks

Defaults

Region — required: pass --region or set $AWS_DEFAULT_REGION.
Orchestrator — auto-detected; override with --orchestrator eks|slurm.
Namespace / job (EKS) — all namespaces; scope with --namespace <NS> --job <JOB>.
Hardware sampling — 3 nodes over SSM by default (--sample-nodes is capped at 50); use --node <ID> to probe a specific node. Node probes run serially with a timeout of 180 s per node, so --sample-nodes 10 can take up to ~30 min.
CloudWatch window — last 2 hours.
Colors — auto-disabled on non-TTY or TERM=dumb.
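The hardware-sampling arithmetic above can be checked before choosing a sample size. This is a rough sketch that assumes the 180 s figure is a per-node upper bound and that requests above 50 are clamped to 50; `probe_budget_seconds` is a hypothetical helper, not a function in the script.

```shell
#!/usr/bin/env bash
# Sketch: worst-case wall-clock budget for serial SSM node probes.
# Assumes 180 s per node and a cap of 50 sampled nodes, per the defaults above.
probe_budget_seconds() {
  local requested="$1" per_node=180 cap=50
  # Clamp the request to the cap before multiplying.
  local n=$(( requested > cap ? cap : requested ))
  echo $(( n * per_node ))
}

probe_budget_seconds 3    # prints: 540   (9 min, the default sample)
probe_budget_seconds 10   # prints: 1800  (30 min)
probe_budget_seconds 100  # prints: 9000  (clamped to 50 nodes, 150 min)
```

Healthy nodes usually answer well inside the timeout, so these numbers are an upper bound rather than an estimate.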

Error handling

| Failure | Script behavior | Tell the customer |
| --- | --- | --- |
| `aws sts get-caller-identity` fails | Exit 1 with the AWS error | "Fix AWS credentials and rerun." |
| `describe-cluster` AccessDenied | Warn, add `Missing IAM for sagemaker:DescribeCluster` | "Grant `sagemaker:DescribeCluster` (operations.md § 2)." |
| Cluster not found | Exit 1 after listing region's clusters | "Confirm HyperPod cluster name and region." |
| `kubectl` absent / unauthenticated | Warn, skip K8s checks | "`aws eks update-kubeconfig --name <EKS> --region <R>`." |
| SSM plugin absent | Warn, skip on-node hardware checks | "Install session-manager-plugin." |
| SSM times out (180s) | Partial output, mark node unreachable | "Rerun with `--node <ID> --sample-nodes 1`; check SSM agent on the node." |
| CloudWatch log group not found | Skip CloudWatch scan | "Enable CloudWatch on the cluster (operations.md § 4)." |
| Cluster events API throttled | Warn, continue with partial data | "Rerun later — script is idempotent." |

Exit codes: 0 diagnostic complete · 1 fatal prerequisite missing or cluster unreachable.
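Because exit code 0 means only that the diagnostic finished, not that it found nothing, a wrapper should treat 0 as "go read the [FAIL] lines". The following is a small sketch of that handling; `run_and_report` is a hypothetical wrapper, and `true` stands in for the real script invocation so the example runs anywhere.

```shell
#!/usr/bin/env bash
# Sketch: interpret the diagnostic's exit code as documented above.
# Pass the diagnostic command as arguments; any command works here.
run_and_report() {
  local rc=0
  "$@" || rc=$?
  case "$rc" in
    0) echo "diagnostic complete; review [FAIL] lines" ;;
    1) echo "fatal: prerequisite missing or cluster unreachable" ;;
    *) echo "unexpected exit code $rc" ;;
  esac
  return "$rc"
}

# Stand-in command; in practice this would be the nccl-diagnose.sh invocation.
run_and_report true
# prints: diagnostic complete; review [FAIL] lines
```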

IAM permissions

Full policy + RBAC in operations.md § 2. SSM on HyperPod uses start-session against sagemaker-cluster:<cluster-id>_<group>-<iid> targets — grant ssm:StartSession / ssm:TerminateSession, not ssm:SendCommand.
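The target format above can be assembled from values the diagnostic already prints. This sketch assumes the cluster ID is the last path segment of the HyperPod cluster ARN; `ssm_target` and every value passed to it are hypothetical and shown for illustration.

```shell
#!/usr/bin/env bash
# Sketch: build the SSM target string sagemaker-cluster:<cluster-id>_<group>-<iid>.
# Assumes <cluster-id> is the final segment of the cluster ARN.
ssm_target() {
  local cluster_arn="$1" group="$2" instance_id="$3"
  local cluster_id="${cluster_arn##*/}"
  printf 'sagemaker-cluster:%s_%s-%s\n' "$cluster_id" "$group" "$instance_id"
}

# Made-up ARN, instance group, and instance ID.
ssm_target "arn:aws:sagemaker:us-west-2:111122223333:cluster/abc123xyz" \
  "worker-group" "i-0abc123def456"
# prints: sagemaker-cluster:abc123xyz_worker-group-i-0abc123def456
```

The resulting string is what `aws ssm start-session --target` expects; opening the session itself is left to hyperpod-ssm.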

Scale strategy

| Scope | Method | Coverage |
| --- | --- | --- |
| All nodes | `sagemaker:ListClusterNodes` (paginated) | 100% nodes |
| All K8s objects | `kubectl` | 100% pods/nodes/policies |
| Hardware | SSM `--sample-nodes N` (default 3) | Sampled |
| Node logs | CloudWatch | 100% nodes |

Large clusters: the PyTorch NCCL backend defaults to a 10-minute collective-op timeout (per the PyTorch distributed docs). Large clusters routinely exceed that on first rendezvous; raise it via torch.distributed.init_process_group(timeout=timedelta(seconds=<N>)). HyperPod support has also observed NCCL topology-graph-search hangs on 256+ node clusters when memlock is unlimited; using a large fixed memlock (e.g. 8388608) in pod securityContext or /etc/security/limits.conf has cleared these in field cases. This memlock pattern is a field observation, not AWS- or NCCL-documented behavior.
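Before considering the memlock workaround, it is worth confirming, read-only, what limit a process actually sees. The sketch below classifies the value reported by `ulimit -l`, which bash prints either as `unlimited` or as a number in KiB; `classify_memlock` is a hypothetical helper, and since the memlock pattern is a field observation the output is a data point rather than a verdict.

```shell
#!/usr/bin/env bash
# Sketch: classify the locked-memory limit seen by the current shell.
# Read-only; it does not change any limit.
classify_memlock() {
  local value="$1"
  case "$value" in
    unlimited)   echo "unlimited" ;;
    ''|*[!0-9]*) echo "unknown" ;;          # empty or non-numeric
    *)           echo "fixed:${value}" ;;   # numeric limit as reported
  esac
}

# Classify the limit of the shell running this snippet.
classify_memlock "$(ulimit -l)"

# Fixed inputs, for illustration.
classify_memlock unlimited   # prints: unlimited
classify_memlock 8388608     # prints: fixed:8388608
```

Run inside a training container (for example through a shell on one pod) so the value reflects the workload's limit, not the host's.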

For FSDP, DeepSpeed, or Megatron-LM tuning: debugging-guide.md § 18.

Skill delegation

| Need | Use |
| --- | --- |
| Cluster creation / deployment failures | `hyperpod-cluster-debugger` (§ A / B / C / H + `--validate`) |
| Post-deployment cluster-wide management | `hyperpod-cluster-debugger` |
| Per-node issues (disk, lifecycle, hardware) | `hyperpod-node-debugger` |
| Trainium/Inferentia collective-comm (AWS Neuron Collectives, not NCCL) | `hyperpod-node-debugger` § G.2 |
| Shell on nodes | `hyperpod-ssm` |
| Version comparison across nodes | `hyperpod-version-checker` |
| Diagnostic bundle for AWS Support | `hyperpod-issue-report` |
| MFU / performance degradation | `hyperpod-mfu-debugger` |

Escalate to AWS Support

Escalate when:

All SG rules correct, EFA verified on-node, but NCCL still times out.
Hardware checks pass on all nodes but AllReduce still hangs.
Issues Found: 0 but training still fails.
GPU XID errors persist after node replacement.
Collective-op timeout raised and memlock workaround applied but large-cluster rendezvous still hangs.

Before opening the case

# 1. Cluster identity + status
aws sagemaker describe-cluster --cluster-name <C> --region <R>

# 2. Full NCCL diagnostic (sample more nodes for escalation)
bash scripts/nccl-diagnose.sh --cluster <C> --region <R> --sample-nodes 10 > nccl-diag.txt

# 3. Per-node log/config bundle to S3 (delegates to hyperpod-issue-report)
#    See skills/hyperpod-issue-report/SKILL.md for the exact invocation.

Include in the case

Cluster name + ARN and AWS region
Orchestrator (EKS or Slurm) and EKS cluster name / Slurm controller node
Timestamp window (UTC start / end) of the failure
Exact NCCL / libfabric error strings (copy verbatim from pod logs or journalctl)
Affected instance IDs / node names / pod names / namespace / job name
nccl-diag.txt from step 2 above
S3 URI of the hyperpod-issue-report bundle from step 3
NCCL env vars in effect (printenv | grep -E '^NCCL|^FI_|^TORCH_' from one pod)
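The environment capture in the last item can be wrapped so the result is sorted and easy to diff between pods. This is a minimal sketch using the same pattern as above; `filter_comm_env` is a hypothetical name, and the sample variables are fed in by hand so the example runs without a cluster.

```shell
#!/usr/bin/env bash
# Sketch: keep only NCCL / libfabric / PyTorch variables from "printenv" output.
# Reads KEY=VALUE lines on stdin and prints the matching ones, sorted.
filter_comm_env() {
  grep -E '^NCCL|^FI_|^TORCH_' | sort
}

# Hand-written sample environment, for illustration only.
printf '%s\n' \
  'PATH=/usr/bin' \
  'NCCL_DEBUG=INFO' \
  'FI_PROVIDER=efa' \
  'TORCH_NCCL_ASYNC_ERROR_HANDLING=1' | filter_comm_env
# prints:
#   FI_PROVIDER=efa
#   NCCL_DEBUG=INFO
#   TORCH_NCCL_ASYNC_ERROR_HANDLING=1
```

Inside a pod the same filter would be fed from `printenv`, with the output redirected to a file attached to the case.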

References

error-patterns-quick-ref.md — log pattern → code → fix table
debugging-guide.md — per-scenario procedures (21 sections incl. NVLS/PXN/topology)
performance-testing.md — nccl-tests, bandwidth thresholds, straggler detection
operations.md — IAM, SSM format, CloudWatch, env-var reference, node labels, Slurm ops, remediations
