Ejecuta cualquier Skill en Manus
con un clic

Ejecuta cualquier Skill en Manus con un clic

$pwd:

hyperpod-cluster-debugger

Name: Hyperpod Cluster Debugger
Author: awslabs

// Diagnose and remediate cluster-wide HyperPod (EKS or Slurm) problems — creation / deployment failures (CloudFormation, EFA health check, lifecycle scripts, capacity), EKS access, node replacement, CloudFormation nested-stack errors, post-maintenance rollback state, dangling nodes, autoscaler conflicts. Includes `--validate` pre-flight. Read-only.

Ejecutar en Manus

$ git log --oneline --stat

stars:765

forks:107

updated:16 de mayo de 2026, 23:28

Explorador de archivos

8 archivos

SKILL.md

readonly

related-skills.json

mismo repositorio

model-evaluation.md

from "awslabs/agent-plugins"

Generates python code that evaluates SageMaker models. Supports two evaluation types: LLM-as-Judge and Custom Scorer. Use when the user says "evaluate my model", "test model performance", "how did my model perform", "compare models", or other similar requests.

2026-05-26765

hyperpod-nccl.md

from "awslabs/agent-plugins"

Diagnose NCCL failures and adjacent training-pod failures on HyperPod GPU clusters (EKS or Slurm) — training hangs, AllReduce / collective-op timeouts, EFA or libfabric errors, rendezvous failures, EFA TCP fallback, /dev/shm or memlock issues, NCCL version mismatch across pods, container OOM / exit-137 / OOMKilled, GPU OOM (CUDA out of memory), CrashLoopBackOff / Pending pods, MASTER_ADDR DNS, NetworkPolicy blocking. Not for single-node hardware faults (→ hyperpod-node-debugger § G) or cluster-creation EFA / SSM failures (→ hyperpod-cluster-debugger § A / § F).

2026-05-16765

hyperpod-node-debugger.md

from "awslabs/agent-plugins"

Diagnose and remediate per-node issues on a HyperPod cluster (EKS or Slurm) — a specific node is unhealthy, unresponsive, stuck, or needs replacing. Covers on-node EFA, GPU / accelerator hardware (XID, ECC, NVLink, row-remap, DCGM), Slurm node down/drained, disk and memory pressure, per-node lifecycle-script failures, SSM agent, container runtime, kernel panics, pod networking. Read-only. Not for cluster-wide provisioning (→ hyperpod-cluster-debugger), NCCL (→ hyperpod-nccl), or MFU (→ hyperpod-mfu-debugger).

2026-05-16765

hyperpod-performance-debugger.md

from "awslabs/agent-plugins"

Diagnose performance issues on Amazon SageMaker HyperPod clusters — uneven NCCL bandwidth across nodes and poor filesystem throughput. Read-only. Surfaces host-side signals (Xid, ECC, NVLink, EFA reachability, FSx saturation) and routes to the appropriate sibling skill (hyperpod-node-debugger, hyperpod-nccl, hyperpod-version-checker, hyperpod-issue-report) for any remediation. Triggers on uneven NCCL across nodes, straggler node, FSx slow, checkpoint slow, dataloader slow, filesystem bottleneck, FSx throughput, cross-AZ latency, topology mismatch.

2026-05-16765

hyperpod-slurm-debugger.md

from "awslabs/agent-plugins"

Diagnostic-only skill for Slurm scheduler and node-daemon issues on Amazon SageMaker HyperPod Slurm clusters. Scope mirrors the HyperPod troubleshooting guide. Invoke when the user reports a Slurm node stuck in down/drain, "Node unexpectedly rebooted" after auto-repair, slurmd not running, jobs stuck PENDING with REASON=Resources while sinfo shows idle nodes, jobs stuck COMPLETING after node replacement, GRES/GPU counts wrong, scontrol ping failing, slurmctld unresponsive, an Action:Reboot/Replace request that did not trigger HyperPod auto-recovery, or auto-resume not restarting a job. Also triggers on "drain before reboot", "diagnose a Slurm node", "investigate stuck jobs."

2026-05-16765

hyperpod-ssm.md

from "awslabs/agent-plugins"

Remote command execution and file transfer on SageMaker HyperPod cluster nodes via AWS Systems Manager (SSM). This is the primary interface for accessing HyperPod nodes — direct SSH is not available. Use when any skill, workflow, or user request needs to execute commands on cluster nodes, upload files to nodes, read/download files from nodes, run diagnostics, install packages, or perform any operation requiring shell access to HyperPod instances. Other HyperPod skills depend on this skill for all node-level operations.

2026-05-16765

package.json

"author": "awslabs"

"repository": "awslabs/agent-plugins"

Abrir repositorio de GitHub Ver repositorios del creador

$ install --global

$ download --local

Ejecutar en Manus

$ useful --forSOC

Administradores de redes y sistemas informáticosOcupaciones informáticas y matemáticas15-1244L4

name	hyperpod-cluster-debugger
description	Diagnose and remediate cluster-wide HyperPod (EKS or Slurm) problems — creation / deployment failures (CloudFormation, EFA health check, lifecycle scripts, capacity), EKS access, node replacement, CloudFormation nested-stack errors, post-maintenance rollback state, dangling nodes, autoscaler conflicts. Includes `--validate` pre-flight. Read-only.
metadata	{"version":"0.0.1"}

HyperPod Cluster Debugger

Operating policy. Run read-only diagnostics yourself. Never run a command that changes cluster, node, or workload state — present each one as a Suggested command (run this yourself) block and wait for the customer to run it. Destructive order: investigate → reboot → replace (replace destroys root + secondary volumes; not supported on Slurm controller nodes).

Before any state-changing CLI: ask if it's IaC-managed. HyperPod clusters, SGs, EKS access entries, and IAM are usually provisioned via CloudFormation / CDK / Terraform. If yes, the fix belongs in IaC — running the CLI will drift and the next deploy reverts it. Use the CLI only when IaC is unavailable (locked out, predates IaC, mid-review).

scripts/diagnose-cluster.sh is read-only: it collects state via AWS APIs (and SSM for Slurm controller health) and prints each issue as [FAIL] ... → references/<file>.md § <section>.

Reference	Open when
cluster-diagnostics-detail.md	Per-finding remediation runbook (§ A–L)
cluster-operations.md	Operational deep-dives (EFA SG, EKS access, SSM, Slurm, filesystem)
cloudformation-errors.md	§ H needs the full per-resource CFN error catalog
capacity-planning.md	§ B or `--validate` flags capacity / subnet sizing
lifecycle-scripts.md	§ C points at a specific lifecycle failure
iam-permissions.md	Full IAM policy for the diagnostic

Workflow

Collect HyperPod cluster name (not EKS name), region, exact error string.
Run scripts/diagnose-cluster.sh (or --validate for pre-create).
For every [FAIL] line, Read the referenced section.
Present finding, root cause, and the Suggested-command block verbatim. Wait for customer approval.
Re-run the diagnostic to confirm.

Step 1: Run diagnostics

# Diagnose an existing cluster:
bash scripts/diagnose-cluster.sh --cluster <CLUSTER_NAME_OR_ARN> --region <REGION>

# Pre-flight (no cluster needed) — validates SGs, subnets, IAM, VPC endpoints,
# optionally S3 lifecycle scripts and per-AZ capacity:
bash scripts/diagnose-cluster.sh --validate --region <REGION> \
  --sg-ids <sg-1,sg-2> --subnet-ids <sub-1,sub-2> [--iam-role <role-arn>] \
  [--s3-uri s3://<BUCKET>/path/] [--instance-type ml.p5.48xlarge]

Pass --instance-type when the target instance type is known — enables the per-AZ capacity check (warns if none of the provided subnets are in an AZ that offers that type, which causes insufficient-capacity failures at creation time).

Tags: [PASS] · [FAIL] (counted, has → references/... pointer) · [WARN] · [INFO]. Priorities: P0 blocks operation · P1 degraded · P2 informational.

Step 2: Match signal → section

Error messages / events:

Signal	Section
`"EFA health checks did not run successfully"` (public-doc verbatim signal)	A: EFA Health Checks
Insufficient-capacity or AZ-mismatch failure at creation	B: Capacity & AZ
Lifecycle-script failure or timeout during provisioning	C: Lifecycle Scripts
kubectl auth error (server asks for credentials / no API group list)	D: EKS Access
`InService` but not all instances visible	E: Cluster Provisioning
`"Target is not connected"` / SSM errors	F: SSM Connectivity
Node replacement not happening / `batch-replace` not working	G: Node Replacement
`"Embedded stack failed"` / any CloudFormation error	H: CloudFormation Errors
`UpdateClusterSoftware` failed or cluster in post-maintenance rollback state	J: AMI & Cluster Updates
Dangling / orphaned nodes in EKS vs `list-cluster-nodes`	K: Dangling Nodes & Cleanup
Cluster Autoscaler breaks after HyperPod attached	L: Autoscaler Compatibility
Slow I/O, FSx throughput saturated	cluster-operations.md § 9
Slurm node name → instance ID lookup	I: Utilities

A: EFA Health Checks

SG missing self-reference. Add inbound + outbound self-ref to every SG on the cluster, plus least-privilege egress for the AWS APIs the node needs (HTTPS 443 to S3 / ECR / SageMaker / SSM / STS / CloudWatch Logs — via VPC-endpoint prefix-lists when possible). Full procedure: cluster-diagnostics-detail.md § A.

B: Capacity & AZ

Instance type unavailable in the requested AZ. Verify with describe-instance-type-offerings, then change AZ, use Flexible Training Plans, or request ODCR. Full: § B · strategy: capacity-planning.md.

C: Lifecycle Scripts

Script failed or timed out during provisioning. Read CloudWatch under /aws/sagemaker/Clusters/<name>/<id> — common causes: missing S3 VPC endpoint, IAM gap, CRLF line endings, instance-group name mismatch. Full: § C · layout: lifecycle-scripts.md.

D: EKS Access / kubectl

IAM identity not in EKS access entries. Verify with sts get-caller-identity, create an access entry with admin policy, update kubeconfig. Full: § D.

E: Cluster Provisioning

InService without all instances is expected under Continuous Provisioning — failures surface as events, not cluster errors. For stuck Creating/Updating/Deleting: check CFN nested stacks (§ H), IAM, capacity, events; if stuck Deleting check VPC ENI dependencies. Full: § E.

F: SSM Connectivity

Target is not connected: use sagemaker-cluster:<CLUSTER_ID>_<GROUP>-<INSTANCE_ID> format (not raw EC2 ID), install session-manager-plugin, confirm node Running. Check IAM + VPC endpoints on timeouts. Full: § F.

G: Node Replacement

Auto-repair: confirm NodeRecovery=Automatic, check Health Monitoring Agent (HMA) logs + node labels / Slurm reason, confirm capacity. Manual: reboot first, replace only if reboot fails. Replace requires the cluster to have been patched via UpdateClusterSoftware at least once and cannot target a Slurm controller node. Full: § G.

H: CloudFormation Errors

Embedded stack failed hides the real error. Drill into nested stacks via Events tab (filter Failed) until you reach a non-stack resource. CLI: describe-stack-events --query 'StackEvents[?ResourceStatus==\CREATE_FAILED`]'`. Also covers SLR creation failures and permission-boundary denials. Full: § H · catalog: cloudformation-errors.md.

I: Utilities

Map Slurm node names (ip-10-x-y-z) to HyperPod instance IDs via list-cluster-nodes or on-node /opt/ml/config/resource_config.json. Full: § I.

J: AMI & Cluster Updates

UpdateClusterSoftware fails and rolls back, or the cluster stays in a post-maintenance rollback state. Common causes: lifecycle script incompatible with new AMI, HMA version too old, insufficient rolling-update capacity. If the cluster has active nodes, collect diagnostics and escalate rather than delete-and-recreate. Full: § J.

K: Dangling Nodes & Cleanup

Nodes in kubectl get nodes but not in list-cluster-nodes (ghost EKS nodes), or the inverse (HyperPod nodes that never registered kubelet). Script flags both. Full: § K.

L: Autoscaler Compatibility

Cluster Autoscaler errors on HyperPod provider IDs and breaks autoscaling for all node groups. No officially endorsed workaround — escalate to AWS Support. Karpenter does not conflict with HyperPod nodes by default. Full: § L.

Prerequisites

aws CLI v2.13+ authenticated to the cluster's account
jq, python3, bash 4.2+
kubectl authenticated to the EKS cluster (EKS checks skipped if absent)
session-manager-plugin (Slurm controller health checks only)

IAM policy: references/iam-permissions.md.

Defaults

Region — required: pass --region or set $AWS_DEFAULT_REGION.
Mode — --cluster <NAME> (diagnose) or --validate (pre-create).
Event window — up to 500 most recent events (5 × 100, paginated).
Colors — auto-disabled on non-TTY; --no-color to force off.

Error handling

Failure	Script	Tell the customer
`aws sts get-caller-identity` fails	Exit 1	"Fix AWS credentials and rerun."
Cluster not found	Exit 1 after listing region's clusters	"Confirm HyperPod cluster name (not EKS) and region."
`sagemaker:` / `ec2:` / `eks:` / `logs:` denied	Warn, add `Missing IAM permission for <API>`, continue	"Grant the listed IAM action and rerun."
`kubectl` absent or unauthenticated	Skip EKS checks (access entries, add-ons, aws-auth, nodes)	"Install/authenticate kubectl."
`session-manager-plugin` absent (Slurm)	Skip Slurm controller probe	"Install session-manager-plugin."
SSM throttled / times out (180s)	Retry with backoff; warn and continue if still failing	"Rerun later — script is idempotent."
CloudWatch log group not found	Skip CloudWatch check	"CloudWatch not configured on this cluster."

Exit codes: 0 no critical failures · 1 one or more critical failures (cluster not found, fatal prerequisite missing, or any [FAIL] in diagnose or --validate mode). [WARN] lines do not affect the exit code.

Skill delegation

Need	Use
Shell on nodes	`hyperpod-ssm`
Version comparison across nodes	`hyperpod-version-checker`

Escalate to AWS Support

Escalate when:

EFA health checks fail despite correct SG rules.
Capacity errors persist despite a valid Flexible Training Plan / ODCR.
Node replacement fails repeatedly without clear events / log signal.
Cluster stuck in a non-terminal state (Creating, Updating, or a post-maintenance rollback state) for an extended period.
CloudFormation root-cause is an internal service error.

Before opening the case

Run these commands and attach the output. Goal: AWS Support has everything at case open.

# 1. Cluster identity + status (confirms region, ARN, orchestrator, instance groups)
aws sagemaker describe-cluster --cluster-name <CLUSTER> --region <REGION>

# 2. Full cluster-level diagnostic bundle
bash scripts/diagnose-cluster.sh --cluster <CLUSTER> --region <REGION> > diag.txt

# 3. Per-node log/config bundle to S3 (delegates to hyperpod-issue-report skill)
#    See skills/hyperpod-issue-report/SKILL.md for the exact invocation.

Include in the case

Cluster name + ARN (or ClusterId suffix) and AWS region
ClusterStatus + FailureMessage from describe-cluster
Timestamp window (UTC start / end) of the failure
Exact error strings observed (copy verbatim from events / logs / console)
Affected instance IDs / NodeLogicalIds / instance group names
diag.txt from step 2 above
S3 URI of the hyperpod-issue-report bundle from step 3

hyperpod-cluster-debugger

Más de este repositorio

Más de este repositorio

HyperPod Cluster Debugger

Workflow

Step 1: Run diagnostics

Step 2: Match signal → section

A: EFA Health Checks

B: Capacity & AZ

C: Lifecycle Scripts

D: EKS Access / kubectl

E: Cluster Provisioning

F: SSM Connectivity

G: Node Replacement

H: CloudFormation Errors

I: Utilities

J: AMI & Cluster Updates

K: Dangling Nodes & Cleanup

L: Autoscaler Compatibility

Prerequisites

Defaults

Error handling

Skill delegation

Escalate to AWS Support

Before opening the case

Include in the case

HyperPod Cluster Debugger

Workflow

Step 1: Run diagnostics

Step 2: Match signal → section

A: EFA Health Checks

B: Capacity & AZ

C: Lifecycle Scripts

D: EKS Access / kubectl

E: Cluster Provisioning

F: SSM Connectivity

G: Node Replacement

H: CloudFormation Errors

I: Utilities

J: AMI & Cluster Updates

K: Dangling Nodes & Cleanup

L: Autoscaler Compatibility

Prerequisites

Defaults

Error handling

Skill delegation

Escalate to AWS Support

Before opening the case

Include in the case