Jeden Skill in Manus ausführen
mit einem Klick

Jeden Skill in Manus mit einem Klick ausführen

$pwd:

hyperpod-slurm-debugger

Name: Hyperpod Slurm Debugger
Author: awslabs

// Diagnostic-only skill for Slurm scheduler and node-daemon issues on Amazon SageMaker HyperPod Slurm clusters. Scope mirrors the HyperPod troubleshooting guide. Invoke when the user reports a Slurm node stuck in down/drain, "Node unexpectedly rebooted" after auto-repair, slurmd not running, jobs stuck PENDING with REASON=Resources while sinfo shows idle nodes, jobs stuck COMPLETING after node replacement, GRES/GPU counts wrong, scontrol ping failing, slurmctld unresponsive, an Action:Reboot/Replace request that did not trigger HyperPod auto-recovery, or auto-resume not restarting a job. Also triggers on "drain before reboot", "diagnose a Slurm node", "investigate stuck jobs."

In Manus ausführen

$ git log --oneline --stat

stars:765

forks:107

updated:16. Mai 2026 um 23:28

Datei-Explorer

3 Dateien

SKILL.md

readonly

related-skills.json

gleiches Repository

model-evaluation.md

from "awslabs/agent-plugins"

Generates python code that evaluates SageMaker models. Supports two evaluation types: LLM-as-Judge and Custom Scorer. Use when the user says "evaluate my model", "test model performance", "how did my model perform", "compare models", or other similar requests.

2026-05-26765

hyperpod-cluster-debugger.md

from "awslabs/agent-plugins"

Diagnose and remediate cluster-wide HyperPod (EKS or Slurm) problems — creation / deployment failures (CloudFormation, EFA health check, lifecycle scripts, capacity), EKS access, node replacement, CloudFormation nested-stack errors, post-maintenance rollback state, dangling nodes, autoscaler conflicts. Includes `--validate` pre-flight. Read-only.

2026-05-16765

hyperpod-nccl.md

from "awslabs/agent-plugins"

Diagnose NCCL failures and adjacent training-pod failures on HyperPod GPU clusters (EKS or Slurm) — training hangs, AllReduce / collective-op timeouts, EFA or libfabric errors, rendezvous failures, EFA TCP fallback, /dev/shm or memlock issues, NCCL version mismatch across pods, container OOM / exit-137 / OOMKilled, GPU OOM (CUDA out of memory), CrashLoopBackOff / Pending pods, MASTER_ADDR DNS, NetworkPolicy blocking. Not for single-node hardware faults (→ hyperpod-node-debugger § G) or cluster-creation EFA / SSM failures (→ hyperpod-cluster-debugger § A / § F).

2026-05-16765

hyperpod-node-debugger.md

from "awslabs/agent-plugins"

Diagnose and remediate per-node issues on a HyperPod cluster (EKS or Slurm) — a specific node is unhealthy, unresponsive, stuck, or needs replacing. Covers on-node EFA, GPU / accelerator hardware (XID, ECC, NVLink, row-remap, DCGM), Slurm node down/drained, disk and memory pressure, per-node lifecycle-script failures, SSM agent, container runtime, kernel panics, pod networking. Read-only. Not for cluster-wide provisioning (→ hyperpod-cluster-debugger), NCCL (→ hyperpod-nccl), or MFU (→ hyperpod-mfu-debugger).

2026-05-16765

hyperpod-performance-debugger.md

from "awslabs/agent-plugins"

Diagnose performance issues on Amazon SageMaker HyperPod clusters — uneven NCCL bandwidth across nodes and poor filesystem throughput. Read-only. Surfaces host-side signals (Xid, ECC, NVLink, EFA reachability, FSx saturation) and routes to the appropriate sibling skill (hyperpod-node-debugger, hyperpod-nccl, hyperpod-version-checker, hyperpod-issue-report) for any remediation. Triggers on uneven NCCL across nodes, straggler node, FSx slow, checkpoint slow, dataloader slow, filesystem bottleneck, FSx throughput, cross-AZ latency, topology mismatch.

2026-05-16765

hyperpod-ssm.md

from "awslabs/agent-plugins"

Remote command execution and file transfer on SageMaker HyperPod cluster nodes via AWS Systems Manager (SSM). This is the primary interface for accessing HyperPod nodes — direct SSH is not available. Use when any skill, workflow, or user request needs to execute commands on cluster nodes, upload files to nodes, read/download files from nodes, run diagnostics, install packages, or perform any operation requiring shell access to HyperPod instances. Other HyperPod skills depend on this skill for all node-level operations.

2026-05-16765

package.json

"author": "awslabs"

"repository": "awslabs/agent-plugins"

GitHub-Repository öffnen Creator-Repositorys ansehen

$ install --global

$ download --local

In Manus ausführen

$ useful --forSOC

Netzwerk- und ComputersystemadministratorenInformatik- und Mathematikberufe15-1244L4

name	hyperpod-slurm-debugger
description	Diagnostic-only skill for Slurm scheduler and node-daemon issues on Amazon SageMaker HyperPod Slurm clusters. Scope mirrors the HyperPod troubleshooting guide. Invoke when the user reports a Slurm node stuck in down/drain, "Node unexpectedly rebooted" after auto-repair, slurmd not running, jobs stuck PENDING with REASON=Resources while sinfo shows idle nodes, jobs stuck COMPLETING after node replacement, GRES/GPU counts wrong, scontrol ping failing, slurmctld unresponsive, an Action:Reboot/Replace request that did not trigger HyperPod auto-recovery, or auto-resume not restarting a job. Also triggers on "drain before reboot", "diagnose a Slurm node", "investigate stuck jobs."
metadata	{"version":"0.0.1"}

HyperPod Slurm Debugger

Diagnostic-only. Identify and classify Slurm scheduler and node-daemon issues on HyperPod Slurm clusters. Do not run, recommend, or print any state-mutating command. For remediation, link to the official AWS or Slurm documentation.

When to invoke

Invoke when the user reports any of the symptoms in the decision table.

When NOT to invoke

Cluster has Orchestrator.Eks — invoke hyperpod-node-debugger or hyperpod-nccl.
Single-node hardware fault with healthy Slurm scheduler — invoke hyperpod-node-debugger.
NCCL training-hang investigation — invoke hyperpod-nccl.
Node unreachable via SSM — invoke hyperpod-ssm.

Constraints

Read-only. Do not run, recommend, or print state-mutating commands.
For any remediation, link to AWS or Slurm docs. The user authorizes and executes.
IaC-managed cluster (Terraform / CloudFormation / CDK): warn that direct mutation drifts the live state from the IaC plan.

Canonical recovery URLs: references/slurm-details.md → Authoritative recovery documentation.

Prerequisites

AWS CLI v2, authenticated for the target account and region with permissions:
- sagemaker:DescribeCluster, sagemaker:ListClusterNodes
- ssm:StartSession on the HyperPod-created SSM document
Session Manager plugin installed locally.
jq ≥ 1.6.
unbuffer (from the expect package). Required — without it aws ssm start-session returns empty stdout intermittently with Cannot perform start session: EOF and every check silently misreports. Install: expect package on Amazon Linux / RHEL / Debian / Ubuntu / macOS. Script exits at prerequisite check if missing.

Procedure

Step 1 — Collect inputs

Ask the user for:

HyperPod cluster name (not Slurm partition name).
AWS region.
Optional: a specific Slurm node name.

Step 2 — Confirm orchestrator

aws sagemaker describe-cluster --cluster-name <NAME/ARN> --region <REGION> \
  --query 'Orchestrator' --output json

If Orchestrator.Eks is present, stop. Route per When NOT to invoke.

Step 3 — Run the diagnostic script

bash scripts/slurm-diagnose.sh --cluster <NAME> --region <REGION>
# Scope to a node:
bash scripts/slurm-diagnose.sh --cluster <NAME> --region <REGION> --node <SLURM_NODE>

Relay the script output to the user verbatim.

Step 4 — Map findings → docs

For each finding, look up the section in the decision table and link the user to the corresponding AWS / Slurm doc. Do not type out remediation commands.

Decision table

Symptom (`sinfo -o "%N %T %30E"` or script finding)	Section
Node state = `down` or `down*`, reason other than below	A: Node Down
Node state = `down*`, Reason = `Node unexpectedly rebooted`	B: Unexpected Reboot
Jobs `PENDING` with `REASON=Resources` while nodes are idle	C: Controller State
Jobs stuck `COMPLETING` after node replacement	C: Controller State
`scontrol ping` returns `DOWN` for the controller	C: Controller State
GRES (GPU) counts incorrect or not released	C: Controller State
`state=fail` issued but no recovery occurred	D: Action Reason Mismatch
Accounting errors or RPC errors mentioning `dbd`	C: Controller State (slurmdbd)
`slurm.conf` edited; new partitions or nodes not visible	C: Controller State (config)
Job exited on a hardware failure but did not restart	E: Auto-resume

Defaults

Behavior	Default	Override
Mode	read-only — always; no remediation flag exists	n/a
Region	`$AWS_DEFAULT_REGION`, falling back to `us-east-1`	`--region <R>`
Scope	all nodes in `down` / `drain` / `fail` / "unexpectedly rebooted"	`--node <SLURM_NODE_NAME>`
Output	colorized terminal	`--no-color`
SSM target format	`sagemaker-cluster:<clusterId>_<instanceGroupName>-<instanceId>` (derived)	n/a
Controller discovery	`--controller-group` (if set) → `SlurmConfig.NodeType=Controller` → `provisioning_parameters.json`	`--controller-group <N>`

Error handling

Failure	Skill behavior	Required user action
`describe-cluster` fails	Print AWS error; exit 1	Fix credentials/region; verify cluster name
Cluster has `Orchestrator.Eks`	Exit 1 with pointer to EKS-side skills	Use `hyperpod-node-debugger` or `hyperpod-nccl`
`session-manager-plugin` missing / SSM unreachable	`sinfo` returns empty; exit 1	Install plugin; verify node `InService`
Disk ≥ 95 % full on a `down` node	Report finding `disk-full-<node>`	Refer to AWS troubleshooting docs
Missing `jq` or `aws`	Exit 1 at prerequisite check	Install per Prerequisites

A: Node Down

Node is down because slurmd stopped responding. Causes: slurmd crash, disk full, OOM, network partition, hardware fault.

Script checks: systemctl is-active slurmd, srun -w <NODE> hostname (RPC layer), disk, memory.

Link: https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/troubleshooting-doc-20250917/troubleshoot/index.md

If node returns to down after a manual resume → escalate to hyperpod-node-debugger.

Context: references/slurm-details.md § A.

B: Unexpected Reboot

Node is down* with Reason "Node unexpectedly rebooted" because slurmd re-registered after an out-of-band reboot. Upstream Slurm behavior, not HyperPod. Node is typically healthy.

Links:

https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/troubleshooting-doc-20250917/troubleshoot/index.md
https://slurm.schedmd.com/scontrol.html (state=resume semantics)

If node reboots again within minutes → escalate to hyperpod-node-debugger.

Context: references/slurm-details.md § B.

C: Controller State

slurmctld in-memory state can desync from the on-disk state. A controller restart reloads from StateSaveLocation and clears bad caches. User decides and executes.

Restart may help:

Symptom	Why
`PENDING` with `REASON=Resources`, idle nodes	Re-evaluates the queue
Jobs stuck `COMPLETING` after node replacement	Controller held a reference to the old node
GRES (GPU, EFA) not released after a job ends	Resource accounting de-synced
Nodes stuck `Unknown` after reboot, `slurmd` is up	Re-registration was not processed
`scontrol ping` times out	Controller event loop is hung
Lost connection to `slurmdbd` / RPC errors	DBD connection wedged

Do NOT restart when:

HyperPod replacement (Action:Replace) in progress on any node — concurrent changes fail the replacement.
Only one compute node is bad — restart slurmd on that node.
sinfo and squeue are responsive — problem is elsewhere.
journalctl -u slurmctld not reviewed yet — panic / OOM will reproduce.
slurm.conf was just edited — try scontrol reconfigure first.

Folded triggers

slurmdbd disconnected — sacct fails, accounting fields show Unknown, controller log spams Unable to contact slurmdbd. Restore slurmdbd before considering controller restart. https://slurm.schedmd.com/accounting.html · details.
Stale config — slurm.conf / topology.conf mtime > slurmctld start. scontrol reconfigure first; restart is fallback. https://slurm.schedmd.com/scontrol.html · details.

Restart procedure / what's preserved:

Context: references/slurm-details.md § C.

D: Action Reason Mismatch

scontrol update state=fail reason=... was issued with a reason that does not match Action:Reboot or Action:Replace exactly. HyperPod silently ignores anything else. Script detects near-misses on nodes in fail state.

Required strings (case-sensitive, no whitespace, no punctuation):

Action:Reboot
Action:Replace

Link: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance.html

Context: references/slurm-details.md § Action reason-string validation.

E: Auto-resume

--auto-resume=1 is an srun step option. It re-runs the step after HMA (the Health Monitoring Agent) flags a node and Automatic node recovery replaces it.

Why it didn't restart the job:

Flag on sbatch not srun — per-step; sbatch directives are silently ignored.
HMA did not flag the node — failure was application/transient, not hardware. Step exits as a normal Slurm failure.
Cluster NodeRecovery is None — faulty nodes are labeled but not replaced.
No checkpointing — step restarts from process zero each iteration.
AMI predates HMA support (released 2025-09-11) — needs AMI / cluster-software update.

Link: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-auto-resume.html

Context: references/slurm-details.md § HyperPod auto-resume.

Escalation

Condition	Next skill
Node returns to `down` shortly after a manual resume	`hyperpod-node-debugger` (hardware)
`slurmd` logs contain CUDA / NVIDIA / XID errors	`hyperpod-node-debugger` § G
Disk full or `/dev/shm` exhausted	`hyperpod-node-debugger` § I
Node unreachable via SSM	`hyperpod-ssm`
Controller restart does not clear `COMPLETING` after 2 attempts	`hyperpod-issue-report` + AWS Support

hyperpod-slurm-debugger

Mehr aus diesem Repository

HyperPod Slurm Debugger

When to invoke

When NOT to invoke

Constraints

Prerequisites

Procedure

Step 1 — Collect inputs

Step 2 — Confirm orchestrator

Step 3 — Run the diagnostic script

Step 4 — Map findings → docs

Decision table

Defaults

Error handling

A: Node Down

B: Unexpected Reboot

C: Controller State

Folded triggers

D: Action Reason Mismatch

E: Auto-resume

Escalation

HyperPod Slurm Debugger

When to invoke

When NOT to invoke

Constraints

Prerequisites

Procedure

Step 1 — Collect inputs

Step 2 — Confirm orchestrator

Step 3 — Run the diagnostic script

Step 4 — Map findings → docs

Decision table

Defaults

Error handling

A: Node Down

B: Unexpected Reboot

C: Controller State

Folded triggers

D: Action Reason Mismatch

E: Auto-resume

Escalation

Mehr aus diesem Repository