
Name: Deploy Slurm Cluster
Author: awslabs

Install cert-manager, AWS LB Controller (Pod Identity), MariaDB, Slurm operator, and Slurm cluster on HyperPod EKS using install.sh, including subnet tagging and NLB configuration for SSH access


Updated: March 25, 2026, 17:12

Related skills in the same repository (awslabs/awsome-distributed-ai):

bash-testing -- Patterns for unit testing bash scripts using bats-core, including AWS CLI mocking, jq/sed/awk testing, and cross-platform portability
build-slurm-image -- Build the Slurm compute node container image via CodeBuild or local Docker, generate SSH keys, and render Helm values using setup.sh
deploy-infrastructure -- Deploy HyperPod EKS infrastructure using deploy.sh via CloudFormation or Terraform, including AZ resolution, stack outputs, and kubeconfig setup
deployment-preflight -- Validate all prerequisites (CLI tools, AWS credentials, environment variables, kubectl context) before running slinky-slurm deployment scripts
validate-deployment -- Post-deployment health checks for slinky-slurm including pod status, Slurm node registration, SSH connectivity, and test job submission


name	deploy-slurm-cluster
description	Install cert-manager, AWS LB Controller (Pod Identity), MariaDB, Slurm operator, and Slurm cluster on HyperPod EKS using install.sh, including subnet tagging and NLB configuration for SSH access

Deploy Slurm Cluster

Overview

Use this skill to deploy the Slurm cluster on a HyperPod EKS cluster. This is Phase 3 of the slinky-slurm deployment workflow and represents the final step to get a running Slurm cluster with SSH access.

The install.sh script orchestrates:

Running setup.sh (unless --skip-setup)
Installing cert-manager (v1.19.2)
Tagging public subnets for internet-facing NLBs
Installing the AWS Load Balancer Controller (chart v1.11.0) with Pod Identity
Applying FSx Lustre PV/PVC
Installing MariaDB operator and instance
Installing the Slurm operator (Slinky)
Installing the Slurm cluster via Helm
Configuring NLB for SSH login access
Waiting for the NLB endpoint

Prerequisites

Phase 1 complete: HyperPod EKS infrastructure deployed (deploy.sh)
env_vars.sh exists: Generated by deploy.sh, provides EKS_CLUSTER_NAME and VPC_ID (required for LB Controller and subnet tagging)
EBS CSI driver installed: With a gp3 StorageClass (required for MariaDB and Slurm persistent volumes). See the HyperPod EBS CSI docs.
kubectl configured: Context points to the correct EKS cluster
Helm installed: helm command available
curl installed: Used for public IP detection and IAM policy download
slurm-values.yaml: Generated by setup.sh (or pass flags to let install.sh run setup.sh automatically)

See the deployment-preflight skill for detailed prerequisite validation.
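As a rough illustration of the checks listed above, the following sketch reports which required CLI tools and files are missing. The helper names `missing_tools` and `missing_files` are hypothetical; they are not taken from install.sh or from the deployment-preflight skill.

```shell
# Hypothetical preflight helpers; not part of install.sh.

# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Print every path in the argument list that is not an existing regular file.
missing_files() {
    for path in "$@"; do
        [ -f "$path" ] || printf '%s\n' "$path"
    done
}
```

For example, `missing_tools kubectl helm curl aws` and `missing_files env_vars.sh` print nothing when everything is in place, so empty output can be treated as a pass.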

Steps

Step 1: Choose how to run install.sh

Full install (runs setup.sh internally):

# Build image via CodeBuild + install cluster (g5 via CloudFormation)
bash install.sh --instance-type ml.g5.8xlarge --infra cfn

# Build image locally + install cluster (p5 via Terraform)
bash install.sh --instance-type ml.p5.48xlarge --instance-count 2 --infra tf --local-build

# Skip image build (image already in ECR) + install cluster
bash install.sh --instance-type ml.g5.8xlarge --infra cfn --skip-build

Skip setup (slurm-values.yaml already exists):

# Use previously generated slurm-values.yaml
bash install.sh --skip-setup

WARNING: --skip-setup requires that slurm-values.yaml already exists in the project directory. The script will fail if it's missing.

Step 2: Wait for completion

install.sh will:

Install cert-manager and wait for webhook readiness
Tag public subnets with kubernetes.io/role/elb=1
Create IAM policy/role for LB Controller (Pod Identity), install chart
Apply FSx Lustre PV/PVC
Install MariaDB (waits up to 5 minutes for readiness)
Install Slurm operator
Install Slurm cluster
Configure NLB (waits up to 5 minutes for endpoint)
Print the SSH command

Total time: approximately 5-10 minutes after setup.sh completes.

Step 3: Connect to the login node

# The SSH command is printed at the end of install.sh
ssh -i ~/.ssh/id_ed25519_slurm root@<NLB_HOSTNAME>

What install.sh Does Internally

Phase A: Setup (unless --skip-setup)

Calls setup.sh with pass-through flags:

bash setup.sh ${SETUP_ARGS}

All --instance-type, --instance-count, --infra, --repo-name, --tag, --local-build, --skip-build, and --region flags are passed through to setup.sh.

If --skip-setup is specified, verifies slurm-values.yaml exists.
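The `--skip-setup` check could look roughly like the guard below. This is a minimal sketch: the function name `require_values_file` and the error wording are assumptions, not the script's actual code.

```shell
# Hypothetical guard mirroring the --skip-setup check described above.
# Argument: path to the values file (defaults to slurm-values.yaml).
# Returns 0 if the file exists; otherwise prints an error and returns 1.
require_values_file() {
    values_file="${1:-slurm-values.yaml}"
    if [ ! -f "$values_file" ]; then
        echo "Error: $values_file not found; run setup.sh first or drop --skip-setup" >&2
        return 1
    fi
}
```

A caller would typically run `require_values_file || exit 1` before any Helm command that reads the file.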

Phase B: Source env_vars.sh

Sources env_vars.sh (generated by deploy.sh) to get EKS_CLUSTER_NAME and VPC_ID. Both are required — the script exits with an error if either is missing.
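One way to express this "source, then verify" step is sketched below. The function name `load_env_vars` and the error messages are hypothetical; only the file name and the two variable names come from the text above.

```shell
# Hypothetical helper: source an env file, then fail if a required variable
# is unset or empty. Argument: path to the env file (defaults to env_vars.sh).
load_env_vars() {
    env_file="${1:-env_vars.sh}"
    # POSIX "." resolves a bare filename via PATH, so anchor it to the current directory.
    case "$env_file" in
        */*) ;;
        *) env_file="./$env_file" ;;
    esac
    if [ ! -f "$env_file" ]; then
        echo "Error: $env_file not found; run deploy.sh first" >&2
        return 1
    fi
    . "$env_file"
    for name in EKS_CLUSTER_NAME VPC_ID; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Error: $name not set in $env_file" >&2
            return 1
        fi
    done
}
```

Checking the variables after sourcing, rather than only checking that the file exists, catches a stale or partially written env_vars.sh.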

Phase C: cert-manager Installation

Adds the jetstack Helm repo
Installs cert-manager chart (v1.19.2) in cert-manager namespace
Waits for the cert-manager webhook deployment to be Available (120s timeout)

# What the script runs:
helm install cert-manager jetstack/cert-manager \
    --version=1.19.2 --namespace=cert-manager --create-namespace \
    --set crds.enabled=true --wait

kubectl wait --for=condition=Available \
    deployment/cert-manager-webhook -n cert-manager --timeout=120s

The webhook wait prevents race conditions with components that create Certificate resources at install time (e.g., the Slurm operator).

Phase D: Public Subnet Tagging

Tags all public subnets in the VPC with kubernetes.io/role/elb=1. The AWS LB Controller requires this tag to discover subnets for internet-facing NLBs. The HyperPod CFN template does not add this tag.

# What the script runs:
PUBLIC_SUBNET_IDS=$(aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=${VPC_ID}" \
              "Name=map-public-ip-on-launch,Values=true" \
    --query "Subnets[].SubnetId" --output text)

for SUBNET_ID in ${PUBLIC_SUBNET_IDS}; do
    aws ec2 create-tags --resources "${SUBNET_ID}" \
        --tags "Key=kubernetes.io/role/elb,Value=1"
done
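To confirm that the tagging took effect, the same `describe-subnets` call can be filtered on the tag itself. The sketch below uses the standard `tag:<key>` filter of the AWS CLI; the function name `list_elb_tagged_subnets` is hypothetical and this check is not described as part of install.sh.

```shell
# Hypothetical check: list subnets in a VPC that already carry the ELB role tag.
# Argument: VPC ID. Prints the matching subnet IDs as text.
list_elb_tagged_subnets() {
    vpc_id="$1"
    aws ec2 describe-subnets \
        --filters "Name=vpc-id,Values=${vpc_id}" \
                  "Name=tag:kubernetes.io/role/elb,Values=1" \
        --query "Subnets[].SubnetId" --output text
}
```

Running `list_elb_tagged_subnets "${VPC_ID}"` after the tagging loop should print the same subnet IDs that the loop iterated over; empty output suggests the tags were not applied.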

Phase E: AWS Load Balancer Controller Installation

Uses Pod Identity (not IRSA/eksctl) for IAM:

Creates IAM policy AWSLoadBalancerControllerIAMPolicy_slinky (from the upstream controller v2.11.0 policy JSON; note that this is the controller version, not the Helm chart version v1.11.0)
Creates IAM role AmazonEKS_LB_Controller_Role_slinky with pods.eks.amazonaws.com trust policy
Attaches the policy to the role
Creates a Pod Identity association for the aws-load-balancer-controller service account in kube-system
Installs the aws-load-balancer-controller Helm chart (v1.11.0)
Waits for the LB Controller deployment to be Available (120s timeout)

# What the script runs (simplified):
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
    --version=1.11.0 --namespace=kube-system \
    --set clusterName="${EKS_CLUSTER_NAME}" \
    --set region="${AWS_REGION}" \
    --set vpcId="${VPC_ID}" --wait
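The IAM side of Phase E is not shown in the simplified command above. The sketch below illustrates what a Pod Identity trust policy and association generally look like. The JSON follows the documented EKS Pod Identity trust policy shape, not necessarily the exact file install.sh writes, and both function names are hypothetical.

```shell
# Print a trust policy that lets the EKS Pod Identity service assume a role.
# Assumption: install.sh uses an equivalent document; its exact JSON is not shown here.
pod_identity_trust_policy() {
    cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "pods.eks.amazonaws.com" },
      "Action": ["sts:AssumeRole", "sts:TagSession"]
    }
  ]
}
EOF
}

# Associate a role with the LB Controller service account in kube-system.
# Arguments: EKS cluster name, IAM role ARN.
associate_lb_controller_role() {
    aws eks create-pod-identity-association \
        --cluster-name "$1" \
        --namespace kube-system \
        --service-account aws-load-balancer-controller \
        --role-arn "$2"
}
```

Unlike IRSA, this approach needs no OIDC provider or service account annotation; the association itself links the service account to the role.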

Phase F: FSx Lustre PV/PVC

Applies the pre-generated lustre-pvc-slurm.yaml (rendered from lustre-pvc-slurm.yaml.template by setup.sh) which creates a static PersistentVolume and PersistentVolumeClaim pointing to the FSx filesystem provisioned by the HyperPod CFN stack.

# What the script runs:
kubectl create namespace slurm
kubectl apply -f lustre-pvc-slurm.yaml

Phase G: MariaDB Installation

Adds the mariadb-operator Helm repo
Installs mariadb-operator chart (v25.10.4) in mariadb namespace
Applies mariadb.yaml (MariaDB custom resource in slurm namespace)
Waits for MariaDB to be Ready (up to 300s timeout)

# What the script runs:
helm install mariadb-operator mariadb-operator/mariadb-operator \
    --version=25.10.4 --namespace=mariadb --create-namespace \
    --set crds.enabled=true --wait

kubectl apply -f mariadb.yaml
kubectl wait --for=condition=Ready mariadb/mariadb -n slurm --timeout=300s

Phase H: Slurm Operator Installation

Deletes stale CRDs if upgrading from an older version
Installs slurm-operator OCI chart (v1.0.1) in slinky namespace

# What the script runs:
helm install slurm-operator \
    oci://ghcr.io/slinkyproject/charts/slurm-operator \
    --version=1.0.1 --namespace=slinky --create-namespace \
    --set crds.enabled=true --wait

Phase I: Slurm Cluster Installation

Installs the slurm OCI chart (v1.0.1) with the generated values file:

# What the script runs:
helm install slurm \
    oci://ghcr.io/slinkyproject/charts/slurm \
    --values=slurm-values.yaml \
    --version=1.0.1 --namespace=slurm
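Because `helm install` fails when a release of the same name already exists, a re-runnable script needs some form of existence check. The sketch below shows one common approach using `helm status`; the function name `release_exists` is hypothetical, and whether install.sh performs its check this way is an assumption.

```shell
# Hypothetical idempotency check: succeed if a Helm release already exists.
# Arguments: release name, namespace.
release_exists() {
    helm status "$1" --namespace "$2" >/dev/null 2>&1
}
```

A caller could then write `release_exists slurm slurm || helm install slurm ...` so that a second run skips the install instead of failing.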

Phase J: NLB Configuration

Waits for the slurm-login-slinky service to exist
Gets the caller's public IP via curl https://checkip.amazonaws.com
Renders slurm-login-service-patch.yaml from the template, replacing the ${ip_address} placeholder with the caller's public IP so that the NLB source range is restricted to that address
Patches the service with NLB annotations
Polls for the NLB hostname (up to 60 iterations, 5s apart = 5 minutes)

# What the script runs:
IP_ADDRESS=$(curl -s https://checkip.amazonaws.com)
sed "s|\${ip_address}|${IP_ADDRESS}|g" \
    slurm-login-service-patch.yaml.template \
    > slurm-login-service-patch.yaml

kubectl patch service slurm-login-slinky -n slurm \
    --patch-file slurm-login-service-patch.yaml
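The polling step (60 iterations, 5 seconds apart) is not shown in the commands above. A minimal sketch of such a loop follows; the function name `wait_for_nlb_hostname` and its error message are hypothetical, and the jsonpath expression is the usual way to read a LoadBalancer hostname from a Service rather than a quote from install.sh.

```shell
# Hypothetical polling helper.
# Arguments: max attempts (default 60), seconds between attempts (default 5).
# Prints the NLB hostname and returns 0 once it is assigned; returns 1 on timeout.
wait_for_nlb_hostname() {
    attempts="${1:-60}"
    interval="${2:-5}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        nlb_host=$(kubectl get svc slurm-login-slinky -n slurm \
            -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' 2>/dev/null)
        if [ -n "$nlb_host" ]; then
            printf '%s\n' "$nlb_host"
            return 0
        fi
        sleep "$interval"
        i=$((i + 1))
    done
    echo "Error: NLB hostname not available after $attempts attempts" >&2
    return 1
}
```

With the defaults this waits up to 5 minutes, matching the timeout described above; the captured hostname can be passed straight to the `ssh` command.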

Command Reference

Usage: install.sh [OPTIONS]

Optional:
  --skip-setup              Use previously generated slurm-values.yaml
  --region <region>         AWS region (default: AWS CLI configured or us-west-2)
  --help                    Show help

Options passed through to setup.sh:
  --instance-type <type>    SageMaker instance type (e.g., ml.g5.8xlarge, ml.p5.48xlarge)
  --instance-count <N>      Number of compute node instances (default: 4)
  --infra <cfn|tf>          Infrastructure method for CodeBuild stack
  --repo-name <name>        ECR repository name
  --tag <tag>               Image tag
  --local-build             Build image locally instead of CodeBuild
  --skip-build              Skip image build (use existing image in ECR)
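To make the split between install.sh's own flags and the pass-through flags concrete, the sketch below parses the options listed above into `SKIP_SETUP`, `REGION`, and `SETUP_ARGS`. It is an illustration only: the function name `parse_install_args` is hypothetical, `--help` is omitted, and the real script's parsing may differ.

```shell
# Hypothetical sketch of separating install.sh flags from setup.sh pass-through flags.
# Sets SKIP_SETUP (true/false), REGION, and SETUP_ARGS (space-separated string).
parse_install_args() {
    SKIP_SETUP=false
    REGION=""
    SETUP_ARGS=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --skip-setup)
                SKIP_SETUP=true
                shift ;;
            --region|--instance-type|--instance-count|--infra|--repo-name|--tag)
                # These flags take a value; refuse to continue if it is missing.
                if [ "$#" -lt 2 ]; then
                    echo "Missing value for $1" >&2
                    return 1
                fi
                if [ "$1" = "--region" ]; then
                    REGION="$2"
                fi
                SETUP_ARGS="${SETUP_ARGS} $1 $2"
                shift 2 ;;
            --local-build|--skip-build)
                SETUP_ARGS="${SETUP_ARGS} $1"
                shift ;;
            *)
                echo "Unknown option: $1" >&2
                return 1 ;;
        esac
    done
    # Drop the leading space left by the first append.
    SETUP_ARGS="${SETUP_ARGS# }"
}
```

Collecting the pass-through flags in a plain string keeps the sketch portable, at the cost of not preserving values that contain spaces; a bash array would be the safer choice if such values were expected.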

Verification

After install.sh completes, verify with:

# Check cert-manager is running
kubectl -n cert-manager get pods

# Check LB Controller is running
kubectl -n kube-system get pods -l app.kubernetes.io/name=aws-load-balancer-controller

# Check all Slurm pods are Running
kubectl -n slurm get pods
kubectl -n slinky get pods
kubectl -n mariadb get pods

# Check FSx PVC is bound
kubectl get pvc fsx-claim -n slurm

# Check the login service has an NLB endpoint
kubectl get svc slurm-login-slinky -n slurm

# SSH into the login node
ssh -i ~/.ssh/id_ed25519_slurm root@<NLB_HOSTNAME>

# On the login node, verify Slurm
sinfo          # Shows nodes and partitions
squeue         # Shows job queue (should be empty)

See the validate-deployment skill for comprehensive post-deployment validation.
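The pod checks above can be reduced to a quick pass/fail signal by filtering on the STATUS column. The helper below is a hypothetical sketch, not part of the validate-deployment skill; it assumes the default `kubectl get pods --no-headers` column order (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin and
# print the names of pods whose STATUS is neither Running nor Completed.
pods_not_ready() {
    awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}
```

For example, `kubectl -n slurm get pods --no-headers | pods_not_ready` prints nothing when every pod in the namespace is Running or Completed.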

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| cert-manager webhook timeout | CRDs not installed or resource constraints | Check: `kubectl -n cert-manager get pods`; verify `--set crds.enabled=true` |
| LB Controller `unable to resolve at least one subnet` | Public subnets missing `kubernetes.io/role/elb=1` tag | Run the subnet tagging step in `install.sh` or tag manually |
| Pod Identity association fails | EKS Pod Identity addon not enabled | Enable via: `aws eks create-addon --cluster-name <name> --addon-name eks-pod-identity-agent` |
| MariaDB wait times out | EBS CSI driver not installed or `gp3` StorageClass missing | Check: `kubectl get sc gp3`; see HyperPod EBS CSI docs |
| Slurm operator install fails | OCI registry unreachable | Verify: `helm show chart oci://ghcr.io/slinkyproject/charts/slurm-operator --version 1.0.1` |
| `slurm-values.yaml not found` with `--skip-setup` | Values file not generated | Run `setup.sh` first, or remove `--skip-setup` |
| `EKS_CLUSTER_NAME not set` | `env_vars.sh` missing or not sourced | Run `deploy.sh` first, or manually create `env_vars.sh` |
| NLB hostname not available | NLB provisioning still in progress | Wait 2-3 minutes and check: `kubectl get svc slurm-login-slinky -n slurm` |
| SSH connection refused | NLB not fully provisioned or SSH key mismatch | Wait for NLB DNS to propagate; verify key: `ssh -i ~/.ssh/id_ed25519_slurm -v root@<hostname>` |
| Helm install says "already installed" | Idempotent check -- previous install exists | To reinstall: `helm uninstall slurm -n slurm` then re-run |
| CRD conflicts during upgrade | Stale CRDs from older Slinky version | Script handles this (deletes stale CRDs automatically) |
| `curl: checkip.amazonaws.com` fails | No internet access from the machine | Manually set IP and render the patch template |
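For the last case in the table, the patch can be rendered by hand with the same `sed` substitution that Phase J uses. The sketch below wraps it in a hypothetical function, `render_login_patch`, and adds an empty-value guard, since a failed `curl` would otherwise render a patch with a blank address. Whether the template expects a bare IP or already appends a CIDR suffix is not shown in this document.

```shell
# Hypothetical helper: render the NLB patch template with a caller-supplied IP.
# Arguments: IP address, template path (optional), output path (optional).
render_login_patch() {
    ip_address="$1"
    template="${2:-slurm-login-service-patch.yaml.template}"
    output="${3:-slurm-login-service-patch.yaml}"
    if [ -z "$ip_address" ]; then
        echo "Error: no IP address given" >&2
        return 1
    fi
    # Same substitution as Phase J: replace the literal ${ip_address} placeholder.
    sed "s|\${ip_address}|${ip_address}|g" "$template" > "$output"
}
```

After rendering, apply the result with the `kubectl patch service slurm-login-slinky -n slurm --patch-file slurm-login-service-patch.yaml` command from Phase J.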

References

install.sh -- Main cluster installation script
destroy.sh -- Reverse teardown (LB Controller + Pod Identity + IAM, cert-manager, FSx PVC, MariaDB, Slurm)
setup.sh -- Called internally for image build and values generation
mariadb.yaml -- MariaDB custom resource manifest (applied in the slurm namespace)
lustre-pvc-slurm.yaml.template -- FSx Lustre PV/PVC template
slurm-values.yaml.template -- Helm values template
slurm-login-service-patch.yaml.template -- NLB service patch template
lib/deploy_helpers.sh -- resolve_helm_profile() (used by setup.sh)
HyperPod EBS CSI docs
