deploy-slurm-cluster
// Install cert-manager, AWS LB Controller (Pod Identity), MariaDB, Slurm operator, and Slurm cluster on HyperPod EKS using install.sh, including subnet tagging and NLB configuration for SSH access
| name | deploy-slurm-cluster |
| description | Install cert-manager, AWS LB Controller (Pod Identity), MariaDB, Slurm operator, and Slurm cluster on HyperPod EKS using install.sh, including subnet tagging and NLB configuration for SSH access |
Use this skill to deploy the Slurm cluster on a HyperPod EKS cluster. This is Phase 3 of the slinky-slurm deployment workflow and represents the final step to get a running Slurm cluster with SSH access.
The install.sh script orchestrates the steps described below, beginning with setup.sh (unless --skip-setup is passed).

Prerequisites:
- HyperPod EKS cluster deployed (via deploy.sh)
- env_vars.sh (generated by deploy.sh; provides EKS_CLUSTER_NAME and VPC_ID, required for LB Controller and subnet tagging)
- EBS CSI driver with the gp3 StorageClass (required for MariaDB and Slurm persistent volumes). See the HyperPod EBS CSI docs.
- helm command available
- Image build and values generation completed by setup.sh (or pass flags to let install.sh run setup.sh automatically)

See the deployment-preflight skill for detailed prerequisite validation.
Full install (runs setup.sh internally):
# Build image via CodeBuild + install cluster (g5 via CloudFormation)
bash install.sh --instance-type ml.g5.8xlarge --infra cfn
# Build image locally + install cluster (p5 via Terraform)
bash install.sh --instance-type ml.p5.48xlarge --instance-count 2 --infra tf --local-build
# Skip image build (image already in ECR) + install cluster
bash install.sh --instance-type ml.g5.8xlarge --infra cfn --skip-build
Skip setup (slurm-values.yaml already exists):
# Use previously generated slurm-values.yaml
bash install.sh --skip-setup
WARNING: --skip-setup requires that slurm-values.yaml already exists
in the project directory. The script will fail if it's missing.
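A minimal sketch of a pre-check for the warning above, assuming the values file sits directly in the project directory. The function name `check_values_file` is hypothetical and is not taken from install.sh.

```shell
# Hypothetical helper: confirm slurm-values.yaml exists before using --skip-setup.
# The first argument is the project directory (defaults to the current one).
check_values_file() {
    project_dir="${1:-.}"
    if [ ! -f "${project_dir}/slurm-values.yaml" ]; then
        echo "ERROR: ${project_dir}/slurm-values.yaml not found; run setup.sh first or drop --skip-setup" >&2
        return 1
    fi
    echo "Found ${project_dir}/slurm-values.yaml"
}

# Example usage:
#   check_values_file . && bash install.sh --skip-setup
```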
install.sh will, among the other steps described below, tag the VPC's public subnets with kubernetes.io/role/elb=1.

Total time: approximately 5-10 minutes after setup.sh completes.
# The SSH command is printed at the end of install.sh
ssh -i ~/.ssh/id_ed25519_slurm root@<NLB_HOSTNAME>
Calls setup.sh with pass-through flags:
bash setup.sh ${SETUP_ARGS}
All --instance-type, --instance-count, --infra, --repo-name, --tag, --local-build,
--skip-build, and --region flags are passed through to setup.sh.
If --skip-setup is specified, verifies slurm-values.yaml exists.
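The flag pass-through described above could be sketched roughly as follows. This is an illustration only, not the actual parsing code in install.sh: the function name `parse_install_args` and the `SKIP_SETUP` variable are assumptions, `--help` handling is omitted, and only `SETUP_ARGS` appears in the original text.

```shell
# Hypothetical sketch: separate install.sh's own options from the flags
# that are forwarded to setup.sh via SETUP_ARGS.
parse_install_args() {
    SETUP_ARGS=""
    SKIP_SETUP=false
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --skip-setup)
                SKIP_SETUP=true
                shift
                ;;
            --instance-type|--instance-count|--infra|--repo-name|--tag|--region)
                # Flags that take a value: forward the flag together with its value.
                if [ "$#" -lt 2 ]; then
                    echo "ERROR: $1 requires a value" >&2
                    return 1
                fi
                SETUP_ARGS="${SETUP_ARGS} $1 $2"
                shift 2
                ;;
            --local-build|--skip-build)
                # Boolean flags: forward as-is.
                SETUP_ARGS="${SETUP_ARGS} $1"
                shift
                ;;
            *)
                echo "ERROR: unknown option: $1" >&2
                return 1
                ;;
        esac
    done
    # Trim the leading space left by the concatenation above.
    SETUP_ARGS="${SETUP_ARGS# }"
}

# Example usage:
#   parse_install_args "$@" || exit 1
#   [ "${SKIP_SETUP}" = "true" ] || bash setup.sh ${SETUP_ARGS}
```

Because `SETUP_ARGS` is expanded unquoted, values containing spaces would be split; that is acceptable here since instance types, tags, and repository names do not contain whitespace.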
Sources env_vars.sh (generated by deploy.sh) to get EKS_CLUSTER_NAME
and VPC_ID. Both are required — the script exits with an error if
either is missing.
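One way to express this check is sketched below. The function name `load_env_vars` is hypothetical; only the file name env_vars.sh and the two variable names come from the text above.

```shell
# Hypothetical helper: source env_vars.sh and fail if a required variable is empty.
load_env_vars() {
    env_file="${1:-env_vars.sh}"
    # A bare file name would be looked up on PATH by ".", so anchor it to the cwd.
    case "${env_file}" in
        */*) ;;
        *) env_file="./${env_file}" ;;
    esac
    if [ ! -f "${env_file}" ]; then
        echo "ERROR: ${env_file} not found; run deploy.sh first" >&2
        return 1
    fi
    . "${env_file}"
    for var_name in EKS_CLUSTER_NAME VPC_ID; do
        # Read the variable whose name is stored in var_name.
        eval "var_value=\${${var_name}:-}"
        if [ -z "${var_value}" ]; then
            echo "ERROR: ${var_name} is not set in ${env_file}" >&2
            return 1
        fi
    done
}

# Example usage:
#   load_env_vars env_vars.sh || exit 1
```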
Installs the cert-manager chart (v1.19.2) in the cert-manager namespace:
# What the script runs:
helm install cert-manager jetstack/cert-manager \
--version=1.19.2 --namespace=cert-manager --create-namespace \
--set crds.enabled=true --wait
kubectl wait --for=condition=Available \
deployment/cert-manager-webhook -n cert-manager --timeout=120s
The webhook wait prevents race conditions with components that create Certificate resources at install time (e.g., the Slurm operator).
Tags all public subnets in the VPC with kubernetes.io/role/elb=1. The AWS
LB Controller requires this tag to discover subnets for internet-facing
NLBs. The HyperPod CFN template does not add this tag.
# What the script runs:
PUBLIC_SUBNET_IDS=$(aws ec2 describe-subnets \
--filters "Name=vpc-id,Values=${VPC_ID}" \
"Name=map-public-ip-on-launch,Values=true" \
--query "Subnets[].SubnetId" --output text)
for SUBNET_ID in ${PUBLIC_SUBNET_IDS}; do
aws ec2 create-tags --resources "${SUBNET_ID}" \
--tags "Key=kubernetes.io/role/elb,Value=1"
done
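To confirm the tagging took effect, a query along these lines can help. It is a sketch rather than part of install.sh, and the function names `list_elb_tagged_subnets` and `verify_elb_tags` are hypothetical; the `tag:<key>` filter is standard `aws ec2 describe-subnets` syntax.

```shell
# Hypothetical helper: list subnets in a VPC that carry kubernetes.io/role/elb=1.
list_elb_tagged_subnets() {
    vpc_id="$1"
    aws ec2 describe-subnets \
        --filters "Name=vpc-id,Values=${vpc_id}" \
                  "Name=tag:kubernetes.io/role/elb,Values=1" \
        --query "Subnets[].SubnetId" --output text
}

# Hypothetical helper: fail if no subnet in the VPC has the tag.
verify_elb_tags() {
    tagged="$(list_elb_tagged_subnets "$1")" || return 1
    if [ -z "${tagged}" ]; then
        echo "No subnets in $1 carry kubernetes.io/role/elb=1" >&2
        return 1
    fi
    echo "Tagged subnets: ${tagged}"
}

# Example usage:
#   verify_elb_tags "${VPC_ID}"
```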
Uses Pod Identity (not IRSA/eksctl) for IAM:
- IAM policy AWSLoadBalancerControllerIAMPolicy_slinky (from the upstream v2.11.0 policy JSON)
- IAM role AmazonEKS_LB_Controller_Role_slinky with a pods.eks.amazonaws.com trust policy
- Pod Identity association for the aws-load-balancer-controller service account in kube-system
- aws-load-balancer-controller Helm chart (v1.11.0)
# What the script runs (simplified):
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
--version=1.11.0 --namespace=kube-system \
--set clusterName="${EKS_CLUSTER_NAME}" \
--set region="${AWS_REGION}" \
--set vpcId="${VPC_ID}" --wait
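The simplified command above leaves out the IAM side of Pod Identity. The sketch below shows what that part might look like. It is not copied from install.sh: the function names and the `account_id` argument are assumptions, the `arn:aws` partition is hard-coded for brevity, and the role and policy names are the ones mentioned in this section. The trust policy is the standard EKS Pod Identity form.

```shell
# Print the trust policy that lets the EKS Pod Identity service assume the role.
pod_identity_trust_policy() {
    cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "pods.eks.amazonaws.com" },
      "Action": ["sts:AssumeRole", "sts:TagSession"]
    }
  ]
}
EOF
}

# Hypothetical helper: create the role, attach the policy, and bind the role
# to the controller's service account. Arguments: cluster name, AWS account ID.
associate_lb_controller_role() {
    cluster_name="$1"
    account_id="$2"
    role_name="AmazonEKS_LB_Controller_Role_slinky"

    aws iam create-role --role-name "${role_name}" \
        --assume-role-policy-document "$(pod_identity_trust_policy)" || return 1
    aws iam attach-role-policy --role-name "${role_name}" \
        --policy-arn "arn:aws:iam::${account_id}:policy/AWSLoadBalancerControllerIAMPolicy_slinky" || return 1
    aws eks create-pod-identity-association \
        --cluster-name "${cluster_name}" \
        --namespace kube-system \
        --service-account aws-load-balancer-controller \
        --role-arn "arn:aws:iam::${account_id}:role/${role_name}"
}

# Example usage:
#   associate_lb_controller_role "${EKS_CLUSTER_NAME}" "$(aws sts get-caller-identity --query Account --output text)"
```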
Applies the pre-generated lustre-pvc-slurm.yaml (rendered from
lustre-pvc-slurm.yaml.template by setup.sh) which creates a static
PersistentVolume and PersistentVolumeClaim pointing to the FSx filesystem
provisioned by the HyperPod CFN stack.
# What the script runs:
kubectl create namespace slurm
kubectl apply -f lustre-pvc-slurm.yaml
Installs the mariadb-operator chart (v25.10.4) in the mariadb namespace, then applies mariadb.yaml (the MariaDB custom resource in the slurm namespace):
# What the script runs:
helm install mariadb-operator mariadb-operator/mariadb-operator \
--version=25.10.4 --namespace=mariadb --create-namespace \
--set crds.enabled=true --wait
kubectl apply -f mariadb.yaml
kubectl wait --for=condition=Ready mariadb/mariadb -n slurm --timeout=300s
Installs the slurm-operator OCI chart (v1.0.1) in the slinky namespace:
# What the script runs:
helm install slurm-operator \
oci://ghcr.io/slinkyproject/charts/slurm-operator \
--version=1.0.1 --namespace=slinky --create-namespace \
--set crds.enabled=true --wait
Installs the slurm OCI chart (v1.0.1) with the generated values file:
# What the script runs:
helm install slurm \
oci://ghcr.io/slinkyproject/charts/slurm \
--values=slurm-values.yaml \
--version=1.0.1 --namespace=slurm
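The troubleshooting table mentions an idempotency check for existing releases. How install.sh implements it is not shown here; one possible approach, sketched under that assumption, relies on `helm status` exiting non-zero when a release does not exist. The function names `release_exists` and `install_slurm_chart` are hypothetical.

```shell
# Hypothetical helper: succeed only if the Helm release exists in the namespace.
release_exists() {
    helm status "$1" --namespace "$2" >/dev/null 2>&1
}

# Hypothetical wrapper: skip the install when the slurm release is already present.
install_slurm_chart() {
    if release_exists slurm slurm; then
        echo "Release slurm already installed in namespace slurm; skipping"
        return 0
    fi
    helm install slurm \
        oci://ghcr.io/slinkyproject/charts/slurm \
        --values=slurm-values.yaml \
        --version=1.0.1 --namespace=slurm
}

# Example usage:
#   install_slurm_chart
```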
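Configures the NLB for SSH access:
- Waits for the slurm-login-slinky service to exist
- Looks up the public IP of the machine running the script via curl https://checkip.amazonaws.com
- Renders slurm-login-service-patch.yaml from the template (substitutes ${ip_address} for NLB source range restriction)
# What the script runs: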
IP_ADDRESS=$(curl -s https://checkip.amazonaws.com)
sed "s|\${ip_address}|${IP_ADDRESS}|g" \
slurm-login-service-patch.yaml.template \
> slurm-login-service-patch.yaml
kubectl patch service slurm-login-slinky -n slurm \
--patch-file slurm-login-service-patch.yaml
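When checkip.amazonaws.com is unreachable, the template can be rendered with a manually supplied address. The sketch below reuses the same sed substitution as the script. The function name `render_login_patch` and the basic IPv4 sanity check are additions for illustration, not part of install.sh.

```shell
# Hypothetical helper: render the service patch from the template using a given IP.
# Arguments: IP address, template path (optional), output path (optional).
render_login_patch() {
    ip_address="$1"
    template="${2:-slurm-login-service-patch.yaml.template}"
    output="${3:-slurm-login-service-patch.yaml}"

    # Reject empty input or anything containing characters other than digits and dots,
    # so an error page or blank response never ends up in the patch file.
    case "${ip_address}" in
        ""|*[!0-9.]*)
            echo "ERROR: '${ip_address}' does not look like an IPv4 address" >&2
            return 1
            ;;
    esac
    sed "s|\${ip_address}|${ip_address}|g" "${template}" > "${output}"
}

# Example usage (203.0.113.7 is a documentation-only address):
#   render_login_patch 203.0.113.7 && \
#   kubectl patch service slurm-login-slinky -n slurm \
#       --patch-file slurm-login-service-patch.yaml
```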
Usage: install.sh [OPTIONS]
Optional:
--skip-setup Use previously generated slurm-values.yaml
--region <region> AWS region (default: AWS CLI configured or us-west-2)
--help Show help
Options passed through to setup.sh:
--instance-type <type> SageMaker instance type (e.g., ml.g5.8xlarge, ml.p5.48xlarge)
--instance-count <N> Number of compute node instances (default: 4)
--infra <cfn|tf> Infrastructure method for CodeBuild stack
--repo-name <name> ECR repository name
--tag <tag> Image tag
--local-build Build image locally instead of CodeBuild
--skip-build Skip image build (use existing image in ECR)
After install.sh completes, verify with:
# Check cert-manager is running
kubectl -n cert-manager get pods
# Check LB Controller is running
kubectl -n kube-system get pods -l app.kubernetes.io/name=aws-load-balancer-controller
# Check all Slurm pods are Running
kubectl -n slurm get pods
kubectl -n slinky get pods
kubectl -n mariadb get pods
# Check FSx PVC is bound
kubectl get pvc fsx-claim -n slurm
# Check the login service has an NLB endpoint
kubectl get svc slurm-login-slinky -n slurm
# SSH into the login node
ssh -i ~/.ssh/id_ed25519_slurm root@<NLB_HOSTNAME>
# On the login node, verify Slurm
sinfo # Shows nodes and partitions
squeue # Shows job queue (should be empty)
See the validate-deployment skill for comprehensive post-deployment
validation.
| Symptom | Cause | Fix |
|---|---|---|
| cert-manager webhook timeout | CRDs not installed or resource constraints | Check: kubectl -n cert-manager get pods; verify --set crds.enabled=true |
| LB Controller unable to resolve at least one subnet | Public subnets missing kubernetes.io/role/elb=1 tag | Run the subnet tagging step in install.sh or tag manually |
| Pod Identity association fails | EKS Pod Identity addon not enabled | Enable via: aws eks create-addon --cluster-name <name> --addon-name eks-pod-identity-agent |
| MariaDB wait times out | EBS CSI driver not installed or gp3 StorageClass missing | Check: kubectl get sc gp3; see HyperPod EBS CSI docs |
| Slurm operator install fails | OCI registry unreachable | Verify: helm show chart oci://ghcr.io/slinkyproject/charts/slurm-operator --version 1.0.1 |
| slurm-values.yaml not found with --skip-setup | Values file not generated | Run setup.sh first, or remove --skip-setup |
| EKS_CLUSTER_NAME not set | env_vars.sh missing or not sourced | Run deploy.sh first, or manually create env_vars.sh |
| NLB hostname not available | NLB provisioning still in progress | Wait 2-3 minutes and check: kubectl get svc slurm-login-slinky -n slurm |
| SSH connection refused | NLB not fully provisioned or SSH key mismatch | Wait for NLB DNS to propagate; verify key: ssh -i ~/.ssh/id_ed25519_slurm -v root@<hostname> |
| Helm install says "already installed" | Idempotent check -- previous install exists | To reinstall: helm uninstall slurm -n slurm then re-run |
| CRD conflicts during upgrade | Stale CRDs from older Slinky version | Script handles this (deletes stale CRDs automatically) |
| curl: checkip.amazonaws.com fails | No internet access from the machine | Manually set IP and render the patch template |
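For the "NLB hostname not available" row, polling the service is usually more reliable than a fixed wait. The following is a sketch, not part of install.sh: the function name `wait_for_nlb_hostname` and the default of 18 attempts at 10-second intervals (about 3 minutes) are assumptions chosen to match the guidance in the table.

```shell
# Hypothetical helper: poll until the login service reports an NLB hostname.
# Arguments: number of attempts (default 18), delay in seconds (default 10).
wait_for_nlb_hostname() {
    attempts="${1:-18}"
    delay="${2:-10}"
    i=1
    while [ "${i}" -le "${attempts}" ]; do
        nlb_host="$(kubectl get svc slurm-login-slinky -n slurm \
            -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' 2>/dev/null)"
        if [ -n "${nlb_host}" ]; then
            echo "${nlb_host}"
            return 0
        fi
        sleep "${delay}"
        i=$((i + 1))
    done
    echo "ERROR: NLB hostname not assigned after ${attempts} attempts" >&2
    return 1
}

# Example usage:
#   NLB_HOSTNAME="$(wait_for_nlb_hostname)" && \
#   ssh -i ~/.ssh/id_ed25519_slurm "root@${NLB_HOSTNAME}"
```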
- install.sh -- Main cluster installation script
- destroy.sh -- Reverse teardown (LB Controller + Pod Identity + IAM, cert-manager, FSx PVC, MariaDB, Slurm)
- setup.sh -- Called internally for image build and values generation
- mariadb.yaml -- MariaDB custom resource definition
- lustre-pvc-slurm.yaml.template -- FSx Lustre PV/PVC template
- slurm-values.yaml.template -- Helm values template
- slurm-login-service-patch.yaml.template -- NLB service patch template
- lib/deploy_helpers.sh -- resolve_helm_profile() (used by setup.sh)