deployment-preflight
Validate all prerequisites (CLI tools, AWS credentials, environment variables, kubectl context) before running slinky-slurm deployment scripts
| name | deployment-preflight |
| description | Validate all prerequisites (CLI tools, AWS credentials, environment variables, kubectl context) before running slinky-slurm deployment scripts |
Use this skill before running any slinky-slurm deployment script (deploy.sh,
setup.sh, install.sh). It validates that all prerequisites are met and
helps diagnose common configuration issues that would cause deployment failures.
The slinky-slurm project deploys Slurm on Amazon SageMaker HyperPod EKS via the Slinky Project (SchedMD). The deployment workflow has three phases:
1. Phase 1 (deploy.sh) -- provisions HyperPod EKS via CloudFormation (CFN) or Terraform (TF)
2. Phase 2 (setup.sh) -- builds the Slurmd container image and generates Helm values
3. Phase 3 (install.sh) -- deploys MariaDB, the Slurm operator, and the Slurm cluster via Helm

Each phase has different prerequisites. This skill covers all of them.
Phase 1 (deploy.sh)

Required CLI tools:
# Always required
command -v aws # AWS CLI v2
command -v jq # Required for --infra cfn
command -v terraform # Required for --infra tf
Required state:
aws sts get-caller-identity --region <region>
A valid AZ ID for the target region (e.g. usw2-az2 for us-west-2)

Environment variables: None required -- deploy.sh accepts all config via command-line flags.
Phase 2 (setup.sh)

Required CLI tools:
# Always required
command -v aws
command -v sed
# Required for --local-build
command -v docker
# Required for CodeBuild path (default) with --infra cfn
command -v jq
command -v zip
# Required for CodeBuild path with --infra tf
command -v terraform
command -v zip
Required state:
--skip-build: ECR image must already exist:
aws ecr describe-images \
--repository-name dlc-slurmd \
--image-ids imageTag=25.11.1-ubuntu24.04 \
--region <region>
--local-build: Docker daemon running, access to DLC ECR registry (763104351884.dkr.ecr.us-east-1.amazonaws.com)
env_vars.sh should exist if available (sourced for AWS_ACCOUNT_ID, AWS_REGION)

Phase 3 (install.sh)

Required CLI tools:
command -v aws
command -v kubectl
command -v helm
command -v curl # Used for public IP detection (checkip.amazonaws.com)
command -v sed
Required state:
kubectl context points to the correct HyperPod EKS cluster:
# Set context after deploy.sh completes
source env_vars.sh
aws eks update-kubeconfig --name "$EKS_CLUSTER_NAME" --region "$AWS_REGION"
# Verify
kubectl cluster-info
kubectl get nodes
slurm-values.yaml exists (generated by setup.sh); this is required only when running install.sh with --skip-setup

Run these checks before starting a deployment:
for tool in aws kubectl helm jq docker curl sed; do
if command -v "$tool" &>/dev/null; then
echo " $tool: OK ($(command -v "$tool"))"
else
echo " $tool: MISSING"
fi
done
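The loop above reports each tool but never fails, and it checks one fixed list regardless of phase. A minimal sketch of a stricter variant follows. The names missing_tools and PHASE1_TOOLS through PHASE3_TOOLS are hypothetical and introduced here for illustration; this does not reproduce the repository's own check_command() in lib/deploy_helpers.sh.

```shell
# Hypothetical helper: print each tool that is not on PATH and return
# non-zero if any are missing, so a caller can abort before deploying.
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# Always-required tools per phase, taken from the phase sections of this
# document. Add jq, zip, terraform, or docker according to the flags you pass.
PHASE1_TOOLS="aws"
PHASE2_TOOLS="aws sed"
PHASE3_TOOLS="aws kubectl helm curl sed"

# Usage: abort early when a Phase 3 tool is missing.
#   missing_tools $PHASE3_TOOLS || { echo "install the tools listed above" >&2; exit 1; }
```

Returning a non-zero status lets the check gate a deployment script instead of only printing a report.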
aws sts get-caller-identity
# Expected: JSON with Account, Arn, UserId
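To turn the credentials check into a pass/fail step, one option is to wrap it in a small function. This is a sketch under assumptions: check_credentials is a hypothetical name, and the suggested fixes in the error message are the ones listed in the troubleshooting table of this document.

```shell
# Hypothetical helper: verify credentials and print the resolved account ID.
# Relies only on `aws sts get-caller-identity` plus the global
# --query and --output options of the AWS CLI.
check_credentials() {
  local account
  # Declare first, assign second, so the exit status of aws is not masked.
  if ! account=$(aws sts get-caller-identity --query Account --output text 2>/dev/null); then
    echo "credentials: INVALID (run 'aws configure' or 'aws sso login')" >&2
    return 1
  fi
  echo "credentials: OK (account ${account})"
}

# Usage:
#   check_credentials || exit 1
```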
AWS_REGION="us-west-2" # or your target region
# List available AZ IDs
aws ec2 describe-availability-zones \
--region "${AWS_REGION}" \
--filters "Name=opt-in-status,Values=opt-in-not-required" \
--query "AvailabilityZones[?ZoneType=='availability-zone'].ZoneId | sort(@) | [:5]" \
--output text
# Expected: tab-separated AZ IDs like "usw2-az1 usw2-az2 usw2-az3 usw2-az4"
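Once the list of AZ IDs is available, checking a chosen ID against it is a plain string comparison. The sketch below assumes the list is the whitespace-separated text produced by --output text; az_id_in_list is a hypothetical name and is not the repository's validate_az_id() function.

```shell
# Hypothetical helper (not the repository's validate_az_id): succeeds when
# the AZ ID in $1 appears in the whitespace-separated list in $2.
az_id_in_list() {
  local wanted="$1" az
  for az in $2; do   # unquoted on purpose: split the list on spaces and tabs
    [ "$az" = "$wanted" ] && return 0
  done
  return 1
}

# Usage (requires AWS credentials):
#   AZ_IDS=$(aws ec2 describe-availability-zones --region "$AWS_REGION" \
#     --query "AvailabilityZones[].ZoneId" --output text)
#   az_id_in_list "usw2-az2" "$AZ_IDS" || echo "usw2-az2 is not valid in $AWS_REGION"
```

Comparing whole words rather than using a substring match avoids accepting a truncated ID such as usw2-az.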
# Should point to the HyperPod EKS cluster
kubectl cluster-info
kubectl get nodes
# Expected: Nodes with STATUS Ready, instance types ml.m5.4xlarge, ml.g5.8xlarge or ml.p5.48xlarge
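Both kubectl checks can be made machine-checkable instead of inspected by eye. The sketch below makes two assumptions: the context was created by aws eks update-kubeconfig without an alias, so its name is the cluster ARN, and node status is read from kubectl get nodes --no-headers. The function names context_matches_cluster and not_ready_nodes are hypothetical.

```shell
# Hypothetical helper: succeeds when a kubectl context name refers to the
# given EKS cluster. Assumes the default naming from
# `aws eks update-kubeconfig` (the cluster ARN, ending in "cluster/<name>").
context_matches_cluster() {
  local context="$1" cluster="$2"
  case "$context" in
    arn:aws*:eks:*:cluster/"$cluster") return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the name of every node whose STATUS column is not Ready
# (Ready,SchedulingDisabled still counts as Ready here).
# Returns non-zero when no nodes are listed or any node is not Ready.
not_ready_nodes() {
  awk '
    { total++ }
    $2 != "Ready" && $2 !~ /^Ready,/ { print $1; bad++ }
    END { if (total == 0 || bad > 0) exit 1 }
  '
}

# Usage (requires kubectl and a sourced env_vars.sh):
#   context_matches_cluster "$(kubectl config current-context)" "$EKS_CLUSTER_NAME" \
#     || echo "kubectl context does not point at $EKS_CLUSTER_NAME"
#   kubectl get nodes --no-headers | not_ready_nodes \
#     || echo "cluster has no nodes, or some nodes are not Ready yet"
```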
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws ecr describe-images \
--repository-name dlc-slurmd \
--image-ids imageTag=25.11.1-ubuntu24.04 \
--region "${AWS_REGION}"
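The describe-images call exits non-zero when the repository or tag is absent, so it can serve as a yes/no test. A minimal sketch, with ecr_image_exists as a hypothetical name and the repository and tag values taken from the command above:

```shell
# Hypothetical helper: succeed only when the tagged image exists in ECR.
# Output is discarded; only the exit status of the AWS CLI is used.
ecr_image_exists() {
  local repo="$1" tag="$2" region="$3"
  aws ecr describe-images \
    --repository-name "$repo" \
    --image-ids imageTag="$tag" \
    --region "$region" >/dev/null 2>&1
}

# Usage (requires AWS credentials):
#   ecr_image_exists dlc-slurmd 25.11.1-ubuntu24.04 "$AWS_REGION" \
#     || echo "image not found: --skip-build cannot be used yet"
```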
# MariaDB operator chart
helm repo add mariadb-operator \
https://mariadb-operator.github.io/mariadb-operator
helm repo update
# OCI charts (Slinky) -- these are verified during install
helm show chart oci://ghcr.io/slinkyproject/charts/slurm-operator --version 1.0.1
All preflight checks pass when:

- aws sts get-caller-identity returns valid JSON
- kubectl get nodes shows Ready nodes (Phase 3)
- The ECR image exists (when using --skip-build)

| Symptom | Cause | Fix |
|---|---|---|
| aws sts get-caller-identity fails | Expired or missing credentials | Run aws configure or refresh SSO: aws sso login |
| AZ ID not in list | Wrong region or non-standard AZ | Use --az-id with a valid ID from the resolved list |
| kubectl get nodes shows no nodes | Wrong kubectl context | Run aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $AWS_REGION |
| kubectl get nodes shows NotReady | HyperPod nodes still provisioning | Wait 5-10 minutes after deploy.sh completes |
| Docker login fails for DLC ECR | Region mismatch or expired token | DLC registry is always in us-east-1: aws ecr get-login-password --region us-east-1 |
| helm repo add fails | Network/proxy issue | Check internet connectivity and proxy settings |
| jq: command not found | jq not installed | brew install jq (macOS) or apt install jq (Linux) |
| terraform: command not found | terraform not installed | Install from https://developer.hashicorp.com/terraform/install |
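The Docker login fix above names only the token command; the token still has to be piped into docker login. A sketch of the full step follows, assuming the registry host given in the --local-build prerequisite. The names dlc_ecr_login, DLC_REGION, and DLC_REGISTRY are hypothetical.

```shell
# Sketch of a DLC registry login. The registry host and region come from the
# --local-build prerequisite in this document; the names are hypothetical.
DLC_REGION="us-east-1"
DLC_REGISTRY="763104351884.dkr.ecr.${DLC_REGION}.amazonaws.com"

dlc_ecr_login() {
  # Request the token from the registry's own region (us-east-1), not from
  # the region where the cluster is deployed, then hand it to docker on stdin.
  aws ecr get-login-password --region "$DLC_REGION" \
    | docker login --username AWS --password-stdin "$DLC_REGISTRY"
}

# Usage (requires AWS credentials and a running Docker daemon):
#   dlc_ecr_login || echo "DLC ECR login failed"
```

Passing the token on stdin keeps it out of the process list and shell history.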
- deploy.sh -- Infrastructure deployment script (lines 124-146: prerequisite checks)
- setup.sh -- Image build and values generation (lines 114-127: argument validation)
- install.sh -- Cluster installation (lines 61-89: argument parsing)
- lib/deploy_helpers.sh -- check_command() function (lines 79-85)
- lib/deploy_helpers.sh -- validate_az_id() function (lines 94-103)