deploy-infrastructure
| name | deploy-infrastructure |
| description | Deploy HyperPod EKS infrastructure using deploy.sh via CloudFormation or Terraform, including AZ resolution, stack outputs, and kubeconfig setup |
Use this skill to deploy the HyperPod EKS infrastructure that hosts the Slurm
cluster. This is Phase 1 of the slinky-slurm deployment workflow and must
complete before running setup.sh or install.sh.
The deploy.sh script supports two infrastructure backends:
- CloudFormation (--infra cfn) -- uses an AWS-hosted S3 template
- Terraform (--infra tf) -- uses local Terraform modules

Deployment takes approximately 20-30 minutes and provisions:
Before running deploy.sh, verify:
- [...] (for --infra cfn) or terraform installed (for --infra tf)
- [...] (hp-eks-slinky-stack)

See the deployment-preflight skill for detailed prerequisite validation.
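
A tool check along these lines can run before deploy.sh. This is a minimal sketch, not the script's own check_command helper: the function names require_tools and tools_for_backend are hypothetical, and the tool list per backend is an assumption, since the full prerequisite list is not reproduced here.

```shell
# Hypothetical helper: report every missing tool instead of stopping at the first.
require_tools() {
  local missing tool
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing required tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Assumed tool sets per backend; adjust to match the real prerequisites.
tools_for_backend() {
  case "$1" in
    cfn) echo "aws kubectl" ;;
    tf)  echo "aws kubectl terraform" ;;
    *)   echo "unknown backend: $1" >&2; return 2 ;;
  esac
}
```

For example, `require_tools $(tools_for_backend tf) || exit 1` stops early and names every tool that is not on the PATH.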
Any SageMaker-supported GPU instance type can be used. GPU count, model, and EFA interfaces are auto-discovered via the EC2 API. Two common configurations:
| Instance Type | Count | GPUs | EFA |
|---|---|---|---|
| ml.g5.8xlarge (default) | 4 | 1 A10G per node | 1 per node |
| ml.p5.48xlarge | 2 | 8 H100 per node | 32 per node |
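
To preview what the EC2 API reports for an instance type before deploying, a query like the following may help. It is a sketch under assumptions: how deploy.sh performs its own discovery is not shown here, the function names are hypothetical, and it assumes the EC2 type is the SageMaker type without the ml. prefix.

```shell
# Strip the SageMaker "ml." prefix to get the EC2 instance type name.
ec2_type_from_ml() {
  echo "${1#ml.}"
}

# Query GPU model, GPU count, and EFA interface count.
# Requires the AWS CLI and valid credentials; region defaults to us-west-2.
describe_accelerators() {
  local ec2_type
  ec2_type=$(ec2_type_from_ml "$1")
  aws ec2 describe-instance-types \
    --instance-types "$ec2_type" \
    --region "${2:-us-west-2}" \
    --query 'InstanceTypes[0].{Gpu:GpuInfo.Gpus[0].Name,GpuCount:GpuInfo.Gpus[0].Count,Efa:NetworkInfo.EfaInfo.MaximumEfaInterfaces}' \
    --output table
}
```

Calling `describe_accelerators ml.p5.48xlarge us-east-1` prints one row that can be compared against the table above.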
CloudFormation path (recommended for first-time users):
# Deploy g5 in us-west-2 (defaults)
bash deploy.sh --instance-type ml.g5.8xlarge --infra cfn
# Deploy p5 in us-east-1 with a specific AZ
bash deploy.sh --instance-type ml.p5.48xlarge --instance-count 2 --infra cfn --region us-east-1 --az-id use1-az2
# Deploy with custom stack name
bash deploy.sh --instance-type ml.g5.8xlarge --infra cfn --stack-name my-slinky-stack
Terraform path:
# Deploy g5 using Terraform
bash deploy.sh --instance-type ml.g5.8xlarge --infra tf
# Deploy p5 in us-east-1
bash deploy.sh --instance-type ml.p5.48xlarge --instance-count 2 --infra tf --region us-east-1 --az-id use1-az2
The Terraform path will show a plan and prompt for confirmation before applying.
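
The plan-then-confirm flow can be approximated as follows. This is an illustrative sketch rather than the code in deploy.sh: confirm and plan_and_apply are hypothetical names, and the plan file name tfplan is an assumption.

```shell
# Hypothetical helper: return 0 only when the user answers "y" or "yes".
confirm() {
  local answer
  printf '%s [y/N] ' "$1" >&2
  read -r answer || return 1
  case "$answer" in
    y|Y|yes|YES) return 0 ;;
    *) return 1 ;;
  esac
}

# Save the plan to a file, then apply exactly that plan after confirmation.
# Run from the directory that holds the Terraform configuration and custom.tfvars.
plan_and_apply() {
  terraform plan -var-file=custom.tfvars -out=tfplan || return 1
  if confirm "Apply this plan?"; then
    terraform apply tfplan
  else
    echo "Aborted; nothing was applied." >&2
    return 1
  fi
}
```

Applying the saved plan file, rather than re-planning, means the changes that run are the ones that were reviewed.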
After deploy.sh completes, it writes env_vars.sh containing stack
outputs:
source env_vars.sh
This sets:
- AWS_ACCOUNT_ID
- AWS_REGION
- EKS_CLUSTER_NAME
- VPC_ID
- PRIVATE_SUBNET_ID
- SECURITY_GROUP_ID
- STACK_ID

aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $AWS_REGION
kubectl cluster-info
kubectl get nodes
Expected output: nodes with instance types matching the chosen profile
(ml.m5.4xlarge for management, ml.g5.8xlarge or ml.p5.48xlarge for
accelerated).
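
One way to check node readiness and instance types in a script is sketched below. The function names are hypothetical, and the sketch assumes the nodes carry the standard node.kubernetes.io/instance-type label.

```shell
# Count nodes whose STATUS column is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# A cordoned node ("Ready,SchedulingDisabled") is counted as not ready.
count_not_ready() {
  awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Show each node with its instance type label, then fail if any node is not Ready.
check_nodes() {
  local not_ready
  kubectl get nodes -L node.kubernetes.io/instance-type
  not_ready=$(kubectl get nodes --no-headers | count_not_ready)
  [ "$not_ready" -eq 0 ]
}
```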
deploy.sh performs the following steps.

Common steps:
- Sources lib/deploy_helpers.sh
- Calls resolve_instance_profile() to set instance type/count for the profile
- Runs aws sts get-caller-identity
- Runs aws ec2 describe-availability-zones
- Validates --az-id against the resolved list

CloudFormation path:
- Reads params.json (40 CloudFormation parameters, g5 defaults)
- Calls resolve_cfn_params() to substitute AZ IDs and instance overrides
- Calls validate_cfn_template() to:
  - Run aws cloudformation validate-template --template-url
- Runs aws cloudformation create-stack (or update-stack for existing stacks) with the S3-hosted HyperPod template
- Writes stack outputs to env_vars.sh

Terraform path:
- Copies custom.tfvars to the terraform-modules directory
- Calls resolve_tf_vars() to patch region, AZ, and instance overrides
- Runs terraform init, plan, and apply (with user confirmation)
- Runs terraform_outputs.sh to extract outputs to env_vars.sh

Usage: deploy.sh --instance-type <ml.X.Y> --infra <cfn|tf> [OPTIONS]
Required:
--instance-type <type> SageMaker instance type (e.g. ml.g5.8xlarge)
--infra <cfn|tf> Infrastructure deployment method
Optional:
--instance-count <N> Number of accelerated instances (default: 4)
--region <region> AWS region (default: us-west-2)
--az-id <az-id> Availability zone ID (default: usw2-az2)
--stack-name <name> CFN stack name (default: hp-eks-slinky-stack)
--help Show help
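
The option handling described by the usage text can be sketched as follows. This is a hypothetical re-creation, not the parser in deploy.sh: the function name parse_args and the variable names are assumptions, only the defaults are taken from the usage text, and --help is omitted for brevity.

```shell
# Hypothetical re-creation of the option handling; defaults come from the usage text.
parse_args() {
  INSTANCE_TYPE=""
  INFRA=""
  INSTANCE_COUNT=4
  REGION="us-west-2"
  AZ_ID="usw2-az2"
  STACK_NAME="hp-eks-slinky-stack"

  while [ "$#" -gt 0 ]; do
    case "$1" in
      --instance-type|--infra|--instance-count|--region|--az-id|--stack-name)
        # Every option here takes a value; refuse to read past the end.
        if [ "$#" -lt 2 ]; then
          echo "missing value for $1" >&2
          return 2
        fi
        ;;
      *)
        echo "unknown option: $1" >&2
        return 2
        ;;
    esac
    case "$1" in
      --instance-type)  INSTANCE_TYPE="$2" ;;
      --infra)          INFRA="$2" ;;
      --instance-count) INSTANCE_COUNT="$2" ;;
      --region)         REGION="$2" ;;
      --az-id)          AZ_ID="$2" ;;
      --stack-name)     STACK_NAME="$2" ;;
    esac
    shift 2
  done

  # Enforce the two required options.
  if [ -z "$INSTANCE_TYPE" ]; then
    echo "--instance-type is required" >&2
    return 2
  fi
  case "$INFRA" in
    cfn|tf) ;;
    *) echo "--infra must be cfn or tf" >&2; return 2 ;;
  esac
}
```

Checking the argument count before reading `$2` avoids looping forever on a trailing option with no value.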
Deployment is successful when:
- deploy.sh exits with code 0
- env_vars.sh exists and contains all 7 exported variables
- kubectl get nodes shows nodes in Ready state
- kubectl cluster-info returns the EKS API server endpoint

# Quick verification
source env_vars.sh
echo "Cluster: $EKS_CLUSTER_NAME"
echo "Region: $AWS_REGION"
kubectl get nodes -o wide
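
The env_vars.sh criterion can also be checked mechanically. The sketch below assumes the file uses one `export NAME=value` line per variable, which is not confirmed here; check_env_file is a hypothetical name.

```shell
# The seven variable names listed for env_vars.sh.
REQUIRED_VARS="AWS_ACCOUNT_ID AWS_REGION EKS_CLUSTER_NAME VPC_ID PRIVATE_SUBNET_ID SECURITY_GROUP_ID STACK_ID"

# Return 0 when the file exists and has a non-empty export line for every name.
check_env_file() {
  local file name missing
  file="$1"
  missing=0
  if [ ! -f "$file" ]; then
    echo "not found: $file" >&2
    return 1
  fi
  for name in $REQUIRED_VARS; do
    if ! grep -Eq "^[[:space:]]*export[[:space:]]+${name}=." "$file"; then
      echo "missing export: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Reading the file with grep, instead of sourcing it, keeps variables already set in the current shell from masking a missing entry.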
| Symptom | Cause | Fix |
|---|---|---|
| Stack creation fails with capacity error | AZ doesn't have capacity for the instance type | Try a different --az-id |
| ROLLBACK_COMPLETE status | CFN template parameter issue | Check CloudFormation events: aws cloudformation describe-stack-events --stack-name <name> |
| Terraform plan fails | Missing terraform-modules directory | Ensure the full awsome-distributed-training repo is cloned, not just the slinky-slurm subdirectory |
| env_vars.sh not created | Script failed before output extraction | Check the script output for errors and re-run |
| AZ ID validation warning | Specified AZ not in the region | Use one of the AZ IDs shown in the "Available AZs" message |
| Stack already exists | Previous deployment not cleaned up | Run destroy.sh first, or use a different --stack-name |
| Template validation fails | S3-hosted template unreachable or params mismatch | Check network/credentials; review the missing/extra parameter warnings in the output |
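
For the AZ-related rows, the AZ IDs available in a region can be listed and checked by hand. This is a minimal sketch and not the validate_az_id helper itself, whose implementation is not shown here; az_in_list and list_az_ids are hypothetical names.

```shell
# Return 0 when the AZ ID appears in a whitespace-separated list of AZ IDs.
az_in_list() {
  local wanted candidate
  wanted="$1"
  shift
  for candidate in $*; do
    if [ "$candidate" = "$wanted" ]; then
      return 0
    fi
  done
  return 1
}

# List the AZ IDs available to this account in a region.
# Requires the AWS CLI and valid credentials.
list_az_ids() {
  aws ec2 describe-availability-zones \
    --region "$1" \
    --filters Name=state,Values=available \
    --query 'AvailabilityZones[].ZoneId' \
    --output text
}
```

A typical use is `az_ids=$(list_az_ids us-east-1)` followed by `az_in_list use1-az2 $az_ids || echo "Available AZs: $az_ids"`.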
- deploy.sh -- Main infrastructure deployment script
- lib/deploy_helpers.sh -- Helper functions (resolve_instance_profile, resolve_cfn_params, resolve_tf_vars, validate_az_id, validate_cfn_template, check_command)
- params.json -- CloudFormation parameters (40 params, g5 defaults)
- custom.tfvars -- Terraform variables (g5 defaults)