| name | skypilot-multi-cloud-orchestration |
| description | Multi-cloud orchestration for ML workloads with automatic cost optimization. Use when you need to run training or batch jobs across multiple clouds, leverage spot instances with auto-recovery, or optimize GPU costs across providers. |
| version | 1.0.0 |
| author | Orchestra Research |
| license | MIT |
| tags | ["Infrastructure","Multi-Cloud","Orchestration","GPU","Cost Optimization","SkyPilot"] |
| dependencies | ["skypilot>=0.7.0"] |
SkyPilot Multi-Cloud Orchestration
Comprehensive guide to running ML workloads across clouds with automatic cost optimization using SkyPilot.
When to use SkyPilot
Use SkyPilot when you:
- Run ML workloads across multiple clouds (AWS, GCP, Azure, etc.)
- Need cost optimization with automatic cloud/region selection
- Run long jobs on spot instances with auto-recovery
- Manage distributed multi-node training
- Want a unified interface for 20+ cloud providers
- Need to avoid vendor lock-in
Key features:
- Multi-cloud: AWS, GCP, Azure, Kubernetes, Lambda, RunPod, and others (20+ providers)
- Cost optimization: Automatic cheapest cloud/region selection
- Spot instances: 3-6x cost savings with automatic recovery
- Distributed training: Multi-node jobs with gang scheduling
- Managed jobs: Auto-recovery, checkpointing, fault tolerance
- Sky Serve: Model serving with autoscaling
Use alternatives instead:
- Modal: For simpler serverless GPU with Python-native API
- RunPod: For single-cloud persistent pods
- Kubernetes: For existing K8s infrastructure
- Ray: For pure Ray-based orchestration
Quick start
Installation
pip install "skypilot[aws,gcp,azure,kubernetes]"
sky check
Hello World
Create hello.yaml:
resources:
  accelerators: T4:1
run: |
  nvidia-smi
  echo "Hello from SkyPilot!"
Launch:
sky launch -c hello hello.yaml
ssh hello
sky down hello
Core concepts
Task YAML structure
name: my-task
resources:
  cloud: aws
  region: us-west-2
  accelerators: A100:4
  cpus: 8+
  memory: 32+
  use_spot: true
  disk_size: 256
num_nodes: 2
workdir: .
setup: |
  pip install -r requirements.txt
run: |
  python train.py
Key commands
| Command | Purpose |
|---|---|
| sky launch | Launch cluster and run task |
| sky exec | Run task on existing cluster |
| sky status | Show cluster status |
| sky stop | Stop cluster (preserve state) |
| sky down | Terminate cluster |
| sky logs | View task logs |
| sky queue | Show job queue |
| sky jobs launch | Launch managed job |
| sky serve up | Deploy serving endpoint |
GPU configuration
Available accelerators
accelerators: T4:1
accelerators: L4:1
accelerators: A10G:1
accelerators: L40S:1
accelerators: A100:4
accelerators: A100-80GB:8
accelerators: H100:8
accelerators: V100:4
accelerators: TPU-v4-8
GPU fallbacks
resources:
  accelerators:
    H100: 8
    A100-80GB: 8
    A100: 8
  any_of:
    - cloud: gcp
    - cloud: aws
    - cloud: azure
Spot instances
resources:
  accelerators: A100:8
  use_spot: true
  job_recovery: FAILOVER
Cluster management
Launch and execute
sky launch -c mycluster task.yaml
sky exec mycluster another_task.yaml
ssh mycluster
sky logs mycluster
Autostop
resources:
  accelerators: A100:4
  autostop:
    idle_minutes: 30
    down: true
sky autostop mycluster -i 30 --down
Cluster status
sky status
sky status -a
Distributed training
Multi-node setup
resources:
  accelerators: A100:8
num_nodes: 4
setup: |
  pip install torch torchvision
run: |
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) \
    --master_port=12355 \
    train.py
Environment variables
| Variable | Description |
|---|---|
| SKYPILOT_NODE_RANK | Node index (0 to num_nodes-1) |
| SKYPILOT_NODE_IPS | Newline-separated IP addresses |
| SKYPILOT_NUM_NODES | Total number of nodes |
| SKYPILOT_NUM_GPUS_PER_NODE | GPUs per node |
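A training script can also read these variables directly instead of relying on shell expansion. The following is a minimal sketch, assuming the four variables listed above are set by SkyPilot on every node and that the first IP belongs to the head node (as the `head -n1` call in the torchrun example implies). The names `NodeInfo` and `read_node_info` are hypothetical helpers, not part of SkyPilot.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class NodeInfo:
    """Cluster layout as seen from one node (hypothetical helper type)."""
    rank: int
    num_nodes: int
    gpus_per_node: int
    ips: list[str]

    @property
    def head_ip(self) -> str:
        # Assumption: the first listed IP is the head node.
        return self.ips[0]

    @property
    def is_head(self) -> bool:
        return self.rank == 0

    @property
    def world_size(self) -> int:
        # Total number of GPU workers across all nodes.
        return self.num_nodes * self.gpus_per_node


def read_node_info(env: Mapping[str, str] = os.environ) -> NodeInfo:
    """Parse the SKYPILOT_* variables into a NodeInfo."""
    # SKYPILOT_NODE_IPS is newline-separated; drop blank lines.
    ips = [ip.strip() for ip in env["SKYPILOT_NODE_IPS"].splitlines() if ip.strip()]
    return NodeInfo(
        rank=int(env["SKYPILOT_NODE_RANK"]),
        num_nodes=int(env["SKYPILOT_NUM_NODES"]),
        gpus_per_node=int(env["SKYPILOT_NUM_GPUS_PER_NODE"]),
        ips=ips,
    )
```

Inside a script, `read_node_info().is_head` could gate work that should run once per cluster, such as logging or evaluation.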
Head-node-only execution
run: |
  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
    python orchestrate.py
  fi
Managed jobs
Spot recovery
sky jobs launch -n my-job train.yaml
Checkpointing
name: training-job
file_mounts:
  /checkpoints:
    name: my-checkpoints
    store: s3
    mode: MOUNT
resources:
  accelerators: A100:8
  use_spot: true
run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume-from-latest
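Auto-recovery only restarts the job; resuming from the last checkpoint is up to the training script. The sketch below shows one way `train.py` might implement its `--resume-from-latest` flag. It assumes checkpoints are written as `step_<N>.pt` files in the mounted directory; that naming scheme and the `latest_checkpoint` helper are illustrative assumptions, not something SkyPilot prescribes.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme: step_<N>.pt, e.g. step_2000.pt (hypothetical).
_STEP_PATTERN = re.compile(r"^step_(\d+)\.pt$")


def latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    best_step, best_path = -1, None
    for path in directory.iterdir():
        match = _STEP_PATTERN.match(path.name)
        if match is None:
            continue  # Skip partial uploads and unrelated files.
        step = int(match.group(1))  # Compare numerically, not as strings.
        if step > best_step:
            best_step, best_path = step, path
    return best_path
```

On the first launch the function returns `None` and training starts from scratch; after a preemption it returns the newest complete checkpoint in `/checkpoints`.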
Job management
sky jobs queue
sky jobs logs my-job
sky jobs cancel my-job
File mounts and storage
Local file sync
workdir: ./my-project
file_mounts:
  /data/config.yaml: ./config.yaml
  ~/.vimrc: ~/.vimrc
Cloud storage
file_mounts:
  /datasets:
    source: s3://my-bucket/datasets
    mode: MOUNT
  /models:
    source: gs://my-bucket/models
    mode: COPY
  /outputs:
    name: my-outputs
    store: s3
    mode: MOUNT_CACHED
Storage modes
| Mode | Description | Best For |
|---|---|---|
| MOUNT | Stream from cloud | Large datasets, read-heavy |
| COPY | Pre-fetch to disk | Small files, random access |
| MOUNT_CACHED | Cache with async upload | Checkpoints, outputs |
Sky Serve (Model Serving)
Basic service
service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0
resources:
  accelerators: A100:1
  ports: 8000
run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000
sky serve up -n my-service service.yaml
sky serve status
sky serve status my-service
Autoscaling policies
service:
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300
  load_balancing_policy: round_robin
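To get a feel for how `target_qps_per_replica` interacts with `min_replicas` and `max_replicas`, the following simplified model may help. It assumes the desired replica count is the observed request rate divided by the per-replica target, rounded up and clamped to the configured bounds. This is an illustration only: the function `target_replicas` is hypothetical, and the real autoscaler additionally waits for the upscale and downscale delays before acting.

```python
import math


def target_replicas(
    current_qps: float,
    target_qps_per_replica: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Simplified QPS-based replica target (ignores scaling delays)."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    # Enough replicas that each one handles at most the target rate.
    desired = math.ceil(current_qps / target_qps_per_replica)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With the policy above, a sustained 7 requests per second would call for 4 replicas under this model, while an idle service stays at the minimum of 1.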
Cost optimization
Automatic cloud selection
resources:
  accelerators: A100:8
sky launch task.yaml --dryrun
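Conceptually, automatic selection means choosing the lowest-priced available offer that satisfies the resource request. The toy model below illustrates that idea only; it is not SkyPilot's optimizer. The `Offer` type, the `cheapest_offer` function, and every price in the usage example are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Offer:
    """One candidate placement (hypothetical type for illustration)."""
    cloud: str
    region: str
    hourly_usd: float
    available: bool = True


def cheapest_offer(offers: Iterable[Offer]) -> Offer:
    """Pick the lowest-priced offer among those currently available."""
    candidates = [offer for offer in offers if offer.available]
    if not candidates:
        raise LookupError("no available offer satisfies the request")
    return min(candidates, key=lambda offer: offer.hourly_usd)


if __name__ == "__main__":
    # Placeholder prices, not real quotes.
    offers = [
        Offer("aws", "us-east-1", 4.10),
        Offer("gcp", "us-central1", 3.67),
        Offer("azure", "eastus", 3.40, available=False),
    ]
    print(cheapest_offer(offers))
```

The unavailable offer is skipped even though it is the cheapest, which mirrors why a request can land on a different cloud or region from one launch to the next.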
Cloud preferences
resources:
  accelerators: A100:8
  any_of:
    - cloud: gcp
      region: us-central1
    - cloud: aws
      region: us-east-1
    - cloud: azure
Environment variables
envs:
  HF_TOKEN: $HF_TOKEN
  WANDB_API_KEY: $WANDB_API_KEY
secrets:
  - HF_TOKEN
  - WANDB_API_KEY
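Variables declared this way reach the job as ordinary environment variables, so a script can check for them up front rather than failing partway through a run. A minimal sketch, assuming the two variable names from the YAML above; the `require_env` helper is hypothetical.

```python
import os
from typing import Iterable, Mapping

REQUIRED_VARS = ("HF_TOKEN", "WANDB_API_KEY")


def require_env(
    names: Iterable[str] = REQUIRED_VARS,
    env: Mapping[str, str] = os.environ,
) -> dict[str, str]:
    """Return the requested variables, or fail fast if any is unset or empty."""
    names = list(names)
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in names}
```

Calling `require_env()` at the top of `train.py` turns a missing token into an immediate, readable error before any GPU time is spent.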
Common workflows
Workflow 1: Fine-tuning with checkpoints
name: llm-finetune
file_mounts:
  /checkpoints:
    name: finetune-checkpoints
    store: s3
    mode: MOUNT_CACHED
resources:
  accelerators: A100:8
  use_spot: true
setup: |
  pip install transformers accelerate
run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume
Workflow 2: Hyperparameter sweep
name: hp-sweep-${RUN_ID}
envs:
  RUN_ID: 0
  LEARNING_RATE: 1e-4
  BATCH_SIZE: 32
resources:
  accelerators: A100:1
  use_spot: true
run: |
  python train.py \
    --lr $LEARNING_RATE \
    --batch-size $BATCH_SIZE \
    --run-id $RUN_ID
for i in {1..10}; do
  sky jobs launch -y -d sweep.yaml \
    --env RUN_ID=$i \
    --env LEARNING_RATE=$(python -c "import random; print(10**random.uniform(-5,-3))")
done
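The inline `python -c` call draws a fresh random learning rate on every iteration, so a sweep cannot be reproduced later. The sketch below builds the same kind of `sky jobs launch` commands from a seeded generator, sampling log-uniformly between 1e-5 and 1e-3 as the loop does. The function names and the seed are illustrative assumptions; only the `--env` flag shown in the loop above is used.

```python
import random
import shlex


def sample_learning_rate(
    rng: random.Random, low_exp: float = -5.0, high_exp: float = -3.0
) -> float:
    """Draw a learning rate log-uniformly from [10**low_exp, 10**high_exp]."""
    return 10 ** rng.uniform(low_exp, high_exp)


def build_sweep_commands(
    num_runs: int, seed: int = 0, yaml_path: str = "sweep.yaml"
) -> list[list[str]]:
    """Build one launch command per run, reproducible for a given seed."""
    rng = random.Random(seed)
    commands = []
    for run_id in range(1, num_runs + 1):
        learning_rate = sample_learning_rate(rng)
        commands.append([
            "sky", "jobs", "launch", yaml_path,
            "--env", f"RUN_ID={run_id}",
            "--env", f"LEARNING_RATE={learning_rate:.3e}",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for command in build_sweep_commands(10):
        print(shlex.join(command))
```

Printing the commands first makes it easy to inspect the sampled values; each list could then be passed to `subprocess.run` once the sweep looks right.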
Debugging
ssh mycluster
sky logs mycluster
sky queue mycluster
sky jobs logs my-job
Common issues
| Issue | Solution |
|---|---|
| Quota exceeded | Request quota increase, try different region |
| Spot preemption | Use sky jobs launch for auto-recovery |
| Slow file sync | Use MOUNT_CACHED mode for outputs |
| GPU not available | Use any_of for fallback clouds |