---
name: skypilot
description: Use when launching cloud VMs, Kubernetes pods, or Slurm jobs for GPU/TPU/CPU workloads, training or fine-tuning models on cloud GPUs, deploying inference servers (vllm, TGI, etc.) with autoscaling, writing or debugging SkyPilot task YAML files, using spot/preemptible instances for cost savings, comparing GPU prices across clouds, managing compute across 25+ clouds, Kubernetes, Slurm, and on-prem clusters with failover between them, troubleshooting resource availability or SkyPilot errors, or optimizing cost and GPU availability.
---
# SkyPilot Skill

SkyPilot is a unified framework to run AI workloads on any cloud, Slurm, or Kubernetes. It provides a single interface to launch clusters, run jobs, and serve models across 25+ clouds (AWS, GCP, Azure, Coreweave, Nebius, Lambda, Together AI, RunPod, and more), Kubernetes clusters, and Slurm clusters.
## When to Use SkyPilot
Use SkyPilot when you need to:
- Manage compute resources on any cloud, Slurm, or Kubernetes cluster
- Launch CPU/GPU/TPU (GB300, GB200, B200, H200, H100, etc.) on any cloud, Kubernetes or Slurm
- Run training, fine-tuning, or batch inference jobs
- Serve models with autoscaling and multi-cloud replicas (SkyServe)
- Run long-running jobs with automatic lifecycle management and recovery (managed jobs)
- Find the cheapest or most available GPU across clouds
Don't use SkyPilot for:
- Local-only workloads (use Docker/conda directly)
## Capabilities: When to Use What
SkyPilot has three core abstractions. Use the right one for each stage of your workflow:
### 1. SkyPilot Clusters (`sky launch` / `sky exec`) — Interactive development and debugging

- Use during initial development, debugging, and experimentation
- Launch a cluster, SSH in or connect VSCode/Cursor (`code --remote ssh-remote+CLUSTER`), iterate quickly
- Cluster stays up until you stop/down it or autostop triggers
- Best for: prototyping, debugging, short experiments
### 2. Managed Jobs (`sky jobs launch`) — Long-running training and batch jobs

- Use when submitting long-running jobs that should run unattended
- Manages the full lifecycle: provisioning, execution, recovery, and teardown
- Automatically recovers from spot preemptions, quota limits, and transient failures
- Works across clouds, Kubernetes, and Slurm (handles preemptions and quota)
- Best for: training runs, fine-tuning, hyperparameter sweeps, batch inference
### 3. SkyServe (`sky serve up`) — Production model serving

- Use when serving models at scale with autoscaling
- Start with `sky launch` + an open port to test your serving setup, then use `sky serve up` to scale
- Provides load balancing, autoscaling, and multi-cloud replicas
- Best for: model serving endpoints, API services
## Before You Start (Agent Bootstrap)

Run this bootstrap check to confirm SkyPilot is installed, connected to an API server, and has cloud credentials. Once confirmed, skip straight to the user's task.
**Step 1: Check installation and API server connectivity**

```shell
sky api info
```
| Output contains | Meaning | Next action |
|---|---|---|
| Server version and status | Server is running and connected | Bootstrap done. Skip to user's task. |
| `No SkyPilot API server is connected` | No server connected | Go to "Start or connect a server" below. |
| `Could not connect to SkyPilot API server` | Remote server unreachable or auth expired | Tell the user and suggest `sky api login --relogin -e <endpoint>` to reconnect. |
| `command not found: sky` | SkyPilot not installed | Go to "Install SkyPilot" below. |
**Install SkyPilot** (only if the `sky` command is not found):

```shell
pip install "skypilot[aws,gcp,kubernetes]"
```

Ask the user which clouds they need if unclear, then re-run `sky api info`.
**Start or connect a server** (only if no server is connected):

Ask the user: do you have an existing SkyPilot API server to connect to, or should I start one locally?

- Connect to existing server: `sky api login -e <API_SERVER_URL>` — get the URL from the user.
- Start locally: `sky api start`

After either path, re-run `sky api info` to confirm the server is reachable.
**Step 2: Check cloud credentials** (only for fresh setups — skip if the server was already running)

```shell
sky check -o json
```

This shows which clouds are enabled or disabled. If the user's target cloud is not enabled, guide them through credential setup (see Troubleshooting).
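The bootstrap decision table above can be sketched as a small helper. This is an illustrative sketch, not part of SkyPilot: `bootstrap_status` is a hypothetical name, and the string matching assumes the error messages quoted in the table appear verbatim in the command output.

```python
import shutil
import subprocess

def bootstrap_status() -> str:
    """Classify the environment per the bootstrap table above.

    Returns one of 'not-installed', 'no-server', or 'ready'.
    (Hypothetical helper; it only classifies, it does not fix anything.)
    """
    if shutil.which("sky") is None:
        return "not-installed"  # `command not found: sky`
    proc = subprocess.run(["sky", "api", "info"],
                          capture_output=True, text=True)
    out = proc.stdout + proc.stderr
    if (proc.returncode != 0
            or "No SkyPilot API server" in out
            or "Could not connect" in out):
        return "no-server"      # start or reconnect a server
    return "ready"              # skip straight to the user's task
```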
## Essential Commands

Use `-o json` with status/query commands to get structured JSON output instead of tables.
**Clusters — interactive development and debugging:**

| Command | Description |
|---|---|
| `sky launch -c NAME task.yaml` | Launch a cluster or run a task |
| `sky exec NAME task.yaml` | Run a task on an existing cluster (skips provisioning); syncs workdir each time |
| `sky exec NAME task.yaml -d` | Same, but detach immediately (don't stream logs) |
| `sky status -o json` | Show all clusters |
| `sky logs NAME` | Stream job logs from a cluster |
| `sky logs NAME --no-follow` | Print existing logs and exit immediately |
| `sky logs NAME --tail 50` | Print the last 50 lines of logs and exit |
| `sky logs NAME --status` | Exit with code 0=succeeded, 100=failed, 101=not finished, 102=not found, 103=cancelled |
| `sky queue NAME -o json` | List jobs on a cluster with status (structured JSON) |
| `sky stop NAME` / `sky start NAME` | Stop/restart to save costs (preserves disk) |
| `sky down NAME` | Tear down a cluster completely |
| `sky gpus list -o json` | List available GPU types across clouds |
**Managed Jobs — long-running unattended workloads:**

| Command | Description |
|---|---|
| `sky jobs launch task.yaml` | Launch a managed job (auto lifecycle + recovery) |
| `sky jobs queue -o json` | Show all managed jobs and their status |
| `sky jobs logs JOB_ID` | Stream logs from a managed job |
| `sky jobs cancel JOB_ID` | Cancel a managed job |
**SkyServe — model serving with autoscaling:**

| Command | Description |
|---|---|
| `sky serve up serve.yaml -n NAME` | Start a model serving service |
| `sky serve status NAME` | Show service status and endpoint URL |
| `sky serve update NAME new.yaml` | Update a running service (rolling) |
| `sky serve down NAME` | Tear down a service |
For complete CLI reference, see CLI Reference.
## Quick Start

```shell
# Launch a cluster with an H100 and run a smoke test
sky launch -c mycluster --gpus H100 -- nvidia-smi

# Launch a cluster from a task YAML
sky launch -c mycluster task.yaml

# SSH in, or attach VSCode/Cursor
ssh mycluster
code --remote ssh-remote+mycluster /home/user/sky_workdir

# Tear down when done
sky down mycluster
```
## Task YAML Structure

The task YAML is SkyPilot's primary interface. All fields are optional.

```yaml
name: my-training-job

workdir: .

num_nodes: 1

resources:
  accelerators: H200:8
  use_spot: false
  disk_size: 256
  ports: 8080

envs:
  MODEL_NAME: my-model
  BATCH_SIZE: 32

setup: |
  pip install torch transformers

run: |
  python train.py --model $MODEL_NAME --batch-size $BATCH_SIZE
```

For the complete YAML schema including file mounts, environment variables set by SkyPilot, and advanced fields, see YAML Specification.
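The values under `envs:` are exported as environment variables to the `setup` and `run` commands, so the training script can read them directly or take them as flags, as the `run:` line above does. A minimal sketch of the `train.py` that YAML would call (hypothetical script; the flag names match the `run:` command above):

```python
import argparse
import os

def parse_config(argv=None):
    # MODEL_NAME and BATCH_SIZE come from the `envs:` section of the task
    # YAML (overridable at launch with `--env KEY=VAL`); the CLI flags let
    # the `run:` command pass them explicitly, as in the YAML above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--model",
                        default=os.environ.get("MODEL_NAME", "my-model"))
    parser.add_argument("--batch-size", type=int,
                        default=int(os.environ.get("BATCH_SIZE", "32")))
    return parser.parse_args(argv)

if __name__ == "__main__":
    cfg = parse_config()
    print(f"training {cfg.model} with batch size {cfg.batch_size}")
```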
## GPU and Cloud Selection

**IMPORTANT:** Let SkyPilot choose the cloud and region. Do NOT manually pick a cloud/region/instance by parsing `sky gpus list` output. SkyPilot's optimizer automatically selects the cheapest available option across all enabled clouds. Only specify `infra:` when the user explicitly requests a specific cloud or region.

Default behavior (recommended): just specify the GPU type. SkyPilot finds the cheapest cloud/region automatically:

```yaml
resources:
  accelerators: H200:8
```
If the user doesn't specify a GPU type, ask them what GPU they need (or what model/workload they're running so you can recommend one). Do NOT run `sky gpus list` and pick for them — present options and let the user decide, or use `any_of` to let SkyPilot maximize availability:

```yaml
resources:
  any_of:
    - accelerators: H100:8
    - accelerators: A100-80GB:8
    - accelerators: A100:8
```
Use `ordered` only when the user has a strict preference:

```yaml
resources:
  ordered:
    - infra: aws/us-east-1
      accelerators: H100:8
    - infra: gcp/us-central1
      accelerators: H100:8
    - infra: aws/us-west-2
      accelerators: A100-80GB:8
```
Only set `infra:` when the user explicitly says something like "use AWS" or "run on GCP us-central1":

```yaml
resources:
  infra: aws
  accelerators: H100:8
```
## Cluster Lifecycle

```shell
# Launch (or re-run a task on) a named cluster
sky launch -c mycluster task.yaml

# Launch with autostop after 30 idle minutes
sky launch -c mycluster task.yaml -i 30

# Autostop, but tear down instead of stopping
sky launch -c mycluster task.yaml -i 30 --down

# Override YAML envs at launch time
sky launch -c mycluster task.yaml --env MODEL_NAME=llama3 --env BATCH_SIZE=64

# Run more work on the existing cluster
sky exec mycluster another_task.yaml
sky exec mycluster -- python train.py --epochs 10

# Set or change autostop on a running cluster
sky autostop mycluster -i 30
sky autostop mycluster -i 30 --down

# Stop (preserves disk), restart, or tear down
sky stop mycluster
sky start mycluster
sky down mycluster
```
## Workdir Sync Behavior

`workdir:` is synced to `~/sky_workdir` on the remote via rsync before every `sky exec`. rsync is additive — deleted local files are NOT removed from the remote. This can cause experiments to run against stale build artifacts or old configs.

To ensure a clean slate, SSH in and wipe before `sky exec`:

```shell
ssh mycluster "rm -rf ~/sky_workdir"
sky exec mycluster task.yaml
```

Or clean inside `run:` if only specific artifacts need removal:

```yaml
run: |
  find ~/sky_workdir/build -name '*.o' -delete 2>/dev/null || true
  cd ~/sky_workdir && make
```
## Managed Jobs

Use `sky jobs launch` for long-running jobs that should run unattended. SkyPilot manages the full lifecycle — provisioning, execution, recovery from preemptions/quota/failures, and teardown:

```yaml
name: training-job

resources:
  accelerators: A100:8

run: |
  python train.py --resume-from-checkpoint
```

```shell
sky jobs launch managed-job.yaml
sky jobs queue -o json
sky jobs logs <job_id>
sky jobs cancel <job_id>
```
**Checkpoint pattern:** Your training script should save checkpoints to persistent storage (cloud bucket or volume) and resume from the latest checkpoint on restart. SkyPilot handles the cluster recovery; your script handles the state recovery.
## SkyServe: Model Serving

```yaml
resources:
  accelerators: A100:1
  ports: 8080

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8080

service:
  readiness_probe: /v1/models
  replica_policy:
    min_replicas: 1
    max_replicas: 3
    target_qps_per_replica: 5
```

```shell
sky serve up serve.yaml -n my-llm
sky serve status my-llm
sky serve status my-llm --endpoint
sky serve update my-llm new-serve.yaml
sky serve down my-llm
```
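To smoke-test the service, you can hit the OpenAI-compatible chat route exposed by the vLLM server in the YAML above, using the endpoint printed by `sky serve status my-llm --endpoint`. A stdlib-only sketch (the `host:port` endpoint format and the default model name are assumptions; adjust both to what the service actually serves):

```python
import json
import urllib.request

def build_chat_request(endpoint: str, prompt: str,
                       model: str = "meta-llama/Llama-3.1-8B-Instruct"):
    """Build a POST to the OpenAI-compatible /v1/chat/completions route
    served by vLLM. `endpoint` is the host:port from
    `sky serve status NAME --endpoint`."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"http://{endpoint}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Usage (requires the service to be up):
#   with urllib.request.urlopen(build_chat_request(ep, "hello")) as resp:
#       print(json.load(resp))
```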
## Common Workflows

### Fine-Tuning Workflow

- Write task YAML with `setup` (install deps) and `run` (training command)
- Use `file_mounts` or `workdir` to sync code
- `sky launch -c train task.yaml` to launch
- `sky logs train` to monitor
- `sky exec train -- python eval.py` to evaluate on the same cluster
- `sky down train` when done
### Hyperparameter Sweep

- Create a parameterized YAML with `envs`
- Launch multiple managed jobs:

```shell
for lr in 1e-4 1e-5 1e-6; do
  sky jobs launch sweep.yaml --env LR=$lr --name sweep-lr-$lr
done
```

- Monitor with `sky jobs queue -o json`
### Model Serving Deployment

- Write serve YAML with a `service:` section
- `sky serve up serve.yaml -n my-service`
- Get endpoint: `sky serve status my-service --endpoint`
- Update model: `sky serve update my-service updated.yaml`
### Parallel Experiment Submission

Use `sky exec -d` to submit jobs to multiple VMs without blocking, then collect results:

```shell
# Submit one experiment per VM, detached
for i in 1 2 3 4; do
  sky exec exp-vm-0$i task.yaml --env LR=1e-$i -d
done

# Find the latest job id on a VM, then check status and tail its logs
job_id=$(sky queue exp-vm-01 -o json \
  | python3 -c "import sys, json; jobs = json.load(sys.stdin).get('exp-vm-01', []); print(max(j['job_id'] for j in jobs) if jobs else '')")
sky logs exp-vm-01 $job_id --status && sky logs exp-vm-01 $job_id --tail 50

# Inspect the full queue
sky queue exp-vm-01 -o json
```
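The inline `python3` one-liner above can be lifted into a reusable helper. This is a hypothetical helper, not a SkyPilot API; the assumed JSON shape (a mapping from cluster name to a list of jobs with `job_id` keys) is the same one the one-liner relies on:

```python
import json

def latest_job_id(queue_json: str, cluster: str):
    """Return the most recent job_id from `sky queue CLUSTER -o json`
    output, or None if the cluster has no jobs."""
    jobs = json.loads(queue_json).get(cluster, [])
    return max((job["job_id"] for job in jobs), default=None)
```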
## Agent Feedback Loop

When using SkyPilot programmatically, follow this loop:

- Validate: `sky launch --dryrun task.yaml` (check resource availability/cost)
- Launch: `sky launch -c mycluster task.yaml`
- Monitor: `sky status -o json` and `sky queue mycluster -o json`
- Wait for completion: `sky logs mycluster <JOB_ID>` (streams logs so you can observe progress and react to stalls; blocks until the job finishes; get JOB_ID from `sky queue mycluster -o json`). For long-running jobs where you don't need intermediate output, use `sky logs mycluster <JOB_ID> --status` instead (blocks silently, exits 0 on success).
- Inspect output: `sky logs mycluster <JOB_ID> --no-follow` or `sky logs mycluster <JOB_ID> --tail 100`
- Debug: `ssh mycluster` (interactive)
- Iterate: `sky exec mycluster updated_task.yaml` (run on the existing cluster)
- Cleanup: `sky down mycluster`

Never poll with `sleep` + `sky queue` — use `sky logs CLUSTER JOB_ID` to stream logs and block until done. Use `--status` if you only need the exit code, or `--tail N` to fetch recent output after completion.
## Common Agent Mistakes

| Mistake | Why it's wrong | Do this instead |
|---|---|---|
| Manually picking cloud/region from `sky gpus list` output | SkyPilot's optimizer does this automatically and better | Just set `accelerators:` and let SkyPilot choose |
| Using `sky launch` for long-running unattended jobs | No recovery if preempted or interrupted | Use `sky jobs launch` for unattended work |
| Forgetting `sky down` or autostop after work is done | Wastes money on idle clusters | Always clean up, or use `-i <minutes> --down` at launch |
| Hardcoding `infra: aws` without the user asking | Limits availability and increases cost | Only set `infra:` when the user explicitly requests a cloud |
| Not using `envs:` for configurable values | Hard to reuse or override from the CLI | Use `envs:` in YAML + `--env KEY=VAL` for parameterization |
| Running `sky launch` without `-c <name>` | Creates a randomly named cluster that is hard to reference | Always name clusters with `-c` |
| Parsing table output from status commands | Table formatting is for humans and fragile to parse | Use `-o json` for structured output |
| Using deprecated `cloud:`/`region:`/`zone:` fields | Deprecated in favor of `infra:` | Use `infra: aws/us-east-1` instead |
| Polling job status with `sleep` + `sky queue` | Wastes tokens, introduces timing bugs, fragile | Use `sky logs CLUSTER JOB_ID --status` to block until done |
| Assuming workdir sync removes remote files | rsync is additive; old remote files persist across `sky exec` calls | SSH in and clean `~/sky_workdir`, or clean in the `run:` script |
| Not using `--tail` when only the last output matters | Streaming full logs wastes tokens for long jobs | Use `sky logs CLUSTER JOB_ID --tail 50` for the last N lines |
## Common Issues Quick Reference

| Issue | Solution |
|---|---|
| GPU not available | Use `any_of` for fallback, or try different regions/clouds |
| Setup takes too long | SkyPilot caches setup; use `sky exec` to skip it on reruns |
| Task fails silently | Check `sky logs <cluster>` or `ssh <cluster>` to debug |
| Cluster stuck in INIT | `sky down <cluster>` and relaunch |
| Preemption/quota | Use `sky jobs launch` for automatic recovery and lifecycle management |
| Port not accessible | Ensure `ports:` is set in `resources` and security groups allow traffic |
| File sync slow | Use cloud bucket mounts instead of `workdir` for large datasets |
| Credentials error | Run `sky check -o json` and inspect which clouds are disabled |
## References
For detailed reference documentation: