| name | runpod-vllm-deploy |
| description | Deploy or refresh a RunPod vLLM pod for Alysse production, mirroring the B3G dev server config (Cydonia-24B-v4.3-AWQ + transformers 5.5.0). Picks the cheapest GPU under $0.18/hr (RTX A5000 24GB), installs the exact stack that produces the chat quality the user has tuned for, and updates the Fly.io VLLM_BASE_URL secret. TRIGGER when: user says "create runpod pod", "redeploy vllm", "fix prod chat", "the runpod is dead", or "match B3G config". |
| origin | alysse-internal |
RunPod vLLM Deploy — Hard-Won Recipe
This skill captures every lesson learned during the Apr–May 2026 production migration off B3G onto RunPod. Each rule below cost at least one failed deployment or wrong-output incident. Follow them verbatim.
Why this skill exists
The user spent weeks tuning chat quality on B3G (RTX 5090 32GB, on-prem). When we replaced B3G with RunPod RTX A5000 in production, chat quality collapsed even though we copied the vLLM args verbatim. Three separate root causes all had to be fixed:
- Wrong bucket name in env (R2 silently fell back to local FS → ephemeral images)
- vLLM version too old (0.7.3 vs B3G 0.17.0 → different chat template handling)
- transformers version too old (4.57.6 vs B3G 5.5.0 → completely different output character)
The third one was the killer. The agent in the prior session claimed "transformers==5.5.0 doesn't exist on PyPI"; that was wrong. It does exist, so install it directly.
Step 0 — Confirm reference: SSH B3G if reachable
ssh -i ~/.ssh/b3g_key b3g@192.168.89.106 \
"/home/b3g/vllm_env/bin/pip show transformers vllm torch 2>&1 | grep Version"
Expected output:
Version: 5.5.0 # transformers
Version: 0.17.0 # vllm
Version: 2.10.0 # torch
If B3G is unreachable (VPN off), use these versions anyway; they are the canonical reference.
The Cydonia startup script on B3G is:
- /home/b3g/start-lexIO-start.sh (Cydonia-24B-v4.3-AWQ; this is "the right one")
- NOT /home/b3g/start_vllm.sh (that's qwen35-35b for a different project)
Step 1 — Pick GPU under $0.18/hr
Query RunPod GraphQL to get current prices and stock:
RUNPOD_API_KEY=$(grep "^RUNPOD_API_KEY=" apps/api/.env | cut -d= -f2-)
curl -sS -X POST https://api.runpod.io/graphql \
-H "Authorization: Bearer $RUNPOD_API_KEY" \
-H "Content-Type: application/json" \
-d '{"query":"query { gpuTypes { id displayName memoryInGb lowestPrice(input: {gpuCount: 1}) { uninterruptablePrice stockStatus } } }"}'
Default pick: RTX A5000 24GB at $0.16/hr. It has hosted Cydonia successfully with 16K context + max-num-seqs=8.
Selection order (still respecting the $0.18/hr cap):
- RTX A5000 24GB — $0.16
- RTX A4500 20GB — $0.19 (over budget — needs user approval)
- Anything else: STOP and ask user.
Never silently pick a more expensive GPU. The cap is hard.
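The selection rule above can be sketched as a small filter over the parsed gpuTypes response. This is a minimal sketch, not the deploy tooling itself: the function name, the 24 GB VRAM floor, and the choice to treat a missing price or empty stockStatus as "unavailable" are assumptions; only the field names come from the query in this step.

```python
from typing import Optional

PRICE_CAP_USD_PER_HR = 0.18  # hard cap from this step


def pick_gpu(response: dict, cap: float = PRICE_CAP_USD_PER_HR,
             min_vram_gb: int = 24) -> Optional[dict]:
    """Return the cheapest in-stock GPU at or under the cap, or None.

    `response` is assumed to be the parsed JSON body of the gpuTypes
    query, i.e. {"data": {"gpuTypes": [...]}}.
    """
    candidates = []
    for gpu in (response.get("data") or {}).get("gpuTypes") or []:
        price_info = gpu.get("lowestPrice") or {}
        price = price_info.get("uninterruptablePrice")
        # Assumption: a missing price or falsy stockStatus means "not available".
        if price is None or not price_info.get("stockStatus"):
            continue
        if (gpu.get("memoryInGb") or 0) < min_vram_gb:
            continue
        if price <= cap:
            candidates.append((price, gpu))
    if not candidates:
        return None  # nothing qualifies: stop and ask the user
    return min(candidates, key=lambda item: item[0])[1]
```

A None result maps to the "STOP and ask user" branch; the function never returns a GPU above the cap.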
Step 2 — Create pod with the right base image
12-May-2026 production reality (final): Community Cloud A5000 with allowedCudaVersions: ["12.8","12.9","13.0"] ALWAYS routed us to a broken FR worker (maxDownloadSpeedMbps: 755, diskThroughputMBps: 548) that stalled vllm/vllm-openai:v0.17.0 image pull indefinitely (>15 min, three attempts). Production runs on Secure Cloud A5000 at $0.27/hr (cloudType: "SECURE"). CA worker has 22 Gbps downlink — image pull takes ~5 min, model download via hf_transfer ~20 s. Total deploy: ~10 min. The $0.11/hr premium over Community ($0.16) buys reliability and is worth it; only revisit Community A5000 if RunPod fixes that FR worker's image pull issue.
Avoid the RTX 4000 Ada fallback path — it requires vllm 0.8.5 which only supports transformers 4.51.3, and that combination produced visible chat-quality regression (Brielle saying "mi amor" despite being American, generic short responses). B3G quality requires transformers 5.5.0, which requires vllm 0.17+, which requires CUDA 12.8+ driver, which on Community is only on the broken FR worker. Secure Cloud is the only working path.
CUDA matters. vLLM 0.17 + torch 2.10 needs CUDA 12.8+ driver, ideally 13.0.
CRITICAL (lesson from 11-May-2026): allowedCudaVersions must be set. Community Cloud workers ship with mixed driver versions — many A5000 hosts still have CUDA 12.4 drivers, which silently crash-loop any vllm >= 0.17. Force a 12.8+ host:
"allowedCudaVersions": ["12.8", "12.9", "13.0"]
Without this filter, RunPod picks the cheapest available worker — often the one with the old driver — and the pod restarts every ~10s with RuntimeError: The NVIDIA driver on your system is too old (found version 12040). You cannot see this error from the dashboard unless you override the entrypoint and SSH in.
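For reading that error: the number in "found version 12040" is the CUDA driver version encoded as major*1000 + minor*10, so 12040 means CUDA 12.4. A minimal sketch of the decode and the 12.8 floor check (function names are illustrative):

```python
def decode_cuda_version(raw: int) -> tuple:
    """Decode the integer form used in the driver error, e.g. 12040 -> (12, 4)."""
    return raw // 1000, (raw % 1000) // 10


def driver_ok(raw: int, minimum: tuple = (12, 8)) -> bool:
    """True if the decoded driver version meets the 12.8+ floor from this step."""
    return decode_cuda_version(raw) >= minimum
```

With these, driver_ok(12040) is False, which matches the crash-loop described above.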
Preferred image (use this, NOT runpod/pytorch): vllm/vllm-openai:v0.17.0. It pins the exact vllm version B3G runs and ships torch 2.10 with pre-built kernels; transformers still needs the upgrade in Step 3. Do NOT use :latest, which ships vllm 0.20+ and needs even newer CUDA.
Why we abandoned runpod/pytorch:1.0.3-cu1300-torch290-ubuntu2404: twice in May 2026 it sat stuck on create container: still fetching image indefinitely on Community Cloud A5000 — the tag is not cached on most workers. The vllm/vllm-openai:* images pull cleanly (~90s).
Pod config (use REST API POST /v1/pods, NOT GraphQL — REST exposes dockerStartCmd / dockerEntrypoint arrays; GraphQL only has the legacy string dockerArgs):
Pod create payload (Secure Cloud A5000 — proven working as of 12-May-2026):
{
"name": "alysse-b3g-secure",
"imageName": "vllm/vllm-openai:v0.17.0",
"gpuTypeIds": ["NVIDIA RTX A5000"],
"gpuCount": 1,
"allowedCudaVersions": ["12.8", "12.9", "13.0"],
"containerDiskInGb": 30,
"volumeInGb": 50,
"volumeMountPath": "/runpod-volume",
"ports": ["9000/http", "22/tcp"],
"env": { "HF_HOME": "/runpod-volume/hf-cache", "PUBLIC_KEY": "<your ssh pubkey>" },
"dockerEntrypoint": [
"bash",
"-lc",
"apt-get update -qq && apt-get install -y -qq openssh-server && mkdir -p /var/run/sshd /root/.ssh && echo \"$PUBLIC_KEY\" > /root/.ssh/authorized_keys && chmod 600 /root/.ssh/authorized_keys && /usr/sbin/sshd -D"
],
"dockerStartCmd": [],
"cloudType": "SECURE",
"interruptible": false,
"supportPublicIp": true
}
Why idle entrypoint instead of letting vllm auto-start? Because the image's bundled huggingface_hub (1.14+) uses Xet by default, which on RunPod fails with Connection reset by peer mid-download. We need to SSH in, pip install hf_transfer, pre-download the model via hf download to /runpod-volume/models/..., then launch vllm pointing at the local path. See Step 4.
CRITICAL CLI syntax (vllm 0.17+): model is positional when launching vllm serve <model> [options], NOT --model. The image ENTRYPOINT is vllm serve so any dockerStartCmd is appended. Passing --model <name> causes vllm to crash-loop because the positional [model_tag] is missing.
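If you ever use the auto-start path (image ENTRYPOINT left as vllm serve), the dockerStartCmd array has to begin with the model. A minimal sketch of a builder that enforces this; the function name and the dict-of-options shape are assumptions for illustration, not part of any RunPod or vLLM API:

```python
def build_start_cmd(model: str, options: dict) -> list:
    """Build a dockerStartCmd array for an image whose ENTRYPOINT is `vllm serve`.

    The model goes first as a positional argument; a `model` option is rejected.
    """
    if not model or model.startswith("-"):
        raise ValueError("model must come first as a path or repo id")
    cmd = [model]
    for name, value in options.items():
        key = name.lstrip("-").replace("_", "-")
        if key == "model":
            raise ValueError("pass the model positionally, not as --model")
        if value is True:
            cmd.append(f"--{key}")  # boolean switch, e.g. --trust-remote-code
        elif value is not None and value is not False:
            cmd.extend([f"--{key}", str(value)])
    return cmd
```

The result is a list of strings, which is the array form the REST payload expects for dockerStartCmd.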
VRAM sizing for A5000 24GB on Secure Cloud — full B3G config works:
- --max-model-len 16384 ✓ (matches B3G)
- --max-num-seqs 8 ✓ (matches B3G)
- --gpu-memory-utilization 0.92 ✓ (matches B3G)
- No --enforce-eager needed (CUDA graphs work)
- VRAM used at idle: ~21.3 / 24 GB
Note: on a 20GB GPU (RTX 4000 Ada / A4500) the B3G config will OOM. Reduce to --max-model-len 4096 --max-num-seqs 1 --gpu-memory-utilization 0.95 --enforce-eager.
--enforce-eager skips CUDA graphs: ~1 GB VRAM savings, ~5–10 % slower decode (acceptable for prod chat).
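The two sizing profiles above, as a lookup keyed on VRAM. A minimal sketch with an illustrative function name; anything under 20 GB is not covered by this recipe, so the sketch refuses rather than guesses:

```python
def vllm_sizing_args(vram_gb: int) -> list:
    """Return the vllm serve sizing flags for a GPU with `vram_gb` of VRAM."""
    if vram_gb >= 24:
        # Full B3G config (A5000 24GB).
        return ["--max-model-len", "16384",
                "--max-num-seqs", "8",
                "--gpu-memory-utilization", "0.92"]
    if vram_gb >= 20:
        # Reduced config for 20GB cards (RTX 4000 Ada / A4500).
        return ["--max-model-len", "4096",
                "--max-num-seqs", "1",
                "--gpu-memory-utilization", "0.95",
                "--enforce-eager"]
    raise ValueError(f"no tested config for {vram_gb} GB VRAM")
```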
Pre-installed in the image (do NOT reinstall): vllm 0.17.0 and torch 2.10.0. The image ships transformers 4.57.6, so the one-line transformers upgrade in Step 3 is still needed; the rest of the old manual install applied only to the generic pytorch image.
PUBLIC_KEY env var auto-populates /root/.ssh/authorized_keys on container start. Put YOUR local ~/.ssh/id_ed25519.pub here — the default key in the RunPod account belongs to a previous machine and won't let you SSH in. Pass it as a JSON env value when you POST the pod.
Step 3 — Upgrade transformers (manual stack install is legacy)
Almost obsolete — one upgrade still needed. With vllm/vllm-openai:v0.17.0 you get vllm 0.17, torch 2.10, but transformers 4.57.6 (NOT 5.5.0). B3G chat quality requires transformers 5.5.0 — without it Cydonia 24B produces short generic output and bleeds Spanish into non-Latina characters (e.g. Brielle Hayes saying "mi amor"). SSH in and:
pip install --no-cache-dir 'transformers==5.5.0' hf_transfer
python3 -c 'import transformers; print(transformers.__version__)'
pip will auto-bump huggingface_hub from 0.30 → 1.14 to satisfy transformers 5.5 — that's fine for the pre-download path because we use the new hf download CLI (Step 4). vllm 0.17 emits a resolver warning that it expects transformers <5; ignore it, runtime works (same as B3G).
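To confirm the pod matches the B3G reference without eyeballing pip output, a check along these lines can run on the pod. A minimal sketch: the helper names are illustrative, and the parser only handles plain dotted versions with an optional suffix such as "+cu128".

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """'5.5.0' -> (5, 5, 0); stops at the first non-numeric part ('2.10.0+cu128' -> (2, 10, 0))."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package: str):
    """Version string of an installed distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_reference(installed: str, required: str = "5.5.0") -> bool:
    """True only when the installed version equals the pinned reference."""
    return parse_version(installed) == parse_version(required)
```

On the pod, matches_reference(installed_version("transformers") or "") should be True after the upgrade; with the stock image (4.57.6) it is False.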
Legacy manual-install path (kept for reference)
After SSH-ing into the pod with a generic pytorch image:
mkdir -p /workspace
python3.11 -m venv /workspace/venv
source /workspace/venv/bin/activate
pip install --upgrade pip wheel
pip install torch==2.10.0
pip install vllm==0.17.0
pip install transformers==5.5.0
Verify with pip show transformers vllm torch | grep Version. Ignore the pip resolver warning about vllm requires transformers<5 — it works fine, same as B3G.
Step 4 — Download model
11-May-2026 lesson — HF Xet backend can stall mid-download. Modern huggingface_hub (≥0.31) uses the Xet CAS service by default, which on RunPod workers sometimes throws Connection reset by peer and kills vLLM's auto-download. Symptom in logs: RuntimeError: Data processing error: CAS service error : … Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }.
Reliable fix — after upgrading transformers to 5.5.0 (Step 3), use the new hf CLI (huggingface_hub 1.x renamed the binary) with hf_transfer:
mkdir -p /runpod-volume/models
HF_HUB_ENABLE_HF_TRANSFER=1 hf download \
tacodevs/Cydonia-24B-v4.3-AWQ \
--local-dir /runpod-volume/models/Cydonia-24B-v4.3-AWQ
vllm serve /runpod-volume/models/Cydonia-24B-v4.3-AWQ ...
Note the CLI rename from huggingface-cli download (deprecated, prints help in hf_hub 1.x) to hf download. Both Xet AND hf_transfer are enabled — hf_transfer's reqwest retry logic handles the Xet ConnectionReset errors; 14 GB downloads in ~20 s on Secure Cloud's 22 Gbps backbone, ~90 s on Community Cloud's 900 Mbps.
Once on /runpod-volume, the model persists across pod stop/start (volume mount).
Why not let vLLM auto-download? vllm 0.17 downloads weights through huggingface_hub at startup; on workers with poor network the Xet CAS service throws connection-reset before vllm ever loads weights, and the engine dies with Engine core initialization failed. Pre-downloading isolates the download path from vllm's startup.
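Before pointing vllm at the local path, it is worth checking that the pre-download actually finished. A minimal sketch, assuming the usual Hugging Face layout (a config.json plus one or more *.safetensors shards); the function name is illustrative and this is a sanity check, not a checksum verification:

```python
from pathlib import Path


def model_dir_ready(model_dir: str) -> bool:
    """True if the directory looks like a complete model download."""
    root = Path(model_dir)
    if not (root / "config.json").is_file():
        return False
    shards = list(root.glob("*.safetensors"))
    if not shards:
        return False
    # A zero-byte shard usually means the download was interrupted.
    return all(shard.stat().st_size > 0 for shard in shards)
```

For this recipe the argument would be /runpod-volume/models/Cydonia-24B-v4.3-AWQ; only launch vllm when the check returns True.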
For the idle-pod debug path (use this when diagnosing why vLLM crashes — you NEED SSH):
"dockerEntrypoint": ["bash","-lc","apt-get update -qq && apt-get install -y -qq openssh-server && mkdir -p /var/run/sshd /root/.ssh && echo \"$PUBLIC_KEY\" > /root/.ssh/authorized_keys && chmod 600 /root/.ssh/authorized_keys && /usr/sbin/sshd -D"],
"dockerStartCmd": []
Pass YOUR local ~/.ssh/id_ed25519.pub as the PUBLIC_KEY env. SSH in, upgrade transformers (Step 3), pre-download the model with hf download as above, then launch vLLM manually with nohup ... < /dev/null & disown.
Step 5 — Launch script (LEGACY — only for manual SSH-install path)
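cat > /workspace/start-vllm.sh << 'EOF'
#!/usr/bin/env bash
# Restart vLLM from the venv, killing any previous server and engine process first.
source /workspace/venv/bin/activate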
pkill -9 -f "vllm serve" 2>/dev/null || true
pkill -9 -f "VLLM::EngineCore" 2>/dev/null || true
sleep 5
nohup vllm serve /workspace/models/Cydonia-24B-v4.3-AWQ \
--dtype float16 \
--quantization awq_marlin \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--max-num-seqs 8 \
--host 0.0.0.0 \
--port 9000 \
--api-key llm-team-secret \
--served-model-name cydonia-24b-v4.3 \
--trust-remote-code \
--disable-uvicorn-access-log \
> /tmp/vllm.log 2>&1 < /dev/null &
disown
echo "PID: $!"
EOF
chmod +x /workspace/start-vllm.sh
Then launch: nohup bash /workspace/start-vllm.sh > /dev/null 2>&1 < /dev/null & disown
(The nohup … < /dev/null & disown matters: without it, vLLM dies when SSH disconnects. The EngineCore subprocess can also outlive pkill -f "vllm serve"; if it does, kill -9 it by PID.)
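If the launch is driven from a script rather than typed over SSH, the same detachment can be approximated in Python. A minimal sketch (POSIX only, illustrative function name): start_new_session puts the child in its own session so it has no controlling terminal to hang up on, and stdin is /dev/null as in the shell version.

```python
import subprocess


def launch_detached(argv: list, log_path: str) -> subprocess.Popen:
    """Start `argv` detached from the current terminal, logging to `log_path`."""
    log = open(log_path, "ab")
    try:
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,   # same role as `< /dev/null`
            stdout=log,
            stderr=subprocess.STDOUT,   # same role as `2>&1`
            start_new_session=True,     # own session, survives SSH disconnect
        )
    finally:
        log.close()  # the child keeps its own copy of the descriptor
```

Here argv would be the vllm serve command from the script above, and log_path would be /tmp/vllm.log.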
Step 6 — Wait for ready
until grep -q "Application startup complete" /tmp/vllm.log 2>/dev/null; do sleep 5; done
Expect 25–60 sec on A5000. If OOM at startup:
- Drop --max-num-seqs from 8 → 4 → 2
- If still OOM, drop --gpu-memory-utilization from 0.92 → 0.90
- Do not change anything else; context length stays at 16384
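The wait loop and the OOM ladder above, sketched with a timeout so a crashed vLLM does not leave the wait hanging forever. The 180 s default and the function names are assumptions; the ready marker and the ladder values come from this step.

```python
import time
from pathlib import Path

READY_MARKER = "Application startup complete"


def wait_for_ready(log_path: str, timeout_s: float = 180.0,
                   poll_s: float = 5.0) -> bool:
    """Poll the vLLM log until the ready marker appears or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    log = Path(log_path)
    while True:
        if log.is_file() and READY_MARKER in log.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def oom_ladder() -> list:
    """Settings to try in order, as (max_num_seqs, gpu_memory_utilization).

    Context length is deliberately absent: it stays at 16384.
    """
    return [(8, 0.92), (4, 0.92), (2, 0.92), (2, 0.90)]
```

A driver script would walk oom_ladder() in order, relaunching with each pair and moving to the next only when startup fails with an out-of-memory error.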
Step 7 — Test directly against vLLM
curl -sS https://<POD_ID>-9000.proxy.runpod.net/v1/models \
-H "Authorization: Bearer llm-team-secret"
curl -sS -X POST https://<POD_ID>-9000.proxy.runpod.net/v1/chat/completions \
-H "Authorization: Bearer llm-team-secret" \
-H "Content-Type: application/json" \
-d '{
"model":"cydonia-24b-v4.3",
"messages":[
{"role":"system","content":"You are Valentina Reyes, a 23-year-old Venezuelan woman from Caracas. You are warm, flirty, and speak in a natural casual tone."},
{"role":"user","content":"Hey beautiful, how was your day?"}
],
"temperature":1.0, "top_p":0.95, "min_p":0.1,
"presence_penalty":0.4, "frequency_penalty":0.1, "repetition_penalty":1.15,
"max_tokens":200, "seed":42
}'
Quality smell test: response should open with *smiles brightly* or similar action, then ¡Hola!, then a 2–3-sentence narrative about her day, then bounce back a question. If it's a one-liner like "Hola amor 💕", transformers is wrong — re-check Step 3.
(GPU-arch differences between B3G 5090 and A5000 mean exact tokens differ even with same seed; the style and structure should match.)
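The smell test can be roughed out as a heuristic for quick triage. This is only a sketch under loose assumptions (an asterisk-wrapped action at the start, at least three sentences, a question in the last one); it does not replace reading the output, and the thresholds are illustrative rather than tuned.

```python
import re


def smells_right(reply: str) -> bool:
    """Heuristic version of the quality smell test for the Valentina prompt."""
    text = reply.strip()
    # Opens with an action such as *smiles brightly*.
    opens_with_action = re.match(r"\*[^*]+\*", text) is not None
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return (opens_with_action
            and len(sentences) >= 3
            and "?" in sentences[-1])
```

A one-liner like "Hola amor 💕" fails all three checks, which is the signal to re-check the transformers version.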
Step 8 — Update Fly.io secret
flyctl secrets set VLLM_BASE_URL="https://<POD_ID>-9000.proxy.runpod.net" -a alysse-api
VLLM_MODEL=cydonia-24b-v4.3, VLLM_API_KEY=llm-team-secret, VLLM_MAX_CONTEXT=16384, LLM_BACKEND=vllm are already set on Fly. Don't re-touch them.
secrets set triggers a rolling restart automatically.
Step 9 — Verify production
flyctl status -a alysse-api
flyctl logs -a alysse-api --no-tail | grep "LLM service initialized" | tail -1
curl -sS https://api.alysse.me/api/health
Logs must show LLM service initialized: vLLM at https://<NEW_POD_ID>-9000.proxy.runpod.net.
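A small sketch of the two string checks involved in Steps 8 and 9: building the proxy URL from a pod ID and confirming the log line points at the new pod. Function names are illustrative; the URL pattern and log text are the ones shown above.

```python
def proxy_url(pod_id: str, port: int = 9000) -> str:
    """RunPod proxy URL for a pod's HTTP port, as used for VLLM_BASE_URL."""
    return f"https://{pod_id}-{port}.proxy.runpod.net"


def log_confirms_pod(log_line: str, pod_id: str) -> bool:
    """True if the Fly log line shows the LLM service bound to this pod."""
    return ("LLM service initialized" in log_line
            and proxy_url(pod_id) in log_line)
```

If log_confirms_pod returns False for the newest matching log line, the API is still pointed at the old pod or the restart has not finished.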
Step 10 — Kill any other RunPod pods we own
We pay double for any orphan pod. Always confirm exactly one active pod after the migration:
curl -sS -X POST https://api.runpod.io/graphql \
-H "Authorization: Bearer $RUNPOD_API_KEY" \
-H "Content-Type: application/json" \
-d '{"query":"query { myself { pods { id name desiredStatus costPerHr } } }"}'
If >1 result: terminate the old one with podTerminate(input: {podId: "..."}).
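Picking the orphans out of that query's response can be sketched as below. The function name is illustrative, and treating desiredStatus == "RUNNING" as "still billing for GPU time" is an assumption about the API's status values; verify against a real response before terminating anything.

```python
def orphan_pods(response: dict, keep_pod_id: str) -> list:
    """Return running pods other than the one we intend to keep.

    `response` is assumed to be the parsed JSON of the myself/pods query.
    """
    pods = ((response.get("data") or {}).get("myself") or {}).get("pods") or []
    return [
        pod for pod in pods
        if pod.get("id") != keep_pod_id
        and pod.get("desiredStatus") == "RUNNING"  # assumed status value
    ]
```

An empty list means exactly one active pod remains; anything else is a candidate for podTerminate.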
Common gotchas (do NOT relearn these)
| Symptom | Cause | Fix |
|---|---|---|
| Chat output is short, generic ("Hola amor 💕") | transformers 4.x instead of 5.5 | Re-install transformers==5.5.0 |
| Image gen times out 2min20s on every request | VLLM_BASE_URL points at 192.168.89.106 (B3G LAN — Fly can't reach) | flyctl secrets set VLLM_BASE_URL=https://<pod>-9000.proxy.runpod.net |
| Image upload fails AccessDenied from R2 | R2_BUCKET_NAME=alysse-images (doesn't exist) | Set R2_BUCKET_NAME=deepbond-images (real bucket name) |
| Generated image URL gives 404 after a few minutes | R2 not configured, fell back to ephemeral local FS | Set the 5 R2 secrets on Fly |
| vLLM start hangs >180s | Old vLLM EngineCore subprocess holding port 9000 | kill -9 by PID, not just pkill. Free port: fuser -k 9000/tcp |
| pip install transformers==5.5.0 errors "no matching distribution" | Wrong index URL or stale cache | pip install --no-cache-dir transformers==5.5.0. It exists. |
| Two pods running, paying double | Previous "migrate" left old pod alive | After every successful redeploy, run Step 10 |
| B3G ssh works from host but Docker container ECONNREFUSED | Docker network can't see host's VPN routes | Local dev needs separate solution (port-forward or Tailscale on Docker), unrelated to prod |
| vLLM crash-loop with Engine core initialization failed, GPU never touched | HF Xet CAS service Connection reset by peer during model download | Pre-download with hf download + HF_HUB_ENABLE_HF_TRANSFER=1 (Step 4), then point vllm at the local path |
| vLLM container exits ~80s after start, uptime resets | Container ENTRYPOINT runs vllm directly → vllm crashes on init → container restart | Override dockerEntrypoint to keep container alive (sshd-only), then run vllm manually to capture real error |
| runtime: null for >5 min on Community Cloud worker | Slow worker (maxDownloadSpeedMbps < 1000) pulling 10 GB image | Terminate, pick a worker with diskThroughputMBps > 2000 and maxDownloadSpeedMbps > 900 (US workers typically faster than FR) |
| vLLM crash: NVIDIA driver too old (found version 12040) | Worker has CUDA 12.4 driver, vllm 0.17+ needs 12.8+ | Set the allowedCudaVersions: ["12.8","12.9","13.0"] filter. vllm/vllm-openai:v0.8.5 runs on CUDA 12.4 but regresses chat quality (Step 2), so treat it as a last resort |
| Permission denied (publickey) SSHing to pod | Pod uses RunPod account's stored pubkey, not yours | Pass YOUR ~/.ssh/id_ed25519.pub as the PUBLIC_KEY env var in pod create payload; entrypoint script writes it to /root/.ssh/authorized_keys |
When this skill is invoked
The user typically says one of:
- "create me a runpod pod"
- "fix the prod chat"
- "redeploy vllm"
- "the chat is broken in prod"
- "match B3G config"
- "le pod est dead" (French for "the pod is dead")
Always:
- Check current pod state first (Step 10's query) — maybe nothing is broken, just env mis-pointed.
- If everything's fine but chat output is bad → Step 7 quality smell test → if fails, suspect transformers version → Step 3.
- Stay under $0.18/hr unless the user has approved the Secure Cloud A5000 path ($0.27/hr, Step 2).
- Report at the end: pod ID, URL, hourly cost, transformers version installed, sample chat output.
File references
- apps/api/.env: has RUNPOD_API_KEY and (for local dev only) VLLM_BASE_URL=http://192.168.89.106:9000
- B3G ref script: /home/b3g/start-lexIO-start.sh (Cydonia)
- Pod start script: /workspace/start-vllm.sh (recreate via Step 5 each deploy)
- Fly app: alysse-api (region fra)
- Pod naming: alysse-prod-cydonia-a5000