runpod-vllm-deploy

Name: Runpod Vllm Deploy
Author: ayia


updated: 2026-05-23 22:19

package.json

"author": "ayia"

"repository": "ayia/cakeia"


name	runpod-vllm-deploy
description	Deploy or refresh a RunPod vLLM pod for Alysse production, mirroring the B3G dev server config (Cydonia-24B-v4.3-AWQ + transformers 5.5.0). Picks the cheapest GPU under $0.18/hr (RTX A5000 24GB), installs the exact stack that produces the chat quality the user has tuned for, and updates the Fly.io VLLM_BASE_URL secret. TRIGGER when: user says "create runpod pod", "redeploy vllm", "fix prod chat", "the runpod is dead", or "match B3G config".
origin	alysse-internal

RunPod vLLM Deploy — Hard-Won Recipe

This skill captures every lesson learned during the Apr–May 2026 production migration off B3G onto RunPod. Each rule below cost at least one failed deployment or wrong-output incident. Follow them verbatim.

Why this skill exists

The user spent weeks tuning chat quality on B3G (RTX 5090 32GB, on-prem). When we replaced B3G with RunPod RTX A5000 in production, chat quality collapsed even though we copied the vLLM args verbatim. Three separate root causes all had to be fixed:

1. Wrong bucket name in env (R2 silently fell back to local FS → ephemeral images)
2. vLLM version too old (0.7.3 vs B3G 0.17.0 → different chat template handling)
3. transformers version too old (4.57.6 vs B3G 5.5.0 → completely different output character)

The third one was the killer. The agent in the prior session claimed "transformers==5.5.0 doesn't exist on PyPI"; that was wrong. It does exist; install it directly.

Step 0 — Confirm reference: SSH B3G if reachable

ssh -i ~/.ssh/b3g_key b3g@192.168.89.106 \
  "/home/b3g/vllm_env/bin/pip show transformers vllm torch 2>&1 | grep Version"

Expected output:

Version: 5.5.0   # transformers
Version: 0.17.0  # vllm
Version: 2.10.0  # torch

If B3G is unreachable (VPN off), use these versions anyway; they are the canonical reference.

The Cydonia startup script on B3G is:

/home/b3g/start-lexIO-start.sh (Cydonia-24B-v4.3-AWQ — this is "the right one")
NOT /home/b3g/start_vllm.sh (that's qwen35-35b for a different project)

Step 1 — Pick GPU under $0.18/hr

Query RunPod GraphQL to get current prices and stock:

# -f2- keeps the whole value even if the key itself contains an "=" character
RUNPOD_API_KEY=$(grep "^RUNPOD_API_KEY=" apps/api/.env | cut -d= -f2-)
curl -sS -X POST https://api.runpod.io/graphql \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query":"query { gpuTypes { id displayName memoryInGb lowestPrice(input: {gpuCount: 1}) { uninterruptablePrice stockStatus } } }"}'

Default pick: RTX A5000 24GB at $0.16/hr. It has hosted Cydonia successfully with 16K context + max-num-seqs=8.

If A5000 unavailable, fallback order (still respecting $0.18/hr cap):

RTX A5000 24GB — $0.16
RTX A4500 20GB — $0.19 (over budget — needs user approval)
Anything else: STOP and ask user.

Never silently pick a more expensive GPU. The cap is hard.
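The selection rule above can be sketched as a small filter over the GraphQL response. This is a minimal sketch, not the skill's own tooling: the function name `pick_gpu` is hypothetical, the input is assumed to be the `data.gpuTypes` list returned by the query in this step, a null or empty `stockStatus` is assumed to mean "out of stock", and the 24 GB minimum is an assumption drawn from the A5000 default.

```python
from typing import Optional

PRICE_CAP = 0.18  # hard cap in $/hr, from this step
MIN_VRAM_GB = 24  # assumption: the default pick is a 24 GB card


def pick_gpu(gpu_types: list[dict], cap: float = PRICE_CAP) -> Optional[dict]:
    """Return the cheapest in-stock GPU under the cap, or None (stop and ask the user)."""
    candidates = []
    for gpu in gpu_types:
        price_info = gpu.get("lowestPrice") or {}
        price = price_info.get("uninterruptablePrice")
        # Skip entries with no on-demand price or no reported stock.
        if price is None or not price_info.get("stockStatus"):
            continue
        if gpu.get("memoryInGb", 0) >= MIN_VRAM_GB and price < cap:
            candidates.append(gpu)
    return min(
        candidates,
        key=lambda g: g["lowestPrice"]["uninterruptablePrice"],
        default=None,
    )
```

A `None` result maps to the "STOP and ask user" branch; the function never widens the cap on its own.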

Step 2 — Create pod with the right base image

12-May-2026 production reality (final): production runs on Secure Cloud A5000 at $0.27/hr (cloudType: "SECURE"). Note that this exceeds the $0.18/hr cap from Step 1. On Community Cloud, an A5000 request with allowedCudaVersions: ["12.8","12.9","13.0"] was ALWAYS routed to a broken FR worker (maxDownloadSpeedMbps: 755, diskThroughputMBps: 548) that stalled the vllm/vllm-openai:v0.17.0 image pull indefinitely (>15 min, three attempts). The Secure Cloud CA worker has a 22 Gbps downlink: image pull takes ~5 min, model download via hf_transfer ~20 s, total deploy ~10 min. The $0.11/hr premium over Community ($0.16) buys reliability and is worth it; only revisit Community A5000 if RunPod fixes that FR worker's image pull issue.

Avoid the RTX 4000 Ada fallback path — it requires vllm 0.8.5 which only supports transformers 4.51.3, and that combination produced visible chat-quality regression (Brielle saying "mi amor" despite being American, generic short responses). B3G quality requires transformers 5.5.0, which requires vllm 0.17+, which requires CUDA 12.8+ driver, which on Community is only on the broken FR worker. Secure Cloud is the only working path.

CUDA matters. vLLM 0.17 + torch 2.10 needs CUDA 12.8+ driver, ideally 13.0.

CRITICAL (lesson from 11-May-2026): allowedCudaVersions must be set. Community Cloud workers ship with mixed driver versions — many A5000 hosts still have CUDA 12.4 drivers, which silently crash-loop any vllm >= 0.17. Force a 12.8+ host:

"allowedCudaVersions": ["12.8", "12.9", "13.0"]

Without this filter, RunPod picks the cheapest available worker — often the one with the old driver — and the pod restarts every ~10s with RuntimeError: The NVIDIA driver on your system is too old (found version 12040). You cannot see this error from the dashboard unless you override the entrypoint and SSH in.
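To read that error without guessing, the version code in the message can be decoded. A minimal sketch, assuming the log line keeps the "found version NNNNN" wording quoted above; the helper names are hypothetical. CUDA encodes versions as 1000 × major + 10 × minor, so 12040 means 12.4.

```python
import re

MIN_CUDA = (12, 8)  # minimum driver CUDA version for vllm 0.17, per this step


def parse_driver_cuda(message: str):
    """Extract (major, minor) from a 'found version 12040' style message, or None."""
    match = re.search(r"found version (\d+)", message)
    if not match:
        return None
    code = int(match.group(1))
    # CUDA version codes are 1000 * major + 10 * minor.
    return code // 1000, (code % 1000) // 10


def driver_is_new_enough(message: str) -> bool:
    """True only when a version was found and it is at least MIN_CUDA."""
    version = parse_driver_cuda(message)
    return version is not None and version >= MIN_CUDA
```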

Preferred image (use this, NOT runpod/pytorch): vllm/vllm-openai:v0.17.0. It pins the exact vllm version B3G runs and ships torch 2.10 with compatible pre-built kernels. Do NOT use :latest; it ships vllm 0.20+, which needs an even newer CUDA.

Why we abandoned runpod/pytorch:1.0.3-cu1300-torch290-ubuntu2404: twice in May 2026 it sat stuck on create container: still fetching image indefinitely on Community Cloud A5000 — the tag is not cached on most workers. The vllm/vllm-openai:* images pull cleanly (~90s).

Pod config (use REST API POST /v1/pods, NOT GraphQL — REST exposes dockerStartCmd / dockerEntrypoint arrays; GraphQL only has the legacy string dockerArgs):

Pod create payload (Secure Cloud A5000 — proven working as of 12-May-2026):

{
  "name": "alysse-b3g-secure",
  "imageName": "vllm/vllm-openai:v0.17.0",
  "gpuTypeIds": ["NVIDIA RTX A5000"],
  "gpuCount": 1,
  "allowedCudaVersions": ["12.8", "12.9", "13.0"],
  "containerDiskInGb": 30,
  "volumeInGb": 50,
  "volumeMountPath": "/runpod-volume",
  "ports": ["9000/http", "22/tcp"],
  "env": { "HF_HOME": "/runpod-volume/hf-cache", "PUBLIC_KEY": "<your ssh pubkey>" },
  "dockerEntrypoint": [
    "bash",
    "-lc",
    "apt-get update -qq && apt-get install -y -qq openssh-server && mkdir -p /var/run/sshd /root/.ssh && echo \"$PUBLIC_KEY\" > /root/.ssh/authorized_keys && chmod 600 /root/.ssh/authorized_keys && /usr/sbin/sshd -D"
  ],
  "dockerStartCmd": [],
  "cloudType": "SECURE",
  "interruptible": false,
  "supportPublicIp": true
}
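Before POSTing, the payload can be checked against the rules this step states. The following is a rough sketch of such a pre-flight check; `validate_pod_payload` is a hypothetical helper, not part of RunPod's API, and it only encodes rules taken from this document (CUDA filter set, pinned image tag, no `--model` flag, a real `PUBLIC_KEY`).

```python
def validate_pod_payload(payload: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the payload passes."""
    problems = []

    allowed = payload.get("allowedCudaVersions") or []
    if not allowed:
        problems.append("allowedCudaVersions is missing")
    elif any(tuple(map(int, v.split("."))) < (12, 8) for v in allowed):
        problems.append("allowedCudaVersions admits a driver older than 12.8")

    image = payload.get("imageName", "")
    if ":" not in image or image.endswith(":latest"):
        problems.append("imageName must pin a tag, not :latest")

    # vllm serve takes the model positionally; --model crash-loops the pod.
    if "--model" in (payload.get("dockerStartCmd") or []):
        problems.append("pass the model positionally, not via --model")

    key = (payload.get("env") or {}).get("PUBLIC_KEY", "")
    if not key or key.startswith("<"):
        problems.append("env.PUBLIC_KEY must hold your SSH public key")

    return problems
```

Running it on the payload above with the `<your ssh pubkey>` placeholder left in place reports exactly one problem, which is the intended reminder.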

Why an idle entrypoint instead of letting vllm auto-start? Because huggingface_hub's default Xet download backend fails on RunPod with Connection reset by peer mid-download. We need to SSH in, pip install hf_transfer, pre-download the model via hf download to /runpod-volume/models/..., then launch vllm pointing at the local path. See Step 4.

CRITICAL CLI syntax (vllm 0.17+): model is positional when launching vllm serve <model> [options], NOT --model. The image ENTRYPOINT is vllm serve so any dockerStartCmd is appended. Passing --model <name> causes vllm to crash-loop because the positional [model_tag] is missing.

VRAM sizing for A5000 24GB on Secure Cloud — full B3G config works:

--max-model-len 16384 ✓ (matches B3G)
--max-num-seqs 8 ✓ (matches B3G)
--gpu-memory-utilization 0.92 ✓ (matches B3G)
No --enforce-eager needed (CUDA graphs work)
VRAM used at idle: ~21.3 / 24 GB

Note: on a 20GB GPU (RTX 4000 Ada / A4500) the B3G config will OOM. Reduce to --max-model-len 4096 --max-num-seqs 1 --gpu-memory-utilization 0.95 --enforce-eager.

--enforce-eager skips CUDA graphs: ~1 GB VRAM saved, ~5–10 % slower decode (acceptable for prod chat).
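The positional-model rule and the two VRAM profiles above can be combined into one argument builder. This is an illustrative sketch only: `vllm_serve_argv` is a hypothetical helper, and treating "24 GB or more" and "20 to 23 GB" as the two tiers is an assumption that generalizes the two cards this document actually tested.

```python
def vllm_serve_argv(model_path: str, vram_gb: int) -> list[str]:
    """Build a `vllm serve` command line with the model as the positional argument."""
    if vram_gb >= 24:
        # Full B3G config, proven on the A5000 24GB.
        sizing = [
            "--max-model-len", "16384",
            "--max-num-seqs", "8",
            "--gpu-memory-utilization", "0.92",
        ]
    elif vram_gb >= 20:
        # Reduced config for 20 GB cards, where the B3G config runs out of memory.
        sizing = [
            "--max-model-len", "4096",
            "--max-num-seqs", "1",
            "--gpu-memory-utilization", "0.95",
            "--enforce-eager",
        ]
    else:
        raise ValueError(f"no documented vLLM profile for {vram_gb} GB of VRAM")
    # The model comes right after `serve`; there is deliberately no --model flag.
    return ["vllm", "serve", model_path, *sizing]
```

The returned list can be passed to `subprocess.Popen` as is, or everything after `vllm serve` can be used as `dockerStartCmd`, since the image entrypoint already supplies those two words.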

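Pre-installed in the image (do NOT reinstall): vllm 0.17.0 and torch 2.10.0. The image ships transformers 4.57.6, not 5.5.0, so the one-line transformers upgrade in Step 3 is still required; only the legacy full manual install, which was needed with the generic pytorch image, can be skipped.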

The entrypoint above writes the PUBLIC_KEY env var to /root/.ssh/authorized_keys on container start. Put YOUR local ~/.ssh/id_ed25519.pub here; the default key in the RunPod account belongs to a previous machine and won't let you SSH in. Pass it as a JSON env value when you POST the pod.

Step 3 — Upgrade transformers (full manual install is legacy)

Almost obsolete — one upgrade still needed. With vllm/vllm-openai:v0.17.0 you get vllm 0.17, torch 2.10, but transformers 4.57.6 (NOT 5.5.0). B3G chat quality requires transformers 5.5.0 — without it Cydonia 24B produces short generic output and bleeds Spanish into non-Latina characters (e.g. Brielle Hayes saying "mi amor"). SSH in and:

pip install --no-cache-dir 'transformers==5.5.0' hf_transfer
python3 -c 'import transformers; print(transformers.__version__)'  # must print 5.5.0

pip will auto-bump huggingface_hub from 0.30 → 1.14 to satisfy transformers 5.5 — that's fine for the pre-download path because we use the new hf download CLI (Step 4). vllm 0.17 emits a resolver warning that it expects transformers <5; ignore it, runtime works (same as B3G).
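A quick way to confirm the whole stack, not only transformers, is to compare installed versions against the B3G reference from Step 0. A minimal sketch using the standard library's `importlib.metadata`; the helper names are hypothetical, and stripping a local build tag such as `+cu128` before comparing is an assumption about how torch reports its version on the pod.

```python
from importlib import metadata

# Reference versions from Step 0 (B3G).
REFERENCE = {"transformers": "5.5.0", "vllm": "0.17.0", "torch": "2.10.0"}


def installed_versions(packages=REFERENCE) -> dict:
    """Look up each package's installed version; None when it is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def version_mismatches(installed: dict, reference: dict = REFERENCE) -> dict:
    """Map package -> (installed, expected) for every package that is off-reference."""
    bad = {}
    for name, expected in reference.items():
        actual = installed.get(name)
        # Drop local build tags such as '+cu128' before comparing.
        if actual is None or actual.split("+")[0] != expected:
            bad[name] = (actual, expected)
    return bad
```

On the pod, `version_mismatches(installed_versions())` should come back empty after the upgrade; a remaining `transformers` entry means the upgrade did not land in the environment vLLM uses.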

Legacy manual-install path (kept for reference)

After SSH-ing into the pod with a generic pytorch image:

mkdir -p /workspace
python3.11 -m venv /workspace/venv
source /workspace/venv/bin/activate
pip install --upgrade pip wheel
pip install torch==2.10.0
pip install vllm==0.17.0
pip install transformers==5.5.0   # exists on PyPI, install it

Verify with pip show transformers vllm torch | grep Version. Ignore the pip resolver warning about vllm requires transformers<5 — it works fine, same as B3G.

Step 4 — Download model

11-May-2026 lesson — HF Xet backend can stall mid-download. Modern huggingface_hub (≥0.31) uses the Xet CAS service by default, which on RunPod workers sometimes throws Connection reset by peer and kills vLLM's auto-download. Symptom in logs: RuntimeError: Data processing error: CAS service error : … Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }.

Reliable fix — after upgrading transformers to 5.5.0 (Step 3), use the new hf CLI (huggingface_hub 1.x renamed the binary) with hf_transfer:

mkdir -p /runpod-volume/models
HF_HUB_ENABLE_HF_TRANSFER=1 hf download \
  tacodevs/Cydonia-24B-v4.3-AWQ \
  --local-dir /runpod-volume/models/Cydonia-24B-v4.3-AWQ
# Then launch vllm against the LOCAL path
vllm serve /runpod-volume/models/Cydonia-24B-v4.3-AWQ ...

Note the CLI rename from huggingface-cli download (deprecated, prints help in hf_hub 1.x) to hf download. Both Xet AND hf_transfer are enabled — hf_transfer's reqwest retry logic handles the Xet ConnectionReset errors; 14 GB downloads in ~20 s on Secure Cloud's 22 Gbps backbone, ~90 s on Community Cloud's 900 Mbps.

Once on /runpod-volume, the model persists across pod stop/start (volume mount).

Why not let vLLM auto-download? vllm 0.17 fetches weights through huggingface_hub at startup; on workers with poor network the Xet CAS service throws connection-reset before vllm ever loads weights, and the engine dies with Engine core initialization failed. Pre-downloading isolates the download path from vllm's startup.
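The same pre-download can be driven from Python rather than the CLI. This is a sketch under assumptions: it uses `huggingface_hub.snapshot_download`, which is the library call behind this kind of download, the helper names are hypothetical, and whether the `HF_HUB_ENABLE_HF_TRANSFER` flag has any effect depends on the installed huggingface_hub version.

```python
import os
import posixpath


def local_model_dir(repo_id: str, root: str = "/runpod-volume/models") -> str:
    """Map 'org/name' to '<root>/name', matching the --local-dir used in this step."""
    return posixpath.join(root, repo_id.split("/")[-1])


def predownload(repo_id: str, root: str = "/runpod-volume/models") -> str:
    """Download the full repo to the volume and return the local path for vllm serve."""
    # Set the flag before the import; huggingface_hub reads its env config at import time.
    os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "1")
    from huggingface_hub import snapshot_download  # imported lazily, pod-only dependency

    target = local_model_dir(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=target)
    return target
```

Calling `predownload("tacodevs/Cydonia-24B-v4.3-AWQ")` on the pod returns the path to hand to `vllm serve`.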

For the idle-pod debug path (use this when diagnosing why vLLM crashes — you NEED SSH):

"dockerEntrypoint": ["bash","-lc","apt-get update -qq && apt-get install -y -qq openssh-server && mkdir -p /var/run/sshd /root/.ssh && echo \"$PUBLIC_KEY\" > /root/.ssh/authorized_keys && chmod 600 /root/.ssh/authorized_keys && /usr/sbin/sshd -D"],
"dockerStartCmd": []

Pass YOUR local ~/.ssh/id_ed25519.pub as the PUBLIC_KEY env. SSH in, upgrade transformers (Step 3), pre-download the model with hf download as shown above, then launch vLLM manually with nohup ... < /dev/null & disown.

Step 5 — Launch script (LEGACY — only for manual SSH-install path)

cat > /workspace/start-vllm.sh << 'EOF'
#!/bin/bash
source /workspace/venv/bin/activate
pkill -9 -f "vllm serve" 2>/dev/null || true
pkill -9 -f "VLLM::EngineCore" 2>/dev/null || true
sleep 5

nohup vllm serve /workspace/models/Cydonia-24B-v4.3-AWQ \
  --dtype float16 \
  --quantization awq_marlin \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 8 \
  --host 0.0.0.0 \
  --port 9000 \
  --api-key llm-team-secret \
  --served-model-name cydonia-24b-v4.3 \
  --trust-remote-code \
  --disable-uvicorn-access-log \
  > /tmp/vllm.log 2>&1 < /dev/null &
disown
echo "PID: $!"
EOF
chmod +x /workspace/start-vllm.sh

Then launch: nohup bash /workspace/start-vllm.sh > /dev/null 2>&1 < /dev/null & disown

(The nohup … < /dev/null & disown matters — without it, vLLM dies when SSH disconnects. Also pkill survivors are real: the EngineCore subprocess can outlive pkill -f "vllm serve" — kill -9 by PID if needed.)

Step 6 — Wait for ready

until grep -q "Application startup complete" /tmp/vllm.log 2>/dev/null; do sleep 5; done

Expect 25–60 sec on A5000. If OOM at startup:

Drop --max-num-seqs from 8 → 4 → 2
If still OOM, drop --gpu-memory-utilization from 0.92 → 0.90
Do not change anything else — context length stays at 16384
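The wait loop and the OOM ladder above can be sketched together. Unlike the shell `until` loop, this version gives up after a timeout instead of spinning forever when vLLM has crashed. The helper names are hypothetical and the 180 s default is an assumption, chosen well above the 25 to 60 s expected on an A5000.

```python
import time

READY_MARKER = "Application startup complete"


def oom_fallbacks():
    """Yield (max_num_seqs, gpu_memory_utilization) in the order this step prescribes."""
    for seqs in (8, 4, 2):
        yield seqs, 0.92
    # Last resort: lower memory utilization. Context length stays at 16384 throughout.
    yield 2, 0.90


def wait_for_ready(log_path: str, timeout_s: float = 180.0, poll_s: float = 5.0) -> bool:
    """Poll the vLLM log until the ready marker appears; False when the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with open(log_path, encoding="utf-8", errors="replace") as handle:
                if READY_MARKER in handle.read():
                    return True
        except FileNotFoundError:
            pass  # vLLM has not created the log yet
        time.sleep(poll_s)
    return False
```

A deploy script would walk `oom_fallbacks()` in order, relaunching with each pair and calling `wait_for_ready("/tmp/vllm.log")` after every attempt.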

Step 7 — Test directly against vLLM

curl -sS https://<POD_ID>-9000.proxy.runpod.net/v1/models \
  -H "Authorization: Bearer llm-team-secret"
# Must return cydonia-24b-v4.3 with max_model_len=16384

curl -sS -X POST https://<POD_ID>-9000.proxy.runpod.net/v1/chat/completions \
  -H "Authorization: Bearer llm-team-secret" \
  -H "Content-Type: application/json" \
  -d '{
    "model":"cydonia-24b-v4.3",
    "messages":[
      {"role":"system","content":"You are Valentina Reyes, a 23-year-old Venezuelan woman from Caracas. You are warm, flirty, and speak in a natural casual tone."},
      {"role":"user","content":"Hey beautiful, how was your day?"}
    ],
    "temperature":1.0, "top_p":0.95, "min_p":0.1,
    "presence_penalty":0.4, "frequency_penalty":0.1, "repetition_penalty":1.15,
    "max_tokens":200, "seed":42
  }'

Quality smell test: response should open with *smiles brightly* or similar action, then ¡Hola!, then a 2–3-sentence narrative about her day, then bounce back a question. If it's a one-liner like "Hola amor 💕", transformers is wrong — re-check Step 3.

(GPU-arch differences between B3G 5090 and A5000 mean exact tokens differ even with same seed; the style and structure should match.)
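The smell test can be approximated with a crude structural check on the reply text. This is only a heuristic sketch: the function name and the exact thresholds (an opening `*action*`, at least three sentences, a question somewhere) are assumptions derived from the description above, and a human read of the output remains the real test.

```python
import re


def looks_like_b3g_quality(reply: str) -> bool:
    """Heuristic: opens with an *action*, has several sentences, and asks something back."""
    text = reply.strip()
    opens_with_action = re.match(r"\*[^*]+\*", text) is not None
    # Split on runs of sentence-ending punctuation and count the non-empty pieces.
    sentences = [part for part in re.split(r"[.!?]+", text) if part.strip()]
    asks_back = "?" in text
    return opens_with_action and len(sentences) >= 3 and asks_back
```

A one-liner such as "Hola amor 💕" fails all three checks, which is the signal to re-check the transformers version.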

Step 8 — Update Fly.io secret

flyctl secrets set VLLM_BASE_URL="https://<POD_ID>-9000.proxy.runpod.net" -a alysse-api

VLLM_MODEL=cydonia-24b-v4.3, VLLM_API_KEY=llm-team-secret, VLLM_MAX_CONTEXT=16384, LLM_BACKEND=vllm are already set on Fly. Don't re-touch them.

secrets set triggers a rolling restart automatically.

Step 9 — Verify production

flyctl status -a alysse-api
flyctl logs -a alysse-api --no-tail | grep "LLM service initialized" | tail -1
curl -sS https://api.alysse.me/api/health

Logs must show LLM service initialized: vLLM at https://<NEW_POD_ID>-9000.proxy.runpod.net.
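That log check can be automated by looking at the most recent initialization line only, so an older line naming the previous pod does not produce a false pass. A small sketch with hypothetical helper names; it assumes the log text is the output of the `flyctl logs` command above and that the line format matches the one quoted in this step.

```python
def pod_proxy_url(pod_id: str, port: int = 9000) -> str:
    """Build the RunPod proxy URL format used throughout this document."""
    return f"https://{pod_id}-{port}.proxy.runpod.net"


def logs_point_at_pod(log_text: str, pod_id: str) -> bool:
    """True when the latest 'LLM service initialized' line names the given pod's URL."""
    lines = [line for line in log_text.splitlines() if "LLM service initialized" in line]
    return bool(lines) and pod_proxy_url(pod_id) in lines[-1]
```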

Step 10 — Kill any other RunPod pods we own

We pay double for any orphan pod. Always confirm exactly one active pod after the migration:

curl -sS -X POST https://api.runpod.io/graphql \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query":"query { myself { pods { id name desiredStatus costPerHr } } }"}'

If >1 result: terminate the old one with podTerminate(input: {podId: "..."}).
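Picking out the orphans from that response is a one-line filter. A minimal sketch, assuming the response has the `data.myself.pods` shape implied by the query above; the helper names are hypothetical, and the mutation text follows the `podTerminate(input: {podId: "..."})` form given in this step.

```python
def pods_to_terminate(response: dict, keep_pod_id: str) -> list[str]:
    """Return the IDs of every pod in the response except the one to keep."""
    pods = ((response.get("data") or {}).get("myself") or {}).get("pods") or []
    return [pod["id"] for pod in pods if pod["id"] != keep_pod_id]


def terminate_request_body(pod_id: str) -> dict:
    """Build the JSON body for the GraphQL terminate call for one pod."""
    return {"query": 'mutation { podTerminate(input: {podId: "%s"}) }' % pod_id}
```

Each body can be sent with the same curl headers as the query above; review the ID list before sending anything, since termination cannot be undone.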

Common gotchas (do NOT relearn these)

| Symptom | Cause | Fix |
| --- | --- | --- |
| Chat output is short, generic ("Hola amor 💕") | transformers 4.x instead of 5.5 | Re-install `transformers==5.5.0` |
| Image gen times out 2min20s on every request | `VLLM_BASE_URL` points at `192.168.89.106` (B3G LAN, which Fly can't reach) | `flyctl secrets set VLLM_BASE_URL=https://<pod>-9000.proxy.runpod.net` |
| Image upload fails `AccessDenied` from R2 | `R2_BUCKET_NAME=alysse-images` (doesn't exist) | Set `R2_BUCKET_NAME=deepbond-images` (real bucket name) |
| Generated image URL gives 404 after a few minutes | R2 not configured, fell back to ephemeral local FS | Set the 5 R2 secrets on Fly |
| vLLM start hangs >180s | Old vLLM EngineCore subprocess holding port 9000 | `kill -9` by PID, not just `pkill`. Free port: `fuser -k 9000/tcp` |
| `pip install transformers==5.5.0` errors "no matching distribution" | Wrong index URL or stale cache | `pip install --no-cache-dir transformers==5.5.0`. It exists. |
| Two pods running, paying double | Previous "migrate" left old pod alive | After every successful redeploy, run Step 10 |
| B3G ssh works from host but Docker container ECONNREFUSED | Docker network can't see host's VPN routes | Local dev needs separate solution (port-forward or Tailscale on Docker), unrelated to prod |
| vLLM crash-loop with `Engine core initialization failed`, GPU never touched | HF Xet CDN `ConnectionReset by peer` during model download | Pre-download via `hf download` with `HF_HUB_ENABLE_HF_TRANSFER=1` (Step 4), point vllm at local path |
| vLLM container exits ~80s after start, uptime resets | Container ENTRYPOINT runs vllm directly → vllm crashes on init → container restart | Override `dockerEntrypoint` to keep container alive (sshd-only), then run vllm manually to capture real error |
| `runtime: null` for >5 min on Community Cloud worker | Slow worker (`maxDownloadSpeedMbps` < 1000) pulling 10 GB image | Terminate, pick a worker with `diskThroughputMBps > 2000` and `maxDownloadSpeedMbps > 900` (US workers typically faster than FR) |
| vLLM crash: `NVIDIA driver too old (found version 12040)` | Worker has CUDA 12.4 driver, vllm 0.17+ needs 12.8+ | Set the `allowedCudaVersions: ["12.8","12.9","13.0"]` filter. `vllm/vllm-openai:v0.8.5` is CUDA 12.4 compatible but is limited to transformers 4.51.3 and regresses chat quality (see Step 2) |
| `Permission denied (publickey)` SSHing to pod | Pod uses RunPod account's stored pubkey, not yours | Pass YOUR `~/.ssh/id_ed25519.pub` as the `PUBLIC_KEY` env var in pod create payload; entrypoint script writes it to `/root/.ssh/authorized_keys` |
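Several of these gotchas leave a recognizable string in a log, so a first-pass triage can be a plain substring lookup. This is an illustrative sketch, not an exhaustive classifier: the `SIGNATURES` list and `diagnose` helper are hypothetical, and they cover only rows whose error text is quoted in this document.

```python
# (substring to look for, suggested fix) pairs taken from the table above.
SIGNATURES = [
    ("driver on your system is too old",
     'Set allowedCudaVersions to ["12.8", "12.9", "13.0"] and recreate the pod'),
    ("CAS service error",
     "Pre-download the model (Step 4) and point vllm at the local path"),
    ("Permission denied (publickey)",
     "Pass your own public key as the PUBLIC_KEY env var in the pod payload"),
    ("No matching distribution",
     "Retry with pip install --no-cache-dir transformers==5.5.0"),
]


def diagnose(log_text: str) -> list[str]:
    """Return the suggested fix for every known signature found in the log text."""
    lowered = log_text.lower()
    return [fix for needle, fix in SIGNATURES if needle.lower() in lowered]
```

An empty result means none of the known signatures matched, not that the pod is healthy.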

When this skill is invoked

The user typically says one of:

"create me a runpod pod"
"fix the prod chat"
"redeploy vllm"
"the chat is broken in prod"
"match B3G config"
"le pod est dead"

Always:

Check current pod state first (Step 10's query) — maybe nothing is broken, just env mis-pointed.
If everything's fine but chat output is bad → Step 7 quality smell test → if fails, suspect transformers version → Step 3.
Stay under $0.18/hr; the only documented exception is the Secure Cloud A5000 at $0.27/hr (Step 2).
Report at the end: pod ID, URL, hourly cost, transformers version installed, sample chat output.

File references

apps/api/.env — has RUNPOD_API_KEY and (for local dev only) VLLM_BASE_URL=http://192.168.89.106:9000
B3G ref script: /home/b3g/start-lexIO-start.sh (Cydonia)
Pod start script: /workspace/start-vllm.sh (recreate via Step 5 each deploy)
Fly app: alysse-api (region fra)
Pod naming: alysse-prod-cydonia-a5000
