| name | add-recipe |
| description | Use when the user asks to add, contribute, or create a new vLLM recipe in this repo (e.g. "add a recipe for Qwen/Qwen3-XYZ", "create a recipe for huggingface.co/org/model"). Walks through fetching HF metadata, authoring the YAML at models/<hf_org>/<hf_repo>.yaml, picking variants/strategies, validating, and committing. |
Add a new vLLM recipe
Recipes are YAML files at models/<hf_org>/<hf_repo>.yaml. The path mirrors HuggingFace (huggingface.co/<hf_org>/<hf_repo>), and the site/API are generated at build time from these files + taxonomy.yaml + strategies/*.yaml.
End-to-end steps
-
Confirm the HF id. You need the exact <org>/<repo> string. If the user gave a URL, strip the https://huggingface.co/ prefix.
-
Fetch model metadata. Run bash scripts/hf-info.sh <org>/<repo> to pull config.json / params.json. Extract:
architecture: moe if num_experts, num_local_experts, moe.num_experts, or a *MoE* architecture name is present. Otherwise dense.
parameter_count: total params (e.g. "671B", "70B"). Use HF model card or the sum of shard sizes.
active_parameters: for MoE, the activated-per-token count (e.g. "37B" on DeepSeek-V3.2). For dense, equal to parameter_count.
context_length: max_position_embeddings from config.json (for VL models, from text_config.max_position_embeddings).
-
Read the README — don't skip this. Run curl -sL "https://huggingface.co/<org>/<repo>/resolve/main/README.md" and scan the install / serve / usage sections in full. Configs are not enough; model authors put load-bearing requirements in prose. Mine the README for:
min_vllm_version / nightly_required — phrases like "install vllm nightly", "requires nightly wheels", or an install snippet using --extra-index-url https://wheels.vllm.ai/nightly mean min_vllm_version: "nightly" + nightly_required: true. A specific tag like "vLLM >= 0.12.0" sets that version. Don't default to 0.11.0 when the README says otherwise.
dependencies: — any pip line beyond vllm itself: version pins (mistral_common >= 1.11.1, transformers >= 5.4.0), extras (vllm[audio]), source installs (pip install git+...), DeepGEMM pins, etc. Pin them even when the README says "auto-installed" — users on stale wheel caches need an explicit upgrade path. Each entry needs a one-line note saying why.
- Parser flags for
features: — --tool-call-parser <name>, --reasoning-parser <name>, --enable-auto-tool-choice. Use the exact parser name the README specifies.
- Companion / draft repos — EAGLE / MTP / Eagle3 heads, NVFP4 quants, instruct vs base. Wire as
spec_decoding feature (draft pointer in --speculative-config) or a sibling variant with model_id: override. Copy the recommended --speculative-config JSON verbatim from the README.
- Recommended serve flags —
--tensor-parallel-size, --gpu_memory_utilization, --max_num_batched_tokens, --max_num_seqs go into the guide's launch command and into variant extra_args when they're variant-specific.
- Hardware guidance / sampling defaults — "recommended on 8xH200" lines inform variant
description + vram_minimum_gb; recommended temperature / top_p / reasoning_effort go in the guide's Client Usage block.
-
Cross-check upstream vLLM support. The README is a snapshot — if it was written at a moment when only nightly worked, that claim rots once stable ships. Never copy the README's "vLLM nightly" claim verbatim without checking. Run these in parallel:
What to extract:
min_vllm_version — set to the lowest stable tag where the model actually works, not what the README claims. Walk forward from the support-PR's release tag, but bump up if there are known parser/tokenizer/quant bugs fixed in a later release (the v0.20.0-style "Mistral Grammar factory" / "tool parser HF-tokenizer fix" entries are signals to bump). Only use min_vllm_version: "nightly" + nightly_required: true when the registry at the latest stable tag genuinely lacks the architecture — and double-check by curling registry.py at that tag. If support is still an open issue (no PR merged), flag this to the user before authoring. For derivative releases (e.g. PaddleOCR-VL-1.5 vs 1.0) with identical architectures / model_type / auto_map, the existing handler usually loads them via --trust-remote-code even before a dedicated PR — note this assumption in your reply.
- Required serve flags hidden in upstream docs — copy any
must be served with <flag> lines from supported_models.md straight into model.base_args (and call them out in the guide's launch command). These are not optional and the README often doesn't mention them.
- Troubleshooting — recurring errors and fixes from issue comments (e.g. "needs
--enforce-eager on 0.11.x", "transformers>=5 required", "--mm-processor-cache-gb 0 to avoid OOM"). Surface these in the guide's Troubleshooting section, or as inline tips next to the launch command if they're load-bearing.
- Links to put in
guide's References — the model card, vLLM support PR (not the recipe-request issue — see below), and any author-side deployment doc. These give users a path forward if their setup breaks.
What NOT to put in References: the recipe-request issue in vllm-project/recipes (e.g. #459) is a tracking ticket, not a user-facing reference. It belongs in the PR description body (Closes #459), never in the YAML's ## References section.
-
Create the YAML. Write models/<hf_org>/<hf_repo>.yaml following the schema below. Only include sections the model needs; leave features: {}, opt_in_features: [], hardware_overrides: {}, strategy_overrides: {} empty if not applicable.
-
Register the provider (if new). If <hf_org> isn't already in src/lib/providers.js, add an entry with display_name and the logo path /providers/<hf_org>.png (or .jpeg). Logos get downloaded by scripts/fetch-provider-logos.mjs on the next build.
-
Validate. Run node scripts/build-recipes-api.mjs. It must print ✓ JSON API: N models, 7 strategies with no errors.
-
Commit. Follow the user's earlier feedback (no kill-and-rebuild of dev server; syntax-check only).
YAML schema (top-level fields, in order)
meta:
title: "..."
slug: "..."
provider: "..."
description: "..."
date_updated: YYYY-MM-DD
difficulty: beginner|intermediate|advanced
tasks:
- text
performance_headline: "..."
related_recipes: []
hardware:
h200: verified
mi355x: verified
model:
model_id: "<hf_org>/<hf_repo>"
min_vllm_version: "0.11.0"
docker_image: ""
nightly_required: false
install:
pip:
command: ""
note: ""
docker:
note: ""
architecture: dense|moe
parameter_count: "30B"
active_parameters: "30B"
context_length: 131072
base_args: []
base_env: {}
dependencies:
- note: "Why you need it (one line)"
command: 'uv pip install -U "vllm[audio]"'
optional: false
brand: NVIDIA
features:
tool_calling:
description: "..."
args: ["--enable-auto-tool-choice", "--tool-call-parser", "<name>"]
reasoning:
description: "..."
args: ["--reasoning-parser", "<name>"]
spec_decoding:
description: "..."
args: ["--speculative-config", '{"method":"mtp","num_speculative_tokens":1}']
opt_in_features:
- spec_decoding
variants:
default:
precision: bf16|fp8|nvfp4|fp4|int4|int8|awq|gptq|mxfp4
vram_minimum_gb: <integer>
description: "..."
fp8:
model_id: "<optional override>"
precision: fp8
vram_minimum_gb: <integer>
description: "..."
extra_args: []
extra_env: {}
compatible_strategies:
- single_node_tp
- single_node_tep
- single_node_dep
- multi_node_tp
- multi_node_dep
- multi_node_tep
- pd_cluster
hardware_overrides:
hopper: { extra_args: [], extra_env: {} }
blackwell: { extra_args: [], extra_env: {} }
amd: { extra_args: [], extra_env: {} }
strategy_overrides:
single_node_tp:
tp: 1
extra_args: []
extra_env: {}
guide: |
...
...
...
...
- [Model card](https://huggingface.co/<hf_org>/<hf_repo>)
VRAM formula
vram_minimum_gb = ceil(params × bytes_per_param × 1.2) where params is the total parameter count (MoE includes inactive experts — they still live in VRAM).
| Precision | Bytes/param |
|---|
| bf16, fp16 | 2 |
| fp8, int8, awq, gptq (8-bit) | 1 |
| int4, nvfp4, fp4, mxfp4 (4-bit) | 0.5 |
Example: a 70B BF16 model → 70 × 2 × 1.2 = 168 GB. Round up.
If the variant is model_id-overridden and the override is a different base model with its own param count (e.g. a distilled FP4 checkpoint), use the override's parameter count — verify it via HF.
Naming and conventions
- Feature keys: prefer
tool_calling, reasoning, spec_decoding. Don't use mtp — it's been renamed across the repo.
- Strategy list: MoE recipes usually support every strategy; dense recipes are limited to
single_node_tp and multi_node_tp (TEP/DEP require MoE).
- Variants: quantized variants reuse the base name (
fp8, nvfp4, int4). If the quantized checkpoint is authored by someone else (e.g. nvidia/*-NVFP4), set model_id: inside the variant.
- Tasks:
omni means served via vLLM-Omni (vllm serve <model> --omni). Add a top-level omni: block listing the task ids the recipe supports — bare strings for catalog defaults (tasks: [t2i]) or { id, model_id?, vram_minimum_gb?, description?, extra_args? } overrides when a task swaps the checkpoint (Wan2.2) or needs per-task flags. Audio-only recipes set omni.serve_binary: "vllm-omni serve". The catalog is src/lib/omni-tasks.js; do not add --omni to model.base_args (auto-injected).
Validation checklist
Before committing:
node scripts/build-recipes-api.mjs succeeds and the new recipe appears in the line count.
node -e "const d = require('./public/<hf_org>/<hf_repo>.json'); console.log(d.model.parameter_count, d.variants.default.vram_minimum_gb)" prints sensible values.
- The YAML top-level key order matches the schema above — downstream tools don't care, but reviewers scan for it.
Commit
Stage only the new recipe (and providers.js if edited):
git add models/<hf_org>/<hf_repo>.yaml src/lib/providers.js
git commit -m "Add <hf_org>/<hf_repo> recipe"
Do not stage public/ (it's generated) or the design docs.