add-model
// Add a new model to the SGLang Cookbook, including documentation, sidebar, config generator component, and model YAML configuration.
| name | add-model |
| description | Add a new model to the SGLang Cookbook, including documentation, sidebar, config generator component, and model YAML configuration. |
| disable-model-invocation | true |
Interactive, multi-step workflow. Collect inputs incrementally — don't ask for everything upfront.
Ask the user for:
- Hugging Face model ID (e.g., `Qwen/Qwen3-Coder-Next`). Fetch the page to extract the description, capabilities, etc. If the model isn't public yet, ask the user to paste what they know (name, param count, architecture, capabilities, context length).
- Model variants, if any. See `Qwen3CoderConfigGenerator` and `Qwen3NextConfigGenerator` for multi-variant patterns.
- The `sglang serve --model-path` launch command with all flags (tp, dp, ep, etc.), not `python -m sglang.launch_server` (deprecated, issue #33). If the model card provides one, use it as a starting point but verify the format.
- The tested SGLang version (e.g., `v0.5.10`). Used in benchmark metadata and Docker image tags. Note: the YAML directory is the latest existing `data/models/src/<version>/` directory, which may lag the tested SGLang version by a minor release. Don't create a new `v<X.Y.Z>/` dir for a point release that doesn't exist yet; reuse the latest dir (`ls data/models/src/` to check).

Read ALL reference templates first, then create files.
- Doc template: `docs/autoregressive/` (e.g., `Qwen3-Coder.md`, `DeepSeek-V3_2.md`)
- Config generator: `src/components/autoregressive/` (e.g., `Qwen3NextConfigGenerator/index.js`)
- Model YAML: `data/models/src/<version>/<similar-model>.yaml`. Run `ls data/models/src/` and pick the latest dir (e.g., `v0.5.10`). This is a versioned corpus, not a "this-version-only" bucket: older models stay in their original dir.
- `sidebars.js`
- `data/models/vendors.yaml`

Conventions to follow:

- Config generators live at `src/components/autoregressive/<ModelName>ConfigGenerator/index.js` (not nested in vendor folders)
- Model YAMLs live in `data/models/src/<version>/` (not directly in `data/models/src/`)
- Base `ConfigGenerator` component: `src/components/base/ConfigGenerator`
- Launch commands use `sglang serve`, never `python -m sglang.launch_server`
- Check existing PRs (`gh pr list --search "<model name>"`) to avoid duplicate work
- For `commandRule` options, follow the `Object.entries(this.options).forEach(...)` pattern from existing generators

Only include platforms the user has actually tested.
| Platform | Vendor | Memory | Docker Image |
|---|---|---|---|
| A100 | NVIDIA | 80GB | lmsysorg/sglang:<ver> |
| H100 | NVIDIA | 80GB | lmsysorg/sglang:<ver> |
| H200 | NVIDIA | 141GB | lmsysorg/sglang:<ver> |
| B200 | NVIDIA | 180GB | lmsysorg/sglang:<ver> |
| B300 | NVIDIA | 275GB | lmsysorg/sglang:<ver> (or -cu130 for CUDA 13) |
| GB300 | NVIDIA | 275GB | lmsysorg/sglang:<ver>-cu130 (Grace-Blackwell, CUDA 13 required; typical single-node host = 4 GPUs → TP=4) |
| MI300X | AMD | 192GB | lmsysorg/sglang:<ver>-rocm720-mi30x |
| MI325X | AMD | 256GB | lmsysorg/sglang:<ver>-rocm720-mi30x |
| MI350X | AMD | 288GB | lmsysorg/sglang:<ver>-rocm720-mi35x |
| MI355X | AMD | 288GB | lmsysorg/sglang:<ver>-rocm720-mi35x |
TP calculation: model_weight_GB / gpu_mem_GB, round up to nearest power of 2. Leave 20-30% headroom.
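The TP rule of thumb above can be sketched as a small helper. This is a minimal sketch, assuming the headroom is applied by shrinking the usable memory of each GPU before dividing; the function name `min_tp` and the 600 GB example checkpoint are hypothetical and not taken from the cookbook.

```python
import math


def min_tp(model_weight_gb: float, gpu_mem_gb: float, headroom: float = 0.25) -> int:
    """Smallest power-of-2 TP size that fits the weights with headroom left over.

    Assumption: `headroom` is the fraction of each GPU's memory kept free
    (0.20-0.30 per the rule of thumb) for KV cache and activations.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_gb = gpu_mem_gb * (1 - headroom)
    gpus_needed = max(1, math.ceil(model_weight_gb / usable_gb))
    # Round up to the next power of two (1, 2, 4, 8, ...).
    return 1 << (gpus_needed - 1).bit_length()


if __name__ == "__main__":
    # Hypothetical 600 GB checkpoint on 141 GB GPUs with 25% headroom.
    print(min_tp(600, 141))
```

Treat the result as a starting point only: the value still has to be confirmed by an actual launch on the target hardware before it goes into the docs or the generator.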
Platform-specific flags (only add if tested):
- Blackwell: `--attention-backend trtllm_mha`; GB300 needs the `-cu130` Docker tag (CUDA 13)
- AMD: `--attention-backend triton`
- AMD environment variables: `SGLANG_USE_AITER=1`, `SGLANG_ROCM_FUSED_DECODE_MLA=0`
- Attention-head constraint: `heads_per_gpu % 16 == 0`

Expert Parallelism (EP) for MoE models, common patterns observed:
- `--tp 8 --ep 8`
- EP = TP (e.g., `--tp 4 --ep 4`)
- Don't add `--ep` unless explicitly benchmarked; don't blindly scale EP

New vendor? If the vendor isn't in `data/models/vendors.yaml`, add an entry before referencing it in the model YAML:
<vendor-id>:
  name: <Human-readable name>
  huggingface_org: <HF org slug>
Without an entry, compile_models.py falls back to using the raw vendor id as huggingface_org (line ~630), so model paths may render wrong and the UI loses the human-readable vendor name. The TS schema (data/schema/types.ts) only checks that vendor is a non-empty string — it does NOT cross-check against vendors.yaml — so the failure is silent/cosmetic, not a hard CI break. Still, always add the entry.
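The fallback described above can be mirrored in a quick local pre-check. This is a sketch only and not the code in `compile_models.py`; the function name `resolve_vendor`, the already-parsed `vendors` dictionary, and the example vendor ids are all hypothetical.

```python
from typing import Dict, Tuple


def resolve_vendor(vendor_id: str, vendors: Dict[str, Dict[str, str]]) -> Tuple[str, str, bool]:
    """Return (display_name, huggingface_org, was_fallback) for a vendor id.

    Mirrors the fallback described above: an unknown vendor id is reused
    verbatim as the Hugging Face org, and no human-readable name is available.
    """
    entry = vendors.get(vendor_id)
    if entry is None:
        return vendor_id, vendor_id, True
    return entry.get("name", vendor_id), entry.get("huggingface_org", vendor_id), False


if __name__ == "__main__":
    # Hypothetical contents of vendors.yaml after parsing.
    vendors = {"qwen": {"name": "Qwen", "huggingface_org": "Qwen"}}
    for vendor_id in ("qwen", "acme-ai"):
        name, org, fallback = resolve_vendor(vendor_id, vendors)
        note = "MISSING from vendors.yaml" if fallback else "ok"
        print(f"{vendor_id}: name={name} org={org} ({note})")
```

Because the real failure is silent, a check like this is mainly useful as a reminder to add the entry before opening the PR.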
Create docs/autoregressive/<Vendor>/<ModelName>.md:
| Hardware Platform | Docker Image |
| --- | --- |
| NVIDIA A100 / H100 / H200 / B200 | `lmsysorg/sglang:<ver>` |
| NVIDIA B300 / GB300 | `lmsysorg/sglang:<ver>-cu130` |
| AMD MI300X / MI325X | `lmsysorg/sglang:<ver>-rocm720-mi30x` |
| AMD MI350X / MI355X | `lmsysorg/sglang:<ver>-rocm720-mi35x` |
- **Output Example:** plus a fenced `text` block after each example. Use `Pending update...` placeholders if the model isn't yet deployed.
- `Pending update...` placeholders are acceptable for unfinished runs. Benchmark test-environment metadata (Hardware, Model quantization, TP, SGLang version, Docker image) must match a quantization actually listed in Section 1: `(BF16)` on a model that only released INT4 is a factual bug.

Benchmark commands: each benchmark has two pieces. The deploy (server launch at the top of the section) uses `sglang serve`. The bench workload uses `python3 -m sglang.bench_serving` (never bare `python -m`).
SGLang built-in benchmarks (lightweight, no extra deps):
- GSM8K: `python3 benchmark/gsm8k/bench_sglang.py --port <port>`
- MMLU: `python3 benchmark/mmlu/bench_sglang.py --port <port>`
- MMMU: `python3 benchmark/mmmu/bench_sglang.py --port <port>`. Uses a universal answer regex that works across models. Don't use model-specific parsing (e.g., `<|begin_of_box|>`) as it breaks with standard answer formats. Note: this is plain MMMU, not MMMU-Pro or MMMU-Pro-Vision; those are separate benchmarks.
- Speed, concurrency 1: `python3 -m sglang.bench_serving --backend sglang --num-prompts 10 --max-concurrency 1 ...`
- Speed, concurrency 100: `python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --max-concurrency 100 ...`

Heavier reasoning/MCQ suites via NVIDIA NeMo-Skills (GPQA Diamond, AIME, MMLU-Pro, etc.):
- Run `ns prepare_data <dataset>`, then `ns eval --server_type=openai --server_address=http://localhost:30000/v1 --model=<hf-path> --benchmarks=<name>:<epochs> ...`
- Add `++parse_reasoning=True` so the grader sees the answer, not the `<think>` content.
- For 10-choice MCQ benchmarks, use `++prompt_config=eval/aai/mcq-10choices` (not the 4-choice config).
- `++inference.tokens_to_generate=120000` is typical. A 32K cap often produces a spiky "No Answer" rate; call this out in the results if it happens.
- Report `pass@1 (avg-of-N)`, `majority@N`, `pass@N`, plus a No Answer column if non-zero:

| Evaluation Mode | Accuracy | No Answer |
|--------------------|----------|-----------|
| pass@1 (avg-of-8) | 84.91% | 3.54% |
| **majority@8** | **88.89%** | 0.00% |
| pass@8 | 96.46% | 0.00% |
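To make the three evaluation modes in the table concrete, here is a minimal sketch of how they could be computed from N sampled answers per problem. It assumes the usual definitions (average per-sample accuracy, most-frequent-answer vote, at-least-one-correct); NeMo-Skills may compute or tie-break them differently, and the `summarize` function and its input layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def summarize(samples: List[List[Optional[str]]], gold: List[str]) -> Dict[str, float]:
    """Aggregate N sampled answers per problem into the three reported modes.

    samples[i] holds the N extracted answers for problem i (None = no answer).
    """
    n_problems = len(gold)
    pass1 = majority = pass_n = no_answer = 0.0
    for answers, target in zip(samples, gold):
        n = len(answers)
        correct = sum(a == target for a in answers)
        pass1 += correct / n                   # average accuracy over the N samples
        pass_n += 1.0 if correct > 0 else 0.0  # solved by at least one sample
        no_answer += sum(a is None for a in answers) / n
        # Majority vote ignores missing answers; ties go to the answer seen first.
        votes = Counter(a for a in answers if a is not None)
        if votes and votes.most_common(1)[0][0] == target:
            majority += 1.0
    return {
        "pass@1 (avg-of-N)": pass1 / n_problems,
        "majority@N": majority / n_problems,
        "pass@N": pass_n / n_problems,
        "no_answer (per sample)": no_answer / n_problems,
    }


if __name__ == "__main__":
    # Two made-up problems with four samples each.
    demo = summarize([["A", "A", "B", None], ["C", "D", "D", "D"]], ["A", "C"])
    for mode, value in demo.items():
        print(f"{mode}: {value:.2%}")
```

Under these definitions the No Answer rate is naturally highest in the pass@1 row, since voting and any-correct aggregation hide individual samples that produced no answer.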
Keep benchmarks concise. Order: accuracy first, then speed. Don't add multiple scenarios or concurrency levels unless asked.
Notes:
- Don't hardcode sampling parameters (`temperature`, `top_p`) in sample code; SGLang uses `generation_config.json` defaults. (It's fine to list "Recommended Generation Parameters" informationally in Section 1.)
- Non-thinking mode (`enable_thinking: False`) examples
- Format raw object dumps (`ChatCompletionMessage(...)`) into readable structured output
- Reasoning models may return `reasoning_content` instead of (or in addition to) `content`. When writing the example, print both so the output isn't misleadingly `None`.
- Follow each example with **Output Example:** and a fenced `text` block with the real run output. Keep the text verbatim from the server; don't paraphrase.

Edit `sidebars.js`: add the new entry under the right vendor.
Update docs/intro.md (homepage):
- Mark the entry `- [x]` if the doc has real content, `- [ ]` if stub/placeholder
- Limit `NEW` tags to 3 or fewer total in each of `intro.md` AND `sidebars.js`; they're tracked independently. If adding a new `NEW` tag pushes either file over 3, remove the oldest `NEW` tag in that file first (`git log --oneline -- sidebars.js` / `docs/intro.md` to find when each tag was added).
- Model order in `intro.md` should match `sidebars.js`

Create `src/components/autoregressive/<ModelName>ConfigGenerator/index.js`.
- Use the base `ConfigGenerator` component
- Define `modelConfigs` with per-hardware `tp` and `mem` values: `h200: { fp8: { tp: 8, mem: 0.85 }, bf16: { tp: 16, mem: 0.85 } }`
- Branch on hardware in `generateCommand`:
const isAMD = ['mi300x','mi325x','mi350x','mi355x'].includes(hardware);
const isBlackwell = ['b200','b300'].includes(hardware);
if (isAMD) { /* AMD-specific flags */ }
if (isBlackwell) { /* Blackwell-specific flags */ }
- Use `commandRule` for optional features (tool calling, reasoning parser, etc.)

Reasoning parser: For hybrid models, use an Enabled/Disabled toggle (the model always thinks; the parser just separates output). For separate Instruct/Thinking variants, the toggle changes the model name suffix.
Reasoning parsers fall into two client-side patterns — the sample code in Section 4 needs to match:
- Separate-field parsers (`--reasoning-parser kimi_k2`, most qwen/glm parsers): thinking text lands in `message.reasoning_content`, answer in `message.content`. Print both.
- Inline-tag parsers (`--reasoning-parser minimax-append-think`): thinking is wrapped in `<think>...</think>` inside `message.content`. The client has to parse the tags itself. For streaming demos, walk a buffer looking for `<think>` / `</think>` markers and split as you print.
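The two client-side patterns can be sketched as follows. This is an illustrative sketch, not cookbook sample code: `format_message` and `split_think_stream` are hypothetical helper names, the message object is stood in for by a `SimpleNamespace`, and the stream is a plain list of text chunks rather than a live API response.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator, Tuple

OPEN_TAG, CLOSE_TAG = "<think>", "</think>"


def format_message(message) -> str:
    """Separate-field pattern: show both fields, even when one of them is None."""
    reasoning = getattr(message, "reasoning_content", None)
    return f"Reasoning:\n{reasoning}\n\nContent:\n{message.content}"


def split_think_stream(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Inline-tag pattern: yield ("reasoning" | "content", text) from streamed text."""
    buffer = ""
    in_think = False
    for chunk in chunks:
        buffer += chunk
        while True:
            tag = CLOSE_TAG if in_think else OPEN_TAG
            kind = "reasoning" if in_think else "content"
            idx = buffer.find(tag)
            if idx != -1:
                if idx > 0:
                    yield kind, buffer[:idx]
                buffer = buffer[idx + len(tag):]
                in_think = not in_think
                continue
            # No complete tag yet: hold back any suffix that could be the
            # beginning of the tag, since it may be completed by the next chunk.
            keep = 0
            for k in range(min(len(tag) - 1, len(buffer)), 0, -1):
                if tag.startswith(buffer[-k:]):
                    keep = k
                    break
            emit_upto = len(buffer) - keep
            if emit_upto > 0:
                yield kind, buffer[:emit_upto]
            buffer = buffer[emit_upto:]
            break
    if buffer:
        yield ("reasoning" if in_think else "content"), buffer


if __name__ == "__main__":
    print(format_message(SimpleNamespace(content=None, reasoning_content="step 1")))
    # Tags deliberately split across chunk boundaries.
    for kind, text in split_think_stream(["<thi", "nk>plan", " A</th", "ink>Answer", " 42"]):
        print(f"[{kind}] {text!r}")
```

Holding back a possible partial tag at the end of the buffer is what keeps a `<think>` marker that arrives in two chunks from leaking into the printed output.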
Pick the pattern from the model card / SGLang docs for that specific parser before writing the example.

DP Attention: Disabled (Low Latency) / Enabled (High Throughput). The `--dp` value commonly matches `--tp` but this isn't mandatory. Handle in `generateCommand`, not via static `commandRule`:
if (values.dpattention === 'enabled') {
cmd += ` \\\n --dp ${tpValue} \\\n --enable-dp-attention`;
}
In config tips, describe --dp matching --tp as a common pattern, not a requirement.
Large models (>400B): BF16 needs ~2x GPUs vs FP8. Reflect this in modelConfigs. Omit combos that don't fit.
Multiple variants: Add modelSize and/or quantization selectors. See GLM51ConfigGenerator, GLM5ConfigGenerator, Qwen3CoderConfigGenerator, Qwen3NextConfigGenerator for patterns.
Platform-required flags: If a platform requires certain flags to function at all (e.g., AMD MI355X needs --attention-backend triton), add them unconditionally for that platform — NOT gated behind optional checkboxes like "Performance Optimizations". Optional optimizations go inside checkbox guards; required-to-work flags go outside.
Doc ↔ generator parity: The documented per-hardware launch command (e.g., the sglang serve block in the AMD benchmark section) must be byte-for-byte identical to what the generator emits when that hardware is selected. If you add --kv-cache-dtype fp8_e4m3 or --mem-fraction-static 0.8 for AMD in the generator, the documented AMD command needs it too — and vice versa. Drift here is the single most common review finding. If a flag is platform-required (not user-toggleable), the generator owns it and the doc should mirror it.
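One way to check the parity rule above before requesting review is to diff the two commands directly. This is a minimal sketch using the standard-library `difflib`; it assumes both commands have been pasted in by hand as strings (the documented block and the text copied from the generator UI), and `parity_diff` is a hypothetical helper name.

```python
import difflib
from typing import List


def parity_diff(doc_command: str, generated_command: str) -> List[str]:
    """Return unified-diff lines; an empty list means the commands are identical."""
    if doc_command == generated_command:
        return []
    diff = list(
        difflib.unified_diff(
            doc_command.splitlines(),
            generated_command.splitlines(),
            fromfile="docs",
            tofile="generator",
            lineterm="",
        )
    )
    # splitlines() hides newline-only differences, so report those explicitly.
    return diff or ["(commands differ only in line endings or a trailing newline)"]


if __name__ == "__main__":
    documented = "sglang serve \\\n  --model-path <model> \\\n  --tp 8"
    generated = documented + " \\\n  --mem-fraction-static 0.8"
    print("\n".join(parity_diff(documented, generated)) or "identical")
```

A non-empty result points at exactly the flag that drifted, which is usually faster than comparing the two blocks by eye.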
No dead code: Don't define commandRule on options if generateCommand handles them directly (the rules will never be called). Don't use getDynamicItems if the items don't depend on other option values — use static items instead. Don't leave unused helper functions.
No silent ignores: If a feature (e.g., DP attention) is unsupported on a platform, either disable the UI option or show an explicit message (like a "Work In Progress" note). Never silently drop user selections.
Scope discipline: If adding support for one platform, don't accidentally add global flags. Always check conditionals: if (quantization === 'fp8') without a hardware guard affects ALL platforms. Be explicit: if (hardware === 'h200' && quantization === 'fp8').
License accuracy: Always verify the actual HuggingFace model license before writing the license section. Don't copy from other model docs — licenses vary (Apache 2.0, MIT, community licenses, etc.).
Create data/models/src/<version>/<modelname>.yaml:
- `default`: balanced single-node
- `high-throughput-dp`: if DP attention supported
- `speculative-mtp` or `speculative-eagle`: if speculative decoding supported

Valid `thinking_capability` enum values: `non_thinking`, `thinking`, `hybrid`. Don't use `hybrid_thinking` or other variants; pre-commit validation rejects them.
Ensure venv exists:
python3 -m venv .venv
source .venv/bin/activate && pip install pre-commit pyyaml
Compile and validate:
source .venv/bin/activate && python data/scripts/compile_models.py
cd data/schema && npm install && npm test
Commit both src/ AND generated/: the compile-model-configs pre-commit hook auto-runs compile_models.py whenever data/models/src/*.yaml changes and writes the output to data/models/generated/<version>/<model>.yaml — but the hook does NOT auto-stage those files; you still have to git add them yourself. CI runs python3 data/scripts/compile_models.py --check, which fails if the generated file is missing or out-of-date. Stage both paths:
git add data/models/src/<version>/<model>.yaml data/models/generated/<version>/<model>.yaml
If git status shows other data/models/generated/*.yaml files appearing after compile (e.g., a previous PR forgot to commit its generated output), those are a pre-existing gap unrelated to your change — but if leaving them uncommitted will break CI for your PR, include them in a separate commit with a note ("Add missing generated X.yaml from #NNN to unblock CI").
Full build (catches import errors, broken links, component issues — more reliable than dev server):
npm run build
Dev server for visual check:
npm start
Check the page renders at http://localhost:3000.
User deploys the model, runs test scripts, pastes results. Replace TODO placeholders with actual outputs:
Ask for:
- `mem-fraction-static` values

Add to docs.
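Once the user's results are pasted in, leftover placeholders are easy to miss. The following is a small sketch of a scan for them, assuming the only placeholder spellings are `Pending update...` and `TODO` as used in this workflow; `find_placeholders` is a hypothetical helper, and the sample text is made up.

```python
import re
from typing import List, Tuple

# Matches the two placeholder spellings used in this workflow.
PLACEHOLDER = re.compile(r"Pending update\.\.\.|\bTODO\b")


def find_placeholders(markdown: str) -> List[Tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that still contain a placeholder."""
    return [
        (number, line.strip())
        for number, line in enumerate(markdown.splitlines(), start=1)
        if PLACEHOLDER.search(line)
    ]


if __name__ == "__main__":
    sample = "## Benchmarks\nPending update...\n\nAccuracy: 0.95\nTODO: paste output\n"
    for number, line in find_placeholders(sample):
        print(f"line {number}: {line}")
```

Any hit should either be replaced with real output or explicitly left pending with the user's acknowledgement.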
Can be triggered with /add-model review. Also consider running /review-pr on the PR for an automated checklist pass.
Review the complete documentation for:
- Code fences (triple backticks)
- Launch command port matches the client `base_url` port on the same page
- `Pending update...` / `TODO` placeholders replaced with actual results, OR explicitly left pending with the user's acknowledgement
- Benchmark metadata names a quantization listed in Section 1; `Model: X (BF16)` for a model with no BF16 release is a factual bug
- Component `export default` matches the actual class name (common copy-paste bug)
- Launch commands use `sglang serve ...`, never `python -m sglang.launch_server` or `python3 -m sglang.launch_server` (deprecated)
- Benchmark commands use `python3 -m sglang.bench_serving ...`, never bare `python -m sglang.bench_serving`
- Reasoning examples print both `reasoning_content` and `content` (the latter can be `None` when the response is reasoning-only)
- Each example has **Output Example:** plus a fenced `text` block with real server output
- `modelConfigs` include both `tp` and `mem` values per hardware/quantization
- `--dp` value dynamically matches `--tp` in the generator
- Homepage (`docs/intro.md`) includes the new model entry and matches sidebar order
- `NEW` tags number 3 or fewer in `sidebars.js` AND `docs/intro.md` (counted independently)
- Raw object dumps (`ChatCompletionMessage(...)`) are formatted into readable structured output (Reasoning/Content/Tool Calls sections)
- Sample code matches the parser pattern: `reasoning_content` field for separate-field parsers (`kimi_k2` etc.), `<think>...</think>` tag parsing in `content` for inline-tag parsers (`minimax-append-think` etc.)
- Vendor has an entry in `data/models/vendors.yaml`
- Both `data/models/src/<ver>/<model>.yaml` AND `data/models/generated/<ver>/<model>.yaml` are committed; CI's `--check` mode fails on missing generated files
- No dead code (unused `commandRule`, unused helper functions, `getDynamicItems` returning static arrays)
- Environment variable instructions use a placeholder (`export VAR=<your-value>`), not `export VAR=${VAR}` (which is a bash no-op)

Always create a new branch; never commit to main directly.
git checkout -b add-<model-name>
# ... make changes ...
git add <specific files>
git commit -m "Add <Model Name> cookbook"
git push -u origin add-<model-name>
gh pr create --title "Add <Model Name> cookbook" --body "..."
When checking homepage entries, verify the doc has real content — not just a "Community contribution welcome" stub.