onboard-model
// End-to-end pipeline for discovering new LLMs with novel kernels and onboarding them into FlashInfer-Bench. Orchestrates repo updates, model discovery, kernel definition generation, workload collection, and PR submission.
Auto-collect workloads from SGLang inference runs using FlashInfer logging API. Dumps tensors, sanitizes them according to kernel definitions, and submits PR to flashinfer-trace workload repo.
Discover candidate LLMs and produce a kernel inventory — required definitions, classified as existing/new and fi_supported/fi_missing — for onboarding. Use as Phase 1 of /onboard-model, or standalone to plan onboarding work.
Generate Definition JSON files for the flashinfer-trace HuggingFace dataset by harvesting them from a short SGLang inference pass (FlashInfer's @flashinfer_api(trace=...) dumper) — or, as a fallback, by manually transcribing the schema from SGLang sources when FlashInfer doesn't yet have a trace template. Use when adding a new model, extracting GPU kernels (MLA, MoE, GQA, RMSNorm, GEMM, GDN, RoPE, sampling), or filling gaps in the dataset.
Add pytest tests to validate reference implementations in the flashinfer-trace HuggingFace dataset against FlashInfer or SGLang ground truth. Use when validating kernel definitions, adding tests for new op_types, or verifying reference implementations are correct.
Clone SGLang, FlashInfer, sgl-cookbook, and flashinfer-trace repositories to tmp/. Use when setting up the project, preparing for kernel extraction, or when the user needs the source repositories.
Open the per-definition pair of PRs that publishes a model onboarding — PR 2 to the HuggingFace flashinfer-trace dataset (definition + reference test + baseline solution + workloads + blobs + eval traces) and PR 1 to flashinfer-bench (docs/model_coverage.mdx update only). Use as Phase 4 of /onboard-model.
| name | onboard-model |
| description | End-to-end pipeline for discovering new LLMs with novel kernels and onboarding them into FlashInfer-Bench. Orchestrates repo updates, model discovery, kernel definition generation, workload collection, and PR submission. |
Thin orchestrator that runs the five-phase pipeline for adding a new LLM to FlashInfer-Bench. Each phase delegates to a focused skill — this file is the contract that chains them together via a shared run manifest.
| Phase | Skill | Output |
|---|---|---|
| 0 | /clone-repos | tmp/sglang/, tmp/flashinfer/, tmp/sgl-cookbook/, tmp/flashinfer-trace/ cloned and current |
| 1 | /discover-models | manifest kernels[] populated with phase1_status, fi_status, fi_trace_template, sgl_status |
| 2 | /extract-kernel-definitions (+ inline gh issue create for fi_missing) | definition JSONs in tmp/flashinfer-trace/definitions/ — auto-dumped via FLASHINFER_TRACE_DUMP=1 for kernels with fi_trace_template=true, otherwise hand-written; manifest phase2_status=done (and fi_issue_url for fi_missing) |
| 3 | /collect-workloads (+ inline SGLang PR for sgl_missing) | workloads + blobs in tmp/flashinfer-trace/; manifest phase3_status=done |
| 4 | /submit-onboarding-prs | one HF PR + one bench PR per definition; manifest phase4 populated |
The state contract between skills is the run manifest at
tmp/onboard_{model_slug}_{date}.json — see "Run manifest" below.
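As a minimal sketch of how the manifest path could be derived, assuming the `{date}` component uses the same ISO format as the `date` field in the manifest example (the helper name `manifest_path` is hypothetical, not part of any skill):

```python
from datetime import date
from pathlib import Path


def manifest_path(model_slug: str, run_date: date, root: Path = Path("tmp")) -> Path:
    """Build tmp/onboard_{model_slug}_{date}.json (ISO date format is an assumption)."""
    return root / f"onboard_{model_slug}_{run_date.isoformat()}.json"
```

For example, `manifest_path("qwen3-235b-a22b", date(2026, 4, 27))` yields `tmp/onboard_qwen3-235b-a22b_2026-04-27.json`.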
# Discover new models and fully onboard any that are ready
/onboard-model --discover
# Onboard a specific known model end-to-end
/onboard-model --model-name qwen3-235b-a22b --hf-repo-id Qwen/Qwen3-235B-A22B
# Run only specific phases (e.g. skip workload collection for now)
/onboard-model --model-name kimi-k2 --phases 0,1,2
# Dry-run: discover and report what would be done without making changes
/onboard-model --discover --dry-run
- --discover (optional): Auto-discover new models from SGLang day-0 additions and sgl-cookbook. Compatible with --model-name.
- --model-name (optional): Specific model slug to onboard (e.g. qwen3-235b-a22b).
- --hf-repo-id (optional): HuggingFace repo override. Inferred from --model-name if omitted.
- --phases (optional): Comma-separated list of phases to run (default: 0,1,2,3,4).
- --dry-run (optional): Print what would be done without writing files or submitting PRs.
- --skip-workload (optional): Skip Phase 3 (e.g. when no GPU is available).
- --submit-prs (optional): Submit Phase 4 PRs (default: true).

Delegate to the clone-repos skill.
/clone-repos
After the pull, capture the current SHAs and write them to the manifest's repo_shas:
git -C tmp/sglang rev-parse --short HEAD
git -C tmp/flashinfer rev-parse --short HEAD
git -C tmp/sgl-cookbook rev-parse --short HEAD
git -C tmp/flashinfer-trace rev-parse --short HEAD
Report the SHAs in the Phase 0 summary so the user can reproduce the run.
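The SHA capture above could be scripted roughly as follows. This is a sketch, not the skill's actual implementation: the helper names are hypothetical, and the `repo_shas` keys are taken from the manifest example later in this file.

```python
import subprocess
from typing import Callable

# Manifest key -> checkout directory, mirroring the four repos cloned in Phase 0.
REPOS = {
    "sglang": "tmp/sglang",
    "flashinfer": "tmp/flashinfer",
    "sgl_cookbook": "tmp/sgl-cookbook",
    "flashinfer_trace": "tmp/flashinfer-trace",
}


def git_short_sha(repo_dir: str) -> str:
    """Return the abbreviated HEAD commit of a local checkout."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "--short", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def collect_repo_shas(get_sha: Callable[[str], str] = git_short_sha) -> dict:
    """Build the manifest's repo_shas block; get_sha is injectable for testing."""
    return {key: get_sha(path) for key, path in REPOS.items()}
```

Assigning the result to `manifest["repo_shas"]` before Phase 1 keeps the run reproducible even if a later phase fails.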
Delegate to the discover-models skill, which produces the
manifest's kernels[] array — each entry tagged with phase1_status (existing/new),
fi_status (fi_supported/fi_missing/n-a), and sgl_status (sgl_integrated/sgl_missing/n-a).
# In auto-discover mode
/discover-models --discover --manifest tmp/onboard_{model_slug}_{date}.json
# In single-model mode
/discover-models --model-name {model_slug} --hf-repo-id {hf_repo_id} \
--manifest tmp/onboard_{model_slug}_{date}.json
Subsequent phases iterate over kernels and act based on the per-entry classification.
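One way to picture the per-entry routing is the sketch below. The function name and the returned labels (`collect`, `blocked_on_sglang_pr`) are hypothetical. It only reflects the classification rules stated in this file and does not check for workloads that already exist.

```python
def plan_kernel(kernel: dict) -> dict:
    """Decide Phase 2 and Phase 3 handling from a manifest kernel entry."""
    fi_missing = kernel.get("fi_status") == "fi_missing"

    # Phase 2: only kernels classified as new need a definition.
    if kernel.get("phase1_status") != "new":
        phase2 = "skip"
    elif kernel.get("fi_trace_template") and not fi_missing:
        phase2 = "trace_dump"  # auto-dumped via FLASHINFER_TRACE_DUMP=1
    else:
        phase2 = "manual"  # hand-written from SGLang sources

    # Phase 3: workload collection needs the kernel in FlashInfer and SGLang.
    if fi_missing:
        phase3 = "skip"
    elif kernel.get("sgl_status") == "sgl_missing":
        phase3 = "blocked_on_sglang_pr"
    else:
        phase3 = "collect"

    return {
        "phase2": phase2,
        "file_fi_issue": fi_missing and phase2 != "skip",
        "phase3": phase3,
    }
```

A decorated kernel that the model never exercises would still need the manual fallback described in Phase 2. The sketch selects only the primary method.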
For each kernel with phase1_status=new, generate a Definition JSON and write it into
tmp/flashinfer-trace/definitions/{op_type}/.
Delegate to extract-kernel-definitions. When
the kernel's FlashInfer API carries an @flashinfer_api(trace=...) decorator (i.e.
fi_trace_template=true in the manifest), one short SGLang inference pass with
FLASHINFER_TRACE_DUMP=1 and attention_backend="flashinfer" produces complete Definition
JSONs for every shape it touches — axes, inputs, outputs, tags (fi_api:*,
status:verified), and reference are filled in by the dumper. For decorated kernels
that the model didn't exercise (e.g. an unused page-size variant) or for FlashInfer APIs
not yet decorated, the same skill falls back to manual extraction from sgl-cookbook +
HF config.
/extract-kernel-definitions --model-name {sglang_model_name}
Verify each new definition now exists:
find tmp/flashinfer-trace/definitions/ -name "{definition_name}.json"
Update each kernel's phase2_status=done in the manifest.
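The verification and manifest update can be combined in one pass. The sketch below assumes the manifest has already been loaded into a dict. `mark_phase2_done` is a hypothetical helper, not something the skills provide.

```python
from pathlib import Path


def mark_phase2_done(manifest: dict, definitions_root: Path) -> list:
    """Set phase2_status=done for new kernels whose JSON exists.

    Returns the definition names that were not found, so the caller can
    report them instead of silently advancing to Phase 3.
    """
    missing = []
    for kernel in manifest["kernels"]:
        if kernel.get("phase1_status") != "new":
            continue
        name = kernel["definition_name"]
        # Same search as the find command: any op_type subdirectory may hold it.
        if any(definitions_root.rglob(f"{name}.json")):
            kernel["phase2_status"] = "done"
        else:
            missing.append(name)
    return missing
```

Called with `Path("tmp/flashinfer-trace/definitions")`, an empty return value means every new definition was written.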
When FlashInfer does not yet implement the kernel, the trace-dump path doesn't apply (no
decorated API to fire on). Generate the definition manually with SGLang's vanilla forward
as the reference (extract-kernel-definitions Path B), then file an issue against
flashinfer-ai/flashinfer. Mark the definition with the status:unverified tag.
gh issue create \
--repo flashinfer-ai/flashinfer \
--title "Kernel request: {op_type} for {model_name}" \
--label "enhancement,kernel-request" \
--body "$(cat <<'EOF'
## Kernel Request
**Model**: {model_display_name} ({hf_repo_id})
**Op type**: {op_type}
**Definition name**: {definition_name}
### Motivation
This kernel is required for serving **{model_display_name}** with FlashInfer.
A FlashInfer-Bench definition has been staged at:
`tmp/flashinfer-trace/definitions/{op_type}/{definition_name}.json`
(landing in the HuggingFace dataset PR for `flashinfer-ai/flashinfer-trace`)
### Kernel Parameters
{formatted parameter table from definition axes}
### Reference Implementation
A plain-PyTorch reference `run()` is available in the definition JSON above.
The SGLang implementation is at:
`python/sglang/srt/layers/{layer_path}`
### Requested Work
- [ ] CUDA/Triton kernel implementation matching the definition schema
- [ ] FlashInfer Python API (`flashinfer.{module}.{function}`)
- [ ] Unit test in `tests/test_{op_type}.py`
### Links
- HuggingFace dataset PR: (link once PR 2 is open)
- SGLang model: `tmp/sglang/python/sglang/srt/models/{model_file}`
- HuggingFace model: https://huggingface.co/{hf_repo_id}
EOF
)"
Record the issue URL in the manifest as fi_issue_url for that kernel and add it to the
definition's description:
"description": "... See flashinfer-ai/flashinfer#<issue_number> for kernel implementation request."
Do not proceed to Phase 3 for fi_missing kernels — workload collection requires the FlashInfer kernel to exist.
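Recording the issue for an fi_missing kernel touches both the manifest entry and the definition JSON. A minimal sketch, assuming both are already loaded as dicts (`link_fi_issue` is a hypothetical name):

```python
import re


def link_fi_issue(kernel: dict, definition: dict, issue_url: str) -> None:
    """Store the issue URL on the kernel entry and reference it in the description."""
    match = re.search(r"/issues/(\d+)$", issue_url)
    if match is None:
        raise ValueError(f"not a GitHub issue URL: {issue_url}")
    kernel["fi_issue_url"] = issue_url
    note = (
        f"See flashinfer-ai/flashinfer#{match.group(1)} "
        "for kernel implementation request."
    )
    description = definition.get("description", "").rstrip()
    # Appending only once keeps the helper safe to call again on a resumed run.
    if note not in description:
        definition["description"] = f"{description} {note}".strip()
```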
Skip Phase 3 entirely for fi_missing kernels.
test -s tmp/flashinfer-trace/workloads/{op_type}/{definition_name}.jsonl && echo present
If the JSONL exists and is non-empty, mark phase3_status=done and skip.
For kernels classified sgl_missing, wire {fi_api} into the appropriate SGLang layer file
and open a PR against sgl-project/sglang. The change is typically small (import +
conditional dispatch).
# In layers/attention/flashinfer_backend.py or equivalent
try:
    from flashinfer.{module} import {KernelClass}
    FLASHINFER_{KERNEL}_AVAILABLE = True
except ImportError:
    FLASHINFER_{KERNEL}_AVAILABLE = False

def forward(...):
    if FLASHINFER_{KERNEL}_AVAILABLE and use_flashinfer:
        return {KernelClass}(...).run(...)
    else:
        return vanilla_forward(...)
cd tmp/sglang
git checkout -b feat/flashinfer-{op_type}-integration-{model_slug}
git add python/sglang/srt/layers/...
git commit -m "feat: integrate FlashInfer {op_type} for {model_name}
Wire {fi_api} into SGLang's FlashInfer backend to enable
optimized {op_type} for {model_display_name}.
Needed for: flashinfer-ai/flashinfer-bench (workload collection)
FlashInfer API: {fi_api}
"
pre-commit run --all-files
git push origin HEAD
gh pr create \
--repo sgl-project/sglang \
--title "feat: integrate FlashInfer {op_type} for {model_name}" \
--body "$(cat <<'EOF'
## Summary
- Integrates `{fi_api}` into SGLang's FlashInfer backend
- Enables optimized `{op_type}` kernel for **{model_display_name}**
- Required by flashinfer-ai/flashinfer-bench for workload collection
## Changes
- `python/sglang/srt/layers/{path}`: add FlashInfer dispatch for {op_type}
## Test plan
- [ ] SGLang unit test passes with `--attention-backend flashinfer`
- [ ] Inference output matches non-FlashInfer baseline
- [ ] Memory usage within expected bounds
EOF
)"
Record the SGLang PR URL on the kernel entry. Pause Phase 3 for that kernel until the PR
merges; resume the run with --phases 3,4 once it lands.
Delegate to collect-workloads:
/collect-workloads \
--definition-names {definition_name} \
--model-name {model_name} \
--submit-pr false # PRs are submitted in Phase 4
Verify outputs and update the manifest:
ls tmp/flashinfer-trace/workloads/{op_type}/{definition_name}.jsonl
ls tmp/flashinfer-trace/blob/workloads/{op_type}/{definition_name}/
Mark phase3_status=done and record workload_entries.
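The Phase 3 bookkeeping might look like the sketch below. It assumes that `workload_entries` is the number of non-blank lines in the JSONL file, which this file does not state explicitly. `record_workloads` is a hypothetical helper.

```python
from pathlib import Path


def record_workloads(kernel: dict, trace_root: Path) -> bool:
    """Mark phase3_status=done if the kernel's workload JSONL is non-empty."""
    jsonl = (
        trace_root
        / "workloads"
        / kernel["op_type"]
        / f"{kernel['definition_name']}.jsonl"
    )
    if not jsonl.is_file():
        return False
    with jsonl.open() as handle:
        # Assumption: one workload per non-blank line.
        entries = sum(1 for line in handle if line.strip())
    if entries == 0:
        return False
    kernel["phase3_status"] = "done"
    kernel["workload_entries"] = entries
    return True
```

The same check also serves as the skip test at the start of Phase 3: a `True` result means collection can be skipped for that kernel.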
Delegate to submit-onboarding-prs. It creates the
per-definition worktrees, spawns one agent per definition, and opens PR 2 (HuggingFace
dataset) followed by PR 1 (flashinfer-bench coverage doc) for each.
/submit-onboarding-prs --manifest tmp/onboard_{model_slug}_{date}.json
The skill writes back to the manifest's phase4 block with the resulting PR URLs.
The PR Review Checklist and Agent TASK.md template both live inside that skill — refer to
submit-onboarding-prs/SKILL.md rather than duplicating them here.
The run manifest is the contract between skills. It is stored at tmp/onboard_{model_slug}_{date}.json:
{
  "model_slug": "qwen3-235b-a22b",
  "hf_repo_id": "Qwen/Qwen3-235B-A22B",
  "date": "2026-04-27",
  "repo_shas": {
    "sglang": "abc1234",
    "flashinfer": "def5678",
    "sgl_cookbook": "ghi9012",
    "flashinfer_trace": "jkl3456"
  },
  "kernels": [
    {
      "definition_name": "gqa_paged_decode_h40_kv8_d128_ps1",
      "op_type": "gqa_paged",
      "phase1_status": "new",
      "fi_status": "fi_supported",
      "fi_trace_template": true,
      "sgl_status": "sgl_integrated",
      "phase2_status": "done",
      "phase2_method": "trace_dump",
      "phase3_status": "done",
      "workload_entries": 8
    },
    {
      "definition_name": "rope_with_cos_sin_cache_neox_style_d128_rd128",
      "op_type": "rope",
      "phase1_status": "new",
      "fi_status": "fi_supported",
      "fi_trace_template": false,
      "fi_trace_template_request_url": "https://github.com/flashinfer-ai/flashinfer/issues/3001",
      "sgl_status": "sgl_integrated",
      "phase2_status": "done",
      "phase2_method": "manual"
    },
    {
      "definition_name": "new_op_h512",
      "op_type": "new_op",
      "phase1_status": "new",
      "fi_status": "fi_missing",
      "sgl_status": "n/a",
      "fi_issue_url": "https://github.com/flashinfer-ai/flashinfer/issues/999",
      "phase2_status": "done",
      "phase3_status": "skipped (fi_missing)"
    }
  ],
  "phase4": {
    "gqa_paged_decode_h40_kv8_d128_ps1": {
      "flashinfer_trace_pr": "https://huggingface.co/datasets/flashinfer-ai/flashinfer-trace/discussions/42",
      "flashinfer_bench_pr": "https://github.com/flashinfer-ai/flashinfer-bench/pull/57"
    }
  }
}
Re-running /onboard-model with the same --model-name loads the existing manifest and
skips any kernel/phase already marked done — phases resume from the first incomplete step.
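The resume rule can be sketched as a lookup of the first incomplete phase per kernel. Two assumptions are made here that the text does not confirm: a status beginning with "skipped" counts as complete, and a kernel's Phase 4 is complete once its definition name appears in the manifest's `phase4` block. The function name is hypothetical.

```python
from typing import Optional


def first_incomplete_phase(manifest: dict, kernel: dict) -> Optional[int]:
    """Return the first phase (2, 3, or 4) still pending for a kernel, or None."""
    if kernel.get("phase2_status") != "done":
        return 2
    phase3 = kernel.get("phase3_status", "")
    # Assumption: "skipped (...)" is terminal, like "done".
    if phase3 != "done" and not phase3.startswith("skipped"):
        return 3
    # Assumption: an entry in the phase4 block means the PRs were opened.
    if kernel["definition_name"] not in manifest.get("phase4", {}):
        return 4
    return None
```

Applied to the manifest example above, the GQA kernel is fully done, the RoPE kernel resumes at Phase 3, and the fi_missing kernel resumes at Phase 4.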
- gh authenticated with write access to flashinfer-ai/flashinfer.
- gh authenticated with write access to sgl-project/sglang.
- --num-samples, or reduce --tp.
- nvidia-smi, reduce model size or --tp.
- FLASHINFER_DUMP_INCLUDE matches the actual API name from the definition's fi_api tag.
- submit-onboarding-prs "Error Handling".

- gh CLI installed and authenticated (gh auth status)
- huggingface_hub Python package installed and logged in (huggingface-cli login)
- /clone-repos has been run at least once (Phase 0 will update)

This file is a thin orchestrator. Update it only when:
All phase-specific procedures live in the phase skills linked from the overview table at the top — keep procedural detail there, not here.
To run a single phase by itself, invoke its skill directly (each is documented in its own
SKILL.md); /onboard-model --phases N is the same thing routed through the orchestrator
so the manifest stays in sync.
submit-onboarding-prs)