discover-models
// Discover candidate LLMs and produce a kernel inventory — required definitions, classified as existing/new and fi_supported/fi_missing — for onboarding. Use as Phase 1 of /onboard-model, or standalone to plan onboarding work.
| name | discover-models |
| description | Discover candidate LLMs and produce a kernel inventory — required definitions, classified as existing/new and fi_supported/fi_missing — for onboarding. Use as Phase 1 of /onboard-model, or standalone to plan onboarding work. |
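Identify target models and produce a per-kernel inventory: the required definitions, each classified as existing or new (checked against tmp/flashinfer-trace/definitions/), and each new definition classified as fi_supported or fi_missing.
Produces the kernels block of the onboard-model run manifest.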
# Auto-discover candidate models added to SGLang in the last 30 days
/discover-models --discover
# Plan a specific model
/discover-models --model-name qwen3-235b-a22b --hf-repo-id Qwen/Qwen3-235B-A22B
# Write the inventory to a manifest file (consumed by /onboard-model)
/discover-models --model-name kimi-k2 --manifest tmp/onboard_kimi-k2_20260427.json
- --discover (optional): Auto-discover candidates from SGLang day-0 additions and sgl-cookbook YAMLs.
- --model-name (optional): Specific model slug to plan (e.g. qwen3-235b-a22b).
- --hf-repo-id (optional): HuggingFace repo override (e.g. Qwen/Qwen3-235B-A22B). Inferred from --model-name if omitted.
- --manifest (optional): Path to an onboard-model run manifest. The skill writes the model_slug, hf_repo_id, repo_shas, and kernels array. If the file already exists, fields are merged; existing per-kernel statuses are preserved unless --refresh is set.
- --refresh (optional): Re-classify all kernels even if entries already exist in the manifest.

/clone-repos has been run, so tmp/sglang/, tmp/flashinfer/, tmp/sgl-cookbook/, and tmp/flashinfer-trace/ are present and current.
huggingface_hub is installed and (for gated models) authenticated.

Run only when --discover is set.
Day-0 SGLang additions (highest priority — production-ready):
git -C tmp/sglang log --since="30 days ago" --name-status --diff-filter=A \
-- "python/sglang/srt/models/*.py" | grep "^A" | awk '{print $2}'
Models with a brand-new .py under python/sglang/srt/models/ in the last 30 days are
day-0 candidates. Parse the model class to derive a slug.
sgl-cookbook new entries:
git -C tmp/sgl-cookbook log --since="30 days ago" --name-status --diff-filter=A \
-- "data/models/generated/v0.5.6/*.yaml" | grep "^A" | awk '{print $2}'
A new YAML signals a model with a recommended serving config.
Filter already-tracked models: read docs/model_coverage.mdx Summary table and skip any
candidate already listed.
For each candidate (or the specified --model-name):
import json

from huggingface_hub import hf_hub_download

# hf_repo_id comes from --hf-repo-id, or is inferred from --model-name.
config_path = hf_hub_download(repo_id=hf_repo_id, filename="config.json")
with open(config_path) as f:
    config = json.load(f)
Key fields to extract: see track-models SKILL.md for the full config.json → kernel param
mapping table.
Use the per-op-type formulas in track-models Phase 3a to compute the expected definition
names from the model config and the sgl-cookbook TP/EP values. Each formula yields a fully
qualified definition name like gqa_paged_decode_h40_kv8_d128_ps1.
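As a rough illustration of how config values turn into a definition name, the sketch below builds a name in the same shape as the example above. The authoritative formulas live in track-models Phase 3a. The helper name `gqa_paged_decode_name`, the reading of `h` and `kv` as per-rank head counts after dividing by the TP size, and the input values are all assumptions made for illustration.

```python
def gqa_paged_decode_name(config: dict, tp_size: int, page_size: int) -> str:
    """Hypothetical helper: format a gqa_paged_decode definition name.

    Assumes h/kv are per-rank head counts and d is the head dimension;
    check track-models Phase 3a for the real formula.
    """
    heads = config["num_attention_heads"] // tp_size
    # Assumed: KV heads never drop below one per rank.
    kv_heads = max(1, config["num_key_value_heads"] // tp_size)
    # Fall back to hidden_size / num_attention_heads when head_dim is absent.
    head_dim = config.get("head_dim") or (
        config["hidden_size"] // config["num_attention_heads"]
    )
    return f"gqa_paged_decode_h{heads}_kv{kv_heads}_d{head_dim}_ps{page_size}"


# Illustrative values only, not taken from a real config.json.
example_config = {
    "num_attention_heads": 80,
    "num_key_value_heads": 16,
    "head_dim": 128,
}
print(gqa_paged_decode_name(example_config, tp_size=2, page_size=1))
# prints gqa_paged_decode_h40_kv8_d128_ps1
```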
For each expected definition name, search the HuggingFace dataset clone (definitions live only there after the trace-dataset refactor):
find tmp/flashinfer-trace/definitions/ -name "{definition_name}.json"
| Result | Classification |
|---|---|
| File found | existing — no new definition needed |
| Not found | new — proceed to FlashInfer-availability classification |
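The existing/new check can also be done in one pass from Python. This is a minimal sketch that mirrors the `find` command above; the function name `classify_definitions` is hypothetical, and it assumes a definition is identified purely by its file name anywhere under the definitions tree.

```python
from pathlib import Path


def classify_definitions(names, definitions_root):
    """Return {name: "existing" | "new"} by looking for <name>.json recursively."""
    root = Path(definitions_root)
    # Index every definition file stem once instead of walking the tree per name.
    present = {p.stem for p in root.rglob("*.json")} if root.is_dir() else set()
    return {name: ("existing" if name in present else "new") for name in names}
```

For example, `classify_definitions(expected_names, "tmp/flashinfer-trace/definitions")` would return a dict that can be copied into each manifest entry's phase1_status, where `expected_names` is whatever list the previous step produced.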
For each new definition, determine whether FlashInfer already implements the underlying kernel.
| op_type | Check path in tmp/flashinfer/ |
|---|---|
| rmsnorm | flashinfer/norm.py (grep for rmsnorm) |
| gqa_paged | flashinfer/decode.py, flashinfer/prefill.py |
| gqa_ragged | flashinfer/prefill.py |
| mla_paged | flashinfer/mla.py |
| dsa_paged | flashinfer/sparse.py |
| gdn | flashinfer/gdn.py or flashinfer/gdn/ |
| moe | flashinfer/fused_moe/ (check the specific variant) |
| gemm | always available via PyTorch |
| sampling | flashinfer/sampling.py |
| mamba_ssu | flashinfer/mamba.py (grep for selective_state_update) |
| rope | flashinfer/rope.py (grep for apply_rope_with_cos_sin_cache) |
Also check tmp/flashinfer/tests/ for a corresponding test file — its presence is a strong
signal the kernel is implemented and tested.
A stronger signal that the kernel is fully ready for the trace-dump path (Path A in
extract-kernel-definitions) is whether the
FlashInfer API carries an @flashinfer_api(trace=...) decorator (added by
flashinfer-ai/flashinfer#2931).
Check with:
grep -rn "@flashinfer_api(trace=" tmp/flashinfer/flashinfer/ | grep -i "{module_or_api}"
If the API is decorated, Phase 2 can produce its Definition JSON automatically by running
a short SGLang inference pass with FLASHINFER_TRACE_DUMP=1. If FlashInfer has the kernel
but not the decorator, classification is still fi_supported but Phase 2 falls back to
manual extraction. Record the decorator-presence flag on the manifest entry as
fi_trace_template (true/false) so reviewers know which path to expect.
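If the flag needs to be computed in a script rather than by eye, the grep above can be approximated in Python. This is a sketch under assumptions: the function name `find_trace_decorators` is hypothetical, only `.py` files are scanned (the `grep -r` above scans every file), and the match is a plain case-insensitive substring test on `path:line:text`, like the second grep.

```python
from pathlib import Path

TRACE_MARKER = "@flashinfer_api(trace="


def find_trace_decorators(flashinfer_pkg, module_or_api):
    """List 'path:lineno:text' records for trace decorators matching the name."""
    needle = module_or_api.lower()
    hits = []
    for path in sorted(Path(flashinfer_pkg).rglob("*.py")):
        text = path.read_text(errors="ignore")
        for lineno, line in enumerate(text.splitlines(), 1):
            if TRACE_MARKER not in line:
                continue
            record = f"{path}:{lineno}:{line.strip()}"
            # Like `grep -i`, match against the path as well as the line text.
            if needle in record.lower():
                hits.append(record)
    return hits
```

The manifest flag would then be `fi_trace_template = bool(find_trace_decorators("tmp/flashinfer/flashinfer", name))`. Because the match also covers the file path, a module-level name such as `norm` matches every decorated API in that module, so the hits are worth a quick manual look.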
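Classify each new definition:
- fi_supported: FlashInfer implements the kernel (Phase 2 uses the trace dump if fi_trace_template=true, else manual extraction; see extract-kernel-definitions).
- fi_missing: FlashInfer has no implementation of the kernel; file a kernel-request issue and skip workload collection.

For each fi_supported definition, determine whether SGLang already routes through the
FlashInfer kernel. The result drives Phase 3 (workload collection).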
# Use the fi_api tag from the definition (or the expected wrapper name) to grep:
grep -r "{flashinfer_api_name}" tmp/sglang/python/sglang/srt/ 2>/dev/null | grep -v __pycache__
Common mapping:
| fi_api | SGLang integration file | Search term |
|---|---|---|
| flashinfer.mla.BatchMLAPagedAttentionWrapper | layers/attention/flashinfer_backend.py | BatchMLAPagedAttentionWrapper |
| flashinfer.decode.BatchDecodeWithPagedKVCacheWrapper | layers/attention/flashinfer_backend.py | BatchDecodeWithPagedKVCacheWrapper |
| flashinfer.prefill.BatchPrefillWithPagedKVCacheWrapper | layers/attention/flashinfer_backend.py | BatchPrefillWithPagedKVCacheWrapper |
| flashinfer.norm.rmsnorm | layers/layernorm.py | flashinfer.norm |
| flashinfer.fused_moe.trtllm_fp8_block_scale_moe | layers/moe/fused_moe.py | trtllm_fp8_block_scale_moe |
| flashinfer.gdn.gated_delta_rule_decode | layers/attention/gdn_backend.py | gated_delta_rule_decode |
| flashinfer.mamba.selective_state_update | layers/mamba/mamba_mixer.py | selective_state_update |
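Classify, and record the result as sgl_status:
- sgl_integrated: the grep finds a call site under tmp/sglang/python/sglang/srt/, so the definition is ready for workload collection.
- SGLang missing: no call site is found; submit an SGLang PR first.

For fi_missing definitions, SGLang integration is moot (no FlashInfer kernel to call), so set sgl_status to n/a.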
Print a classification table:
Model: Qwen3-235B-A22B
HF repo: Qwen/Qwen3-235B-A22B
Architecture: 94 layers, GQA + MoE
Kernel inventory:
EXISTING (skip):
✅ rmsnorm_h7168
✅ moe_fp8_block_scale_ds_routing_topk8_ng8_kg4_e32_h7168_i2048
NEW — FlashInfer supported, SGLang integrated → ready for workload collection:
🆕 gqa_paged_decode_h40_kv8_d128_ps1
🆕 gqa_paged_decode_h40_kv8_d128_ps64
NEW — FlashInfer supported, SGLang missing → submit SGLang PR first:
🆕 dsa_topk_indexer_fp8_h64_d128_topk2048_ps64
NEW — FlashInfer MISSING → file kernel-request issue, skip workload collection:
❓ <new_op_type>_<params>
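The four groups in this report follow mechanically from the three status fields. The sketch below shows one way to express that mapping; the function name `next_action` is hypothetical, the returned strings paraphrase the report headers, and any sgl_status other than sgl_integrated is treated as "SGLang missing" because the exact value used for that case is not shown here.

```python
def next_action(kernel: dict) -> str:
    """Map one inventory entry to the report group it belongs to."""
    if kernel.get("phase1_status") == "existing":
        return "skip"
    if kernel.get("fi_status") == "fi_missing":
        return "file kernel-request issue; skip workload collection"
    if kernel.get("sgl_status") == "sgl_integrated":
        return "ready for workload collection"
    # Assumed: FlashInfer has the kernel but SGLang does not call it yet.
    return "submit SGLang PR first"


entry = {
    "definition_name": "gqa_paged_decode_h40_kv8_d128_ps1",
    "phase1_status": "new",
    "fi_status": "fi_supported",
    "sgl_status": "sgl_integrated",
}
print(next_action(entry))  # prints ready for workload collection
```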
When --manifest <path> is set, write/update a JSON file with this shape (the same manifest
consumed by /onboard-model and /submit-onboarding-prs):
{
  "model_slug": "qwen3-235b-a22b",
  "hf_repo_id": "Qwen/Qwen3-235B-A22B",
  "date": "2026-04-27",
  "repo_shas": {
    "sglang": "abc1234",
    "flashinfer": "def5678",
    "sgl_cookbook": "ghi9012",
    "flashinfer_trace": "jkl3456"
  },
  "kernels": [
    {
      "definition_name": "gqa_paged_decode_h40_kv8_d128_ps1",
      "op_type": "gqa_paged",
      "phase1_status": "new",
      "fi_status": "fi_supported",
      "fi_trace_template": true,
      "sgl_status": "sgl_integrated"
    },
    {
      "definition_name": "rmsnorm_h7168",
      "op_type": "rmsnorm",
      "phase1_status": "existing"
    },
    {
      "definition_name": "new_op_h512",
      "op_type": "new_op",
      "phase1_status": "new",
      "fi_status": "fi_missing",
      "sgl_status": "n/a"
    }
  ]
}
Existing entries written by later phases (phase2_status, phase3_status, workload_entries,
fi_issue_url, phase4) are preserved on update.
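The merge behaviour described for --manifest and --refresh can be sketched as follows. This is an illustration under assumptions, not the skill's actual implementation: the function name `merge_kernels` is hypothetical, entries are matched by definition_name, and --refresh is taken to overwrite only the fields that Phase 1 itself produces.

```python
def merge_kernels(existing, fresh, refresh=False):
    """Merge freshly classified kernel entries into the manifest's kernels list."""
    # Copy each entry so the caller's list is not mutated.
    by_name = {k["definition_name"]: dict(k) for k in existing}
    for entry in fresh:
        name = entry["definition_name"]
        current = by_name.setdefault(name, {"definition_name": name})
        for key, value in entry.items():
            # Without refresh, keep any status that was already recorded.
            if refresh or key not in current:
                current[key] = value
    # Fields written by later phases never appear in `fresh`, so they survive.
    return list(by_name.values())
```

With this shape, a field such as phase2_status on an existing entry is untouched in both modes, while fi_status is replaced only when `refresh=True`.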