ad-model-onboard

Translates a HuggingFace model into a prefill-only AutoDeploy custom model using reference custom ops, validates with hierarchical equivalence tests.

Stars: 13,870
Forks: 2,466
Updated: May 25, 2026 at 10:13
Source: NVIDIA/TensorRT-LLM

SKILL.md

AutoDeploy Model Onboarding

Input: HuggingFace model ID. Output: prefill-only custom model file + hierarchical tests + summary report.

Phase 0 — Gather All Resources Upfront

Web/GitHub fetches require user approval and the user may leave. Do ALL network access now and save locally before proceeding.

Step 0 — GPU memory sanity check

Before anything else, check whether the model can fit on the current system.

Run nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits. It prints one line per GPU with that GPU's total memory in MiB; sum the lines to get the total VRAM across all GPUs on the system.
Estimate the model's memory footprint from the HuggingFace model card or config (number of parameters × bytes per parameter, e.g. 7B params × 2 bytes = ~14 GB for bfloat16).
If the estimated size exceeds total system VRAM, stop and report this to the user. Do not proceed with onboarding until the user acknowledges and decides how to proceed. Example message: "This model requires ~X GB but the system only has Y GB across N GPUs. Onboarding is likely to fail at the e2e run stage."
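The check above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the helper names are hypothetical, the sample nvidia-smi output is made up, and the estimate covers weights only (activations and caches are ignored).

```python
def total_vram_mib(nvidia_smi_output: str) -> int:
    """Sum per-GPU totals; nvidia-smi prints one MiB value per line, one line per GPU."""
    return sum(int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip())


def estimated_footprint_mib(num_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only estimate: parameters x bytes per parameter, converted to MiB."""
    return num_params * bytes_per_param / (1024 ** 2)


def fits_in_vram(num_params: float, nvidia_smi_output: str, bytes_per_param: int = 2) -> bool:
    """True when the weights-only estimate does not exceed the summed VRAM."""
    return estimated_footprint_mib(num_params, bytes_per_param) <= total_vram_mib(nvidia_smi_output)


# Hypothetical output of the nvidia-smi query on a two-GPU system.
sample_output = "81559\n81559\n"

print(total_vram_mib(sample_output))      # summed MiB across both GPUs
print(fits_in_vram(7e9, sample_output))   # 7B params in bfloat16
print(fits_in_vram(100e9, sample_output)) # 100B params in bfloat16
```

If the last check comes back false, that is the point at which to stop and report to the user rather than continue.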

Step 1 — Check local transformers install first:

python -c "import transformers; print(transformers.__file__)"

Look for models/{model_type}/modeling_*.py under that path. If found, use it directly — no network needed.
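The lookup can also be scripted. The sketch below assumes the directory layout described above (models/{model_type}/modeling_*.py under the package root); the function name is hypothetical, and the demo runs against a throwaway directory so that it does not depend on transformers being installed.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def find_local_modeling_files(model_type: str, transformers_root: Optional[Path] = None) -> List[Path]:
    """Return modeling_*.py files for model_type under a transformers package root."""
    if transformers_root is None:
        # Imported lazily so the helper also works when an explicit root is given.
        import transformers
        transformers_root = Path(transformers.__file__).parent
    model_dir = Path(transformers_root) / "models" / model_type
    if not model_dir.is_dir():
        return []
    return sorted(model_dir.glob("modeling_*.py"))


# Demo against a temporary directory that mimics the package layout.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "models" / "qwen3"
    model_dir.mkdir(parents=True)
    (model_dir / "modeling_qwen3.py").write_text("# placeholder\n")
    (model_dir / "configuration_qwen3.py").write_text("# placeholder\n")
    found = [p.name for p in find_local_modeling_files("qwen3", Path(tmp))]
    missing = find_local_modeling_files("not_a_model", Path(tmp))

print(found)
print(missing)
```

An empty result means the model is not covered by the local install and Step 2 applies.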

Step 2 — If not found, download the HF repo (code only, skip weights):

huggingface-cli download {org}/{model} --exclude "*.safetensors" "*.bin" "*.pt" "*.gguf"

This downloads config, code, and tokenizer files into the standard HF cache ($HF_HOME or ~/.cache/huggingface/) while skipping large weight files. Files cached here are automatically found by transformers.AutoConfig.from_pretrained and similar APIs — no extra path wiring needed. Once downloaded you can work fully offline — read config.json and modeling_*.py from the cache snapshot directory printed by the command.

Phase 1 — Survey Existing Coverage & Analyze HF Model

Step 1 — Check for existing AD custom modeling code

Before writing anything, check if an AD custom model already covers this architecture:

Read the model's config.json to find its model_type and architectures fields.
Search tensorrt_llm/_torch/auto_deploy/models/custom/ for existing modeling_*.py files that register the same config class name (grep for the architectures value or model_type).
Also check tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py for existing registrations.
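A rough Python equivalent of the grep step is sketched below. The function names are hypothetical and the demo files are placeholders; a plain substring match can produce false positives (for example, a model_type that is a prefix of another), so treat hits as candidates to read, not as proof of coverage.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def read_arch_identifiers(config_path: Path) -> List[str]:
    """Collect the architectures values and model_type from a config.json."""
    config = json.loads(Path(config_path).read_text())
    names = list(config.get("architectures") or [])
    if config.get("model_type"):
        names.append(config["model_type"])
    return names


def find_existing_coverage(custom_dir: Path, identifiers: List[str]) -> Dict[str, List[str]]:
    """Map each .py file in custom_dir (including __init__.py) to the identifiers it mentions."""
    hits: Dict[str, List[str]] = {}
    for path in sorted(Path(custom_dir).glob("*.py")):
        text = path.read_text(errors="ignore")
        matched = [name for name in identifiers if name in text]
        if matched:
            hits[path.name] = matched
    return hits


# Demo with placeholder files standing in for the checkpoint config and the custom/ directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "config.json").write_text(
        json.dumps({"model_type": "qwen3", "architectures": ["Qwen3ForCausalLM"]})
    )
    custom = root / "custom"
    custom.mkdir()
    (custom / "modeling_qwen3.py").write_text("class Qwen3ForCausalLM:\n    pass\n")
    (custom / "modeling_other.py").write_text("class OtherForCausalLM:\n    pass\n")
    identifiers = read_arch_identifiers(root / "config.json")
    hits = find_existing_coverage(custom, identifiers)

print(identifiers)
print(hits)
```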

If existing code is found:

Read it carefully. It may already handle this exact model — in which case no new modeling file is needed, only registry entries and possibly tests.
If the existing code covers a closely related model in the same family but needs adaptation (e.g., the family added MoE in a newer variant, or changed the attention type), decide whether to extend the existing file or create a new one. Prefer extending if the changes are minor; create a new file if the architecture diverges significantly. Report the decision and rationale to the user before proceeding.

If no existing code is found: proceed to write a new model file in Phase 2.

Step 2 — Survey the model family in the registry

Check examples/auto_deploy/model_registry/models.yaml for other models from the same family (e.g., if asked to onboard Qwen/Qwen3-8B, look for Qwen/Qwen3-0.6B, Qwen/Qwen3-32B, Qwen/Qwen3-235B-A22B, etc.). Also check HuggingFace for the full set of model sizes/variants in the family.

Identify which family members already have registry entries and which are missing.
Identify which family members share the same architecture (same model_type / architectures in their config) — these can all use a single modeling file.
Plan to onboard the entire family cohesively: one modeling file + one test file should cover all members that share an architecture. The registry should have entries for all commonly-used sizes.
Report the family survey findings to the user: which models exist, which are missing, and the proposed plan for covering them all.

Step 3 — Analyze HF model architecture

Study the locally-available config.json and modeling_*.py (NOT from tensorrt_llm/_torch/models/). Identify attention type (MHA/GQA/MLA), MoE config, RoPE variant, normalization, activation, and any data-dependent ops that break torch.export (e.g. torch.nonzero, data-conditioned if).
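As a first pass over the HF source, a simple text scan can flag lines worth a closer look. The sketch below is an assumption-laden aid, not an official check: the pattern list is illustrative and incomplete (.item() and .tolist() are added here as further examples of tensor values being read into Python), and a data-conditioned if cannot be found this way and still needs a manual read.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; this is not an exhaustive or official list.
SUSPECT_PATTERNS = {
    "nonzero": re.compile(r"\bnonzero\s*\("),    # torch.nonzero(x) or x.nonzero()
    "item": re.compile(r"\.item\s*\(\s*\)"),     # tensor value read into Python
    "tolist": re.compile(r"\.tolist\s*\(\s*\)"), # tensor contents read into Python
}


def scan_for_suspect_ops(source: str) -> List[Tuple[int, str]]:
    """Return (line number, pattern name) pairs for lines matching a suspect pattern."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore trailing comments
        for name, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(code):
                findings.append((lineno, name))
    return findings


sample_source = (
    "idx = torch.nonzero(mask)\n"
    "y = x + 1\n"
    "n = mask.sum().item()  # count\n"
    "# torch.nonzero mentioned in a comment only\n"
)
findings = scan_for_suspect_ops(sample_source)
print(findings)
```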

Phase 2 — Write a Lean Prefill-Only Model

Create tensorrt_llm/_torch/auto_deploy/models/custom/modeling_{name}.py. Use modeling_glm4_moe_lite.py as a structural template only (class layout, dataclass outputs, forward signature).

The goal is a minimal prefill-only model for torch.export with AD canonical IR ops. Keep the code as lean as possible — every line should serve the export path. Do not port HF features that AD doesn't need.

Strip: KV cache, training paths, dropout, flash attention variants, repeat_interleave/repeat_kv for GQA (AD attention ops handle this natively), fallback logic for generating position_ids (assert instead), optional code paths gated on config flags irrelevant to prefill export.

Keep: PreTrainedModel hierarchy, ModelOutput dataclass, minimal forward (input_ids, position_ids, inputs_embeds=None, **kwargs).

Critical: Make sure the nn.Module hierarchy of the custom modeling code matches the parameter names that the checkpoint's safetensors index JSON expects.

Critical rule: Do NOT import or reuse existing AD custom model code (e.g. from .modeling_deepseek import ...). Every modeling_{name}.py must be self-contained. Use the HF source gathered in Phase 0 (modeling_*.py) as the source of truth for the model's logic and translate it fresh, even if a structurally similar AD model already exists. This prevents hidden coupling, makes each model auditable on its own, and ensures model-specific quirks are captured correctly.

Phase 3 — Use AutoDeploy Canonical Ops (CRITICAL)

Use torch.ops.auto_deploy.torch_* canonical ops WHENEVER POSSIBLE. These are the IR nodes that AD transforms later replace with optimized backends (triton, flashinfer, trtllm) at deployment time. If a canonical op exists for an operation, you MUST use it — do not reimplement the logic in plain PyTorch.

Available canonical ops (see tensorrt_llm/_torch/auto_deploy/custom_ops/README.md for full list):

Attention: torch_attention, torch_attention_sdpa, torch_attention_repeat_kv
MLA: torch_mla
RoPE: torch_rope_with_explicit_cos_sin, torch_rope_with_complex_freqs, torch_rope_with_qk_interleaving
MoE: torch_moe, torch_moe_fused, torch_moe_router, torch_moe_dense_mlp
Normalization: torch_rmsnorm, torch_rmsnorm_gated, torch_l2norm
Linear: torch_linear_simple
SSM/Mamba: torch_ssm, torch_causal_conv1d
FLA: torch_gated_delta_rule
Quantization: torch_quant_fp8_linear, torch_quant_nvfp4_linear, etc.

Never use triton_*/flashinfer_*/trtllm_* — backend selection happens later in AD transforms. Plain PyTorch is acceptable ONLY for operations where no canonical op exists (e.g., simple activation functions, embedding lookups, basic tensor arithmetic). If you find yourself writing manual attention, MoE routing, RoPE, or normalization in plain PyTorch, stop and use the canonical op instead.

Do NOT use repeat_interleave or repeat_kv for GQA. HF reference code often repeats K/V heads to match the Q head count before attention. The AD canonical attention ops (torch_attention, torch_attention_sdpa) handle GQA natively — they accept Q, K, V with different head counts and do the right thing internally. Manually repeating K/V heads is unnecessary bloat and prevents AD from optimizing the attention path.

Phase 4 — Register

Bottom of model file: AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", ForCausalLM).
Add import + __all__ entry in models/custom/__init__.py.
Prefer reusing the existing config class — if the config can be loaded via AutoConfig.from_pretrained(model_id) (either from the installed transformers or from files in the HF cache downloaded in Phase 0), import it from transformers and use it directly. Do NOT recreate or copy the config class into the modeling file when it is already available. Note: AD's factory already calls AutoConfig.from_pretrained(model_id, trust_remote_code=True) and passes the result to your model, so you rarely need to import the config at all — if you find yourself doing so, sanity-check that it's genuinely needed.
Only if the config is truly not available (not in transformers and not bundled with the checkpoint), define a minimal config class in the modeling file and AutoConfig.register(model_type, ConfigCls, exist_ok=True). A good sanity check: if the E2E test passes without a custom config class, you don't need one — AutoConfig.from_pretrained already picked up the right class.

Phase 5 — Model Input Contract

The custom model's forward signature must follow these rules:

Always input_ids — The top-level model always receives input_ids. A submodule graph may internally receive inputs_embeds (e.g., after the embedding layer), but the exported entry point takes token IDs.
Always position_ids — Vanilla sequential position_ids are always provided. Assert position_ids is not None at the top of the forward method — it is a required input, never optional. Do not include fallback logic to generate position_ids from input_ids (HF models often do this; strip it). If the model uses a non-standard RoPE variant or custom position encoding, the model must compute it internally on top of the provided vanilla position_ids.
Multi-modal inputs — If the model supports vision/audio/etc., those additional inputs are passed during prefill alongside input_ids.
No attention mask, no cache inputs, no HF-runtime features — Do not accept attention_mask, past_key_values, use_cache, or similar HF-runtime arguments. AD manages masking and caching via its own transforms and runtime.
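The rules above can be checked mechanically against a forward signature. The helper below is a hypothetical lint sketch using only inspect; it looks at named parameters, so it cannot tell what a **kwargs catch-all might silently accept. The two example functions are stand-ins, not real model code.

```python
import inspect
from typing import Callable, List

REQUIRED = ("input_ids", "position_ids")
FORBIDDEN = ("attention_mask", "past_key_values", "use_cache")


def check_forward_contract(forward: Callable) -> List[str]:
    """List violations of the input contract found in a forward signature."""
    params = inspect.signature(forward).parameters
    problems = [f"missing required argument: {name}" for name in REQUIRED if name not in params]
    problems += [f"forbidden HF-runtime argument: {name}" for name in FORBIDDEN if name in params]
    return problems


def lean_forward(self, input_ids, position_ids, inputs_embeds=None, **kwargs):
    # position_ids is a required input, so fail loudly instead of regenerating it.
    assert position_ids is not None, "position_ids is required"
    return input_ids


def hf_style_forward(self, input_ids=None, attention_mask=None, past_key_values=None, use_cache=None):
    return input_ids


print(check_forward_contract(lean_forward))
print(check_forward_contract(hf_style_forward))
```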

Phase 6 — Hierarchical Tests

Create tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_{name}_modeling.py. Use test_glm4_moe_lite_modeling.py as template. No smoke tests. Small config (hidden=64, layers=2-3, vocab=1000). Use pytest.skip if HF class unavailable.

HF Reference Strategy: Equivalence tests compare our custom implementation against the HF reference with identical weights and inputs. Use actual HF classes if they exist — prefer importing directly over standalone HF-like implementations for unit tests. Standalone "reference" implementations are effectively alternative AD IR models and defeat the purpose of the reference test; they also tend to silently agree with whatever bugs exist in the custom model.

If HF modules exist in the installed transformers: import them directly (e.g., from transformers.models.deepseek_v3.modeling_deepseek_v3 import DeepseekV3ForCausalLM). Wrap imports in _get_hf_*_class() try/except helpers that return None on ImportError, and use pytest.skip when None.
If HF modules are NOT in the installed transformers: copy the minimal module definitions from the HF modeling_*.py source into the test file as standalone reference classes. This keeps tests self-contained without requiring a specific transformers version or HF cache at test time. Important: make sure the copy is minimal and strictly faithful to the HF implementation only. Do NOT tweak the functionality of the reference. The same applies to config classes that use trust_remote_code (i.e., not available in transformers): copy a minimal faithful version into the test file. The modeling file should NOT import the config class — AD loads it at runtime via AutoConfig.from_pretrained(..., trust_remote_code=True). The test-only config copy lets you verify config-wrapping behavior (e.g., structure of state_dict).
Weight conversion helpers: Write test-only helpers for any weight format differences between HF and custom (e.g., RoPE de-interleaving, stacked-to-per-expert MoE weights, gate weight key remapping). For full-model tests, prefer using load_state_dict pre-hooks already registered on the custom model.

Numerical comparison: For equivalence tests comparing custom ops against HF reference, use the shared assert_rmse_close utility from _model_test_utils:

from _model_test_utils import assert_rmse_close

This computes rmse(actual - expected) / rmse(expected) — more robust than per-element torch.testing.assert_close since a few outlier elements won't fail the test. Use torch.testing.assert_close only for blocks with identical math (e.g., plain MLP with no custom ops).

Recommended rmse_ratio_tol values for bfloat16:

Identical math (MLP, Norm): use torch.testing.assert_close with tight rtol/atol (1e-3)
MoE block (fused routing): 0.02
Decoder layer / MoE layer / full model: 0.05
Attention: 0.10
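To see why the ratio is more forgiving than a per-element check, here is the stated formula written out in plain Python. This is not the implementation of assert_rmse_close (its signature is not shown here); rmse and rmse_ratio are hypothetical names, and the data is synthetic.

```python
import math
from typing import Sequence


def rmse(values: Sequence[float]) -> float:
    """Root mean square of a sequence of values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def rmse_ratio(actual: Sequence[float], expected: Sequence[float]) -> float:
    """rmse(actual - expected) / rmse(expected), as described above."""
    diff = [a - e for a, e in zip(actual, expected)]
    return rmse(diff) / rmse(expected)


# Synthetic data: 1000 matching elements except for a single outlier.
expected = [1.0] * 1000
actual = list(expected)
actual[0] = 1.5

ratio = rmse_ratio(actual, expected)
max_abs_error = max(abs(a - e) for a, e in zip(actual, expected))

print(ratio)          # small, because one outlier is averaged over 1000 elements
print(max_abs_error)  # large, so a per-element tolerance of 1e-3 would fail
```

With these numbers the ratio stays below the 0.02 tolerance listed for MoE blocks, while the worst single element is off by 0.5.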

Bottom-up levels (each must pass before next):

Block equivalence — Test MLP, Attention, MoE, Norm individually: same weights + same input → assert_rmse_close (or torch.testing.assert_close for identical-math blocks).
Layer equivalence — Full decoder layer. If model has heterogeneous layers (dense vs MoE, attention vs SSM), test each type separately.
Full model equivalence — End-to-end logits comparison. Use a small config with <10 layers that covers the essence of the architecture (e.g., at least one of each layer type).
Export test — torch_export_to_gm with Dim.DYNAMIC for batch+seq, verify finite output, test a second shape.

Phase 7 — Independent Review (MANDATORY)

Invoke the ad-onboard-reviewer subagent with ONLY the following information:

Model name
Path to the model file created
Path to the test file created

Do NOT include your own assessment of correctness. Do NOT summarize what you did. Let the reviewer read the files and judge independently.

If the reviewer returns FAIL on any item:

Read the reviewer's specific failure reasons and file:line references
Fix each failed item
Invoke the reviewer again with the same minimal inputs
Repeat until you get a full PASS

Do NOT proceed to Phase 8 until the reviewer returns PASS.

Phase 8 — Create or Update Model Registry Entries (Including Family)

Before running the model end-to-end, ensure it and all identified family members from Phase 1 have valid entries in the AutoDeploy model registry at examples/auto_deploy/model_registry/.

For each model (the requested model + any family members identified in Phase 1 Step 2):

Check examples/auto_deploy/model_registry/models.yaml for an existing entry matching the model's HF id.
If the entry is missing, add it with the appropriate yaml_extra list:
- Always include dashboard_default.yaml first.
- Pick world_size_N.yaml based on model size (1 for <2B, 2 for 2-15B, 4 for 20-80B, 8 for 80B+). The world_size determines how many GPUs are needed for the run.
- Add model-specific YAML if the model needs custom settings (e.g., model_kwargs, non-default transforms).
If a model-specific config YAML is needed and doesn't exist, create it under examples/auto_deploy/model_registry/configs/. See existing configs for format examples.
If the entry exists but needs changes (e.g., wrong world_size, missing model-specific config), update it.
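The world_size guideline can be written down as a small helper. This is a sketch with hypothetical function names. Two points are assumptions: the guideline does not cover 15-20B, so the sketch raises instead of guessing, and a model of exactly 80B is treated as 80B+.

```python
from typing import List, Optional


def pick_world_size(params_in_billions: float) -> int:
    """Map model size to world_size: 1 for <2B, 2 for 2-15B, 4 for 20-80B, 8 for 80B+."""
    if params_in_billions < 2:
        return 1
    if params_in_billions <= 15:
        return 2
    if params_in_billions < 20:
        # Not covered by the guideline; leave the decision to a human.
        raise ValueError("15-20B is not covered by the guideline; choose world_size manually")
    if params_in_billions < 80:
        return 4
    return 8  # assumption: exactly 80B counts as 80B+


def yaml_extra_for(params_in_billions: float, model_specific_yaml: Optional[str] = None) -> List[str]:
    """Build the yaml_extra list: dashboard_default.yaml first, then world size, then extras."""
    entries = ["dashboard_default.yaml", f"world_size_{pick_world_size(params_in_billions)}.yaml"]
    if model_specific_yaml:
        entries.append(model_specific_yaml)
    return entries


print(yaml_extra_for(8))
print(yaml_extra_for(235, "my_model.yaml"))  # "my_model.yaml" is a placeholder name
```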

Family members that share the same architecture should all use the same modeling code. Different sizes only need different world_size_N.yaml entries and maybe different sharding configurations.

See examples/auto_deploy/model_registry/README.md for full documentation on the registry format and best practices.

Phase 9 — AutoDeploy End-to-End Run

⚠️ MANDATORY: You MUST use the standalone config YAML with `--args.yaml-extra` ⚠️

You MUST run the model using the standalone config YAML created in Phase 8. The same YAML will be referenced by the cookbook's trtllm-serve command in Phase 11. The command is:

CUDA_VISIBLE_DEVICES=<SELECTED_GPUS> python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml

The standalone config YAML under examples/auto_deploy/model_registry/configs/ is self-contained — it includes all settings needed for running the model (compile backend, batch size, seq len, transforms, world_size, etc.). This is the same YAML that trtllm-serve --extra_llm_api_options will use in the cookbook, so validating it here ensures the cookbook works out of the box.

If the run FAILS:

Fix the standalone config YAML — update settings in examples/auto_deploy/model_registry/configs/<model>.yaml and re-run.
The standalone config YAML is the source of truth. If it is wrong, fix it. If it is missing settings, add them. The model MUST work via this YAML before you are done.

Invoke the ad-run-agent subagent to run the model through AutoDeploy on GPU. Pass it:

Model HF ID: the HuggingFace model-id (or local checkpoint path) used throughout onboarding
Standalone config YAML path: the path to the config YAML under examples/auto_deploy/model_registry/configs/
Description: a short description of the current state, e.g.:
- "first try after onboarding"
- "updated yaml with reduced layers"
- "changed attention backend to torch_mha"
- "fixed weight loading hooks"

Run in two steps:

Step 1: Reduced num layers. Run with reduced num layers to test the e2e flow for issues and iterate faster. The generation will be bad in step 1 because we are not loading all layers.

Step 2: Full layers. Run with full num layers. The generation should be coherent in step 2.

The model is run via:

CUDA_VISIBLE_DEVICES=<SELECTED_GPUS> python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml

The ad-run-agent will determine the required world_size from the config YAML, check GPU availability via nvidia-smi, select free GPUs, and wait if not enough are available.

The ad-run-agent will build+run the model, check generation quality, archive logs, and update its worklog.

If the run fails or produces bad generation:

Read the ad-run-agent's worklog and log file to understand the error
Fix the issue (model code, standalone config YAML, weight hooks, etc.)
Re-invoke the ad-run-agent with an updated description reflecting the change (e.g., "retry after fixing RoPE scaling in config")
Always re-run with --args.yaml-extra. Fix the standalone config YAML, don't work around it.
Repeat until the run succeeds with meaningful generation

Do NOT proceed to Phase 10 until Step 2 (full layers) reports a successful run with coherent generation.

Important: The successful E2E run outputs (prompts and generated text) will be needed for the cookbook notebook in Phase 11 and the summary report in Phase 12. Save them.

Phase 10 — Update Model Support Matrix

After a successful E2E run, update the TensorRT-LLM model support matrix at docs/source/models/supported-models.md to include the newly onboarded model.

Read the current support matrix to understand the format and existing entries.
Add a row to the "Supported Models" table (the first table in the file) with:
- Architecture: The model's architecture class name (e.g., MiniMaxM2ForCausalLM) — use the class name registered in Phase 4.
- Model: The model family/display name (e.g., MiniMax M2/M2.1/M2.7).
- HuggingFace Example: A representative HF model ID (e.g., MiniMaxAI/MiniMax-M2.7).
- Place the new row alphabetically by architecture class name to keep the table sorted.
If the model is AutoDeploy-only (i.e., it does NOT have native PyTorch backend support in tensorrt_llm/_torch/models/), add a footnote indicating AutoDeploy support with a link to the AD config YAML, following the pattern of existing AD-only models (e.g., [^N]: Supported via the [AutoDeploy](../features/auto_deploy/auto-deploy.md) backend. See [AD config](../../../examples/auto_deploy/model_registry/configs/<model>.yaml).).
If the model warrants an entry in the Model-Feature Support Matrix (second table — typically for key/flagship models), add a row there too. For newly onboarded AD models, most advanced features should be marked Untested unless you have verified them. Use existing AD model entries (e.g., Glm4MoeLiteForCausalLM) as a reference for which features to mark as supported vs untested.
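Keeping the table sorted is easy to get wrong by hand, so the placement rule can be sketched as code. The sketch assumes pipe-delimited markdown rows with the architecture in the first cell and a case-insensitive sort order; both are assumptions about the file, and the function names and the non-MiniMax rows are placeholders.

```python
from typing import List


def architecture_of(row: str) -> str:
    """First cell of a pipe-delimited row, with backticks and whitespace removed."""
    return row.strip().strip("|").split("|")[0].strip().strip("`")


def insert_row_sorted(rows: List[str], new_row: str) -> List[str]:
    """Insert new_row before the first row whose architecture sorts after it."""
    key = architecture_of(new_row).lower()
    for index, row in enumerate(rows):
        if architecture_of(row).lower() > key:
            return rows[:index] + [new_row] + rows[index:]
    return rows + [new_row]


existing_rows = [
    "| `Glm4MoeLiteForCausalLM` | (model) | (hf id) |",
    "| `ZzzPlaceholderForCausalLM` | (model) | (hf id) |",
]
new_row = "| `MiniMaxM2ForCausalLM` | MiniMax M2/M2.1/M2.7 | `MiniMaxAI/MiniMax-M2.7` |"

updated_rows = insert_row_sorted(existing_rows, new_row)
print([architecture_of(r) for r in updated_rows])
```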

Phase 11 — Create AutoDeploy Cookbook

Create an AutoDeploy cookbook notebook for the model, following the pattern of existing cookbooks.

Use examples/auto_deploy/cookbooks/glm_4.7_flash_trtllm_cookbook.ipynb as the template. Copy its structure exactly.
Create the new notebook at examples/auto_deploy/cookbooks/{model_name}_trtllm_cookbook.ipynb, using a snake_case version of the model name (e.g., minimax_m2.7_trtllm_cookbook.ipynb).
Adapt all model-specific content:
- Title and description: update the model name, HF model ID, and description.
- Model Resources: update links to the model's HuggingFace card, blog posts, technical reports, API platform, and community links. Search the web or the model's HF card for relevant URLs.
- Model Highlights: update architecture details (e.g., MoE params, context length, special features like tool calling, interleaved thinking, etc.) from the model card.
- Prerequisites: update VRAM requirements based on model size and precision.
- trtllm-serve command: update the model ID and use --extra_llm_api_options pointing to the standalone AD config YAML under examples/auto_deploy/model_registry/configs/ (e.g., examples/auto_deploy/model_registry/configs/glm-4.7-flash.yaml). This is the same standalone config YAML validated in Phase 9 via build_and_run_ad.py --args.yaml-extra. It is self-contained — it includes all the settings trtllm-serve needs (compile backend, batch size, seq len, transforms, etc.).
- OpenAI client MODEL_ID: update to the correct HF model ID.
- Evaluation Parameters: update recommended inference parameters from the model's documentation/model card.
- Additional Resources: update all links to be model-specific.
Do NOT include cell outputs in the committed notebook — the notebook should be clean with no pre-run outputs, so users run it themselves. (Exception: if the model was already run and outputs were captured during Phase 9, you may include them for reference, but this is optional.)
Verify the notebook is valid JSON — malformed .ipynb files will not render on GitHub or in Jupyter.
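Both notebook checks (valid JSON, no stored outputs) can be run with the standard library. The sketch below assumes the usual notebook layout with a top-level cells list whose code cells carry outputs and execution_count; notebook_problems is a hypothetical helper name. If outputs from Phase 9 were kept on purpose, the output findings can be ignored.

```python
import json
from typing import List


def notebook_problems(notebook_text: str) -> List[str]:
    """Report invalid JSON, a missing cells list, or code cells with stored outputs."""
    try:
        nb = json.loads(notebook_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(nb, dict) or not isinstance(nb.get("cells"), list):
        return ["top-level 'cells' list is missing"]
    problems = []
    for index, cell in enumerate(nb["cells"]):
        if cell.get("cell_type") != "code":
            continue
        if cell.get("outputs"):
            problems.append(f"cell {index} has stored outputs")
        if cell.get("execution_count") is not None:
            problems.append(f"cell {index} has an execution_count")
    return problems


clean = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"], "metadata": {}},
        {"cell_type": "code", "source": ["print('hi')"], "metadata": {},
         "outputs": [], "execution_count": None},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
})
dirty = clean.replace('"outputs": []', '"outputs": [{"output_type": "stream"}]')

print(notebook_problems(clean))
print(notebook_problems(dirty))
print(notebook_problems("{not json"))
```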

Phase 12 — Summary Report

⚠️ MANDATORY: You MUST include ALL raw prompts and generated outputs from the final `build_and_run_ad.py --args.yaml-extra` run ⚠️

Print the following to the console (do not write it to a file) after completion:

Model overview + unique features
Tricky parts needing human review
Files created/modified (including any new registry configs)
Test results table (name | validates | PASS/FAIL)
Known limitations
Reviewer result (PASS + how many review iterations it took)
AD end-to-end run result (success/fail, number of iterations, final generation quality)
Registry entry added/updated in models.yaml and any new config YAMLs created
ALL raw prompts and their corresponding generated outputs from the final successful build_and_run_ad.py --args.yaml-extra run. Copy-paste the COMPLETE prompt→output pairs verbatim from the run log. Do NOT summarize, truncate, or paraphrase them. The user needs to see exactly what the model generated to judge quality.
Model support matrix update — confirm the row was added to docs/source/models/supported-models.md and which footnote (if any) was used.
AutoDeploy cookbook created — path to the new notebook file (examples/auto_deploy/cookbooks/<model>_trtllm_cookbook.ipynb).

Phase 13 — Prepare a Pull Request

GitHub CLI config: Before running any gh command, confirm which GH_CONFIG_DIR to use. The default is ~/.config/gh, but a different directory may be needed when targeting a fork (e.g., nv-auto-deploy/TensorRT-LLM vs NVIDIA/TensorRT-LLM). Check if the user has specified a custom GH_CONFIG_DIR (e.g., in CLAUDE.local.md or environment). If not, ask the user before proceeding. Prefix all gh commands with: GH_CONFIG_DIR=<path> gh ...

Prepare a pull request against upstream (https://github.com/NVIDIA/TensorRT-LLM) targeting branch main. Then, ask the user to provide feedback on the PR and wait for the user to get back to you when the feedback has been posted. Then continue iterating according to the user's feedback. For any comment or other post, please prepend your message with "[AGENT]" so that it is clear that this was a coding agent posting the comment. When you post a PR, you MUST include:

ALL raw prompts and their complete generated outputs from the final successful build_and_run_ad.py --args.yaml-extra run. Copy-paste the COMPLETE prompt→output pairs verbatim — do NOT summarize, truncate, or paraphrase. The reviewer needs to see exactly what the model generated.
A reproducible command:

python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml

A detailed pytest command for the unit tests you added so they can be run by the reviewer as well. Make sure you have run this pytest command on the latest commit that you are pushing, and include these results in the PR.

⚠️ MANDATORY: Re-run and re-post logs on EVERY PR update — NO EXCEPTIONS ⚠️

Every single time you push changes to the PR — whether it is a new commit, a rebase, an amendment, a fixup, or any other update — you MUST:

Re-run build_and_run_ad.py --args.yaml-extra using the ad-run-agent subagent, exactly as in Phase 9. The code has changed, so previous run results are stale and invalid.
Re-run the full unit test suite (pytest <test_file> -v) for the model's test file created in Phase 6. Previous test results are stale and invalid after any code change.
Post ALL raw output from both runs as a PR comment:
- The COMPLETE prompt→output pairs from build_and_run_ad.py verbatim — do NOT summarize, truncate, or paraphrase.
- The COMPLETE pytest output verbatim — every test name, every PASSED/FAILED line, every error traceback if any. Do NOT summarize or cherry-pick.

This is not optional. There are no exceptions. Even if the change seems trivial (a typo fix, a comment edit, a formatting change), both runs must be re-executed and the full raw logs must be posted. The reviewer cannot verify correctness without seeing generation output AND test results from the exact code that is currently on the branch.

Workflow for every PR update cycle:

Make the requested code changes
Commit the changes
Before pushing, always rebase onto the target branch to check for conflicts: git fetch upstream && git rebase upstream/main. If there are conflicts, resolve them before proceeding. Do NOT push without rebasing first — the branch must be up-to-date with the target branch.
Push (or force-push if rebase rewrote history)
Re-invoke the ad-run-agent to run build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml on the updated code
Re-run the unit tests: pytest <test_file> -v
Wait for both runs to complete
Post a reply to every PR comment containing:
- A brief description of what changed in this update
- The COMPLETE raw prompts and generated outputs from the build_and_run_ad.py run
- The COMPLETE raw pytest output (full verbatim log)
- The reproducible commands used for both runs
Resume polling for new comments (see below)

⚠️ MANDATORY: Poll PR for new comments every 5 minutes ⚠️

After opening the PR and after every PR update you post, you MUST set up a polling loop that checks for new PR comments every 5 minutes. Do not simply post and walk away — actively monitor the PR for reviewer feedback.

How to poll:

# Fetch all PR comments, sorted newest-first, and check for any posted after your last comment
GH_CONFIG_DIR=<path> gh api "repos/<owner>/<repo>/pulls/<PR_NUMBER>/comments?sort=created&direction=desc&per_page=10"
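# Also check issue-level comments (top-level PR comments, not inline review comments).
# This endpoint lists comments oldest-first and takes no sort/direction, so filter with since=<ISO 8601 timestamp of your last comment>.
GH_CONFIG_DIR=<path> gh api "repos/<owner>/<repo>/issues/<PR_NUMBER>/comments?since=<LAST_COMMENT_TIMESTAMP>&per_page=100"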
# Also check the PR's review status
GH_CONFIG_DIR=<path> gh pr view <PR_NUMBER> --json reviews,state

Polling loop behavior:

After posting your PR (or posting an update comment), immediately start polling every 5 minutes.
On each poll, check for:
- New review comments (inline or top-level) posted after your last comment's timestamp
- PR approval status — check if the PR has been approved
- Termination signals — any comment clearly indicating the agent's work is done (e.g., "LGTM", "looks good, we're done", "no more changes needed", "agent work complete", or similar)
If new actionable comments are found: stop polling, process the feedback, and execute the full PR update cycle (steps 1–8 above). After posting the update, resume polling.
If the PR is approved or a termination signal is found: stop polling, report to the user that the PR review cycle is complete, and end.
If no new comments are found: sleep 5 minutes and poll again.

Do NOT stop polling prematurely. The loop must continue until the PR is approved or a clear termination signal is received. If polling has been running for an extended period (e.g., >2 hours) with no new activity, inform the user that you are still monitoring and ask if they want you to continue or stop.
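The per-poll decision can be sketched as a pure function, separate from the gh calls and the sleep. Everything named here is hypothetical (Comment, next_action, the phrase list). The sketch assumes timestamps are ISO 8601 UTC strings of the same format, so string comparison orders them correctly, and it lets a termination signal take priority over other new comments. The substring match on phrases is deliberately naive.

```python
from dataclasses import dataclass
from typing import List

TERMINATION_PHRASES = (
    "lgtm",
    "looks good, we're done",
    "no more changes needed",
    "agent work complete",
)


@dataclass
class Comment:
    created_at: str  # ISO 8601 UTC timestamp, e.g. "2026-01-01T12:00:00Z"
    body: str


def next_action(comments: List[Comment], last_agent_comment_at: str, approved: bool) -> str:
    """Return "stop", "update", or "sleep" for one polling iteration."""
    new = [
        c for c in comments
        if c.created_at > last_agent_comment_at and not c.body.startswith("[AGENT]")
    ]
    terminated = any(phrase in c.body.lower() for c in new for phrase in TERMINATION_PHRASES)
    if approved or terminated:
        return "stop"    # review cycle complete; report to the user
    if new:
        return "update"  # process feedback and run the full PR update cycle
    return "sleep"       # nothing new; wait 5 minutes and poll again


last_posted = "2026-01-01T12:00:00Z"
agent_post = Comment(last_posted, "[AGENT] update posted")
feedback = Comment("2026-01-01T12:10:00Z", "Please rename the test file")
signoff = Comment("2026-01-01T12:20:00Z", "LGTM")

print(next_action([agent_post], last_posted, approved=False))
print(next_action([agent_post, feedback], last_posted, approved=False))
print(next_action([agent_post, feedback, signoff], last_posted, approved=False))
```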

Sharding-aware IR model porting

For porting an existing custom model to a sharding-aware _ir.py variant, see the ad-sharding-ir-port skill.

Key Gotchas

Canonical ops first: Always use torch.ops.auto_deploy.torch_* canonical ops whenever one exists for the operation. This is how AD knows what to optimize. Writing manual attention, MoE, RoPE, or normalization in plain PyTorch instead of using the canonical op will prevent AD transforms from working.
No repeat_interleave: AD attention ops handle GQA natively. Never repeat K/V heads manually.
Lean code: Every line should serve prefill export. No optional HF features, no dead code paths, no fallback logic.
Reuse config classes: Import from transformers or load from checkpoint whenever possible. Only bundle a config class if it truly doesn't exist anywhere.
Assert position_ids: Always assert position_ids is not None — it is a required input, never optional.
Self-contained files only: Never import from other AD custom models. Each modeling_{name}.py is a standalone translation from HF source.
RoPE cos/sin: slice ONCE, not per layer. _ad_ prefix for RoPE buffers. RotaryEmbedding.forward(x, position_ids) MUST slice by position_ids once and return pre-sliced (cos, sin). Pass those tensors to all layers. NEVER pass position_ids through to each layer/attention forward to re-index — that is redundant compute that bloats the exported graph. See Phase 2 for the full pattern.
MoE weights: use nn.ModuleList per-expert for checkpoint compatibility. Write test-only state_dict converters for HF stacked format.
noaux_tc routers (DeepSeek-V3 style): use vanilla PyTorch (sigmoid + bias + group topk + normalize + scale). AD transforms can replace with fused trtllm kernels at deployment time.
Vision towers are typically not exported. Keep vision logic in eager PyTorch and export only the text path unless explicitly requested otherwise.
Model code and tests must run on CPU. Use only torch_* prefixed reference ops in AutoDeploy — never triton_*, flashinfer_*, or trtllm_*.

name	ad-model-onboard
description	Translates a HuggingFace model into a prefill-only AutoDeploy custom model using reference custom ops, validates with hierarchical equivalence tests.
license	Apache-2.0
metadata	{"author":"NVIDIA Corporation"}

AutoDeploy Model Onboarding

Input: HuggingFace model ID. Output: prefill-only custom model file + hierarchical tests + summary report.

Phase 0 — Gather All Resources Upfront

Web/GitHub fetches require user approval and the user may leave. Do ALL network access now and save locally before proceeding.

Step 0 — GPU memory sanity check

Before anything else, check whether the model can fit on the current system.

Run nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits. It prints one value per GPU (in MiB); sum the values to get the total VRAM across all GPUs on the system.
Estimate the model's memory footprint from the HuggingFace model card or config (number of parameters × bytes per parameter, e.g. 7B params × 2 bytes = ~14 GB for bfloat16).
If the estimated size exceeds total system VRAM, stop and report this to the user. Do not proceed with onboarding until the user acknowledges and decides how to proceed. Example message: "This model requires ~X GB but the system only has Y GB across N GPUs. Onboarding is likely to fail at the e2e run stage."
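
The check above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the estimate covers weights only (no activations or runtime overhead), and the sample nvidia-smi output is made up.

```python
def total_vram_mib(nvidia_smi_output: str) -> int:
    """Sum the per-GPU memory.total values (MiB), one value per line."""
    return sum(int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip())


def estimated_footprint_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only estimate: parameters x bytes per parameter, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits(num_params: float, nvidia_smi_output: str, bytes_per_param: int = 2) -> bool:
    """True if the weights-only estimate does not exceed total VRAM."""
    total_gb = total_vram_mib(nvidia_smi_output) * 1024**2 / 1e9
    return estimated_footprint_gb(num_params, bytes_per_param) <= total_gb


# Illustrative output of the nvidia-smi query on a two-GPU system.
sample = "81559\n81559\n"
print(total_vram_mib(sample))       # 163118
print(estimated_footprint_gb(7e9))  # 14.0
print(fits(7e9, sample))            # True
```

A result of False corresponds to the "stop and report" case described above.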

Step 1 — Check local transformers install first:

python -c "import transformers; print(transformers.__file__)"

Look for models/{model_type}/modeling_*.py under that path. If found, use it directly — no network needed.

Step 2 — If not found, download the HF repo (code only, skip weights):

huggingface-cli download {org}/{model} --exclude "*.safetensors" "*.bin" "*.pt" "*.gguf"

Phase 1 — Survey Existing Coverage & Analyze HF Model

Step 1 — Check for existing AD custom modeling code

Before writing anything, check if an AD custom model already covers this architecture:

Read the model's config.json to find its model_type and architectures fields.
Search tensorrt_llm/_torch/auto_deploy/models/custom/ for existing modeling_*.py files that register the same config class name (grep for the architectures value or model_type).
Also check tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py for existing registrations.
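
As a rough sketch of the search described above, the following Python scans a directory of modeling_*.py files for the model_type and architectures strings from config.json. The helper name is hypothetical, plain substring matching can produce false positives for short model_type values, and the demo runs on a throwaway directory with made-up files rather than on the real repository.

```python
import json
import tempfile
from pathlib import Path


def find_existing_coverage(config_path, custom_dir):
    """Return {file name: matched strings} for modeling_*.py files that mention
    the config's model_type or any of its architectures."""
    config = json.loads(Path(config_path).read_text())
    needles = [config.get("model_type", "")] + list(config.get("architectures") or [])
    needles = [n for n in needles if n]
    hits = {}
    for path in sorted(Path(custom_dir).glob("modeling_*.py")):
        text = path.read_text(errors="ignore")
        found = [n for n in needles if n in text]
        if found:
            hits[path.name] = found
    return hits


# Demo on a temporary directory; file names and contents are invented.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "config.json").write_text(
        json.dumps({"model_type": "foo_moe", "architectures": ["FooMoeForCausalLM"]})
    )
    custom = root / "custom"
    custom.mkdir()
    (custom / "modeling_foo.py").write_text("# handles model_type foo_moe\n")
    (custom / "modeling_bar.py").write_text("class BarModel: ...\n")
    hits = find_existing_coverage(root / "config.json", custom)

print(hits)  # {'modeling_foo.py': ['foo_moe']}
```

In the real workflow, custom_dir would be tensorrt_llm/_torch/auto_deploy/models/custom/ and any hit still needs to be read by hand.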

If existing code is found:

Read it carefully. It may already handle this exact model — in which case no new modeling file is needed, only registry entries and possibly tests.
If the existing code covers a closely related model in the same family but needs adaptation (e.g., the family added MoE in a newer variant, or changed the attention type), decide whether to extend the existing file or create a new one. Prefer extending if the changes are minor; create a new file if the architecture diverges significantly. Report the decision and rationale to the user before proceeding.

If no existing code is found: proceed to write a new model file in Phase 2.

Step 2 — Survey the model family in the registry

Identify which family members already have registry entries and which are missing.
Identify which family members share the same architecture (same model_type / architectures in their config) — these can all use a single modeling file.
Plan to onboard the entire family cohesively: one modeling file + one test file should cover all members that share an architecture. The registry should have entries for all commonly-used sizes.
Report the family survey findings to the user: which models exist, which are missing, and the proposed plan for covering them all.

Step 3 — Analyze HF model architecture

Phase 2 — Write a Lean Prefill-Only Model

Create tensorrt_llm/_torch/auto_deploy/models/custom/modeling_{name}.py. Use modeling_glm4_moe_lite.py as a structural template only (class layout, dataclass outputs, forward signature).

Keep: PreTrainedModel hierarchy, ModelOutput dataclass, minimal forward (input_ids, position_ids, inputs_embeds=None, **kwargs).

Critical: Make sure the nn.Module hierarchy of the custom modeling code produces the same parameter names that the checkpoint's safetensors index JSON expects.
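
One way to check the hierarchy is to compare parameter names against the checkpoint index. The sketch below assumes a sharded checkpoint whose model.safetensors.index.json has a weight_map object keyed by parameter name; the helper name and the sample keys are hypothetical. Keys that a load_state_dict pre-hook remaps on purpose will also show up as differences.

```python
import json


def compare_with_checkpoint(model_keys, index_json_text):
    """Compare model parameter names with the names in a safetensors index.

    Returns (missing_in_model, unexpected_in_model) as sorted lists.
    """
    weight_map = json.loads(index_json_text)["weight_map"]
    ckpt_keys = set(weight_map)
    model_keys = set(model_keys)
    return sorted(ckpt_keys - model_keys), sorted(model_keys - ckpt_keys)


# Made-up index content for illustration.
index_text = json.dumps({
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    }
})
# In practice these names would come from custom_model.state_dict().keys().
model_keys = [
    "model.embed_tokens.weight",
    "model.layers.0.attn.q_proj.weight",  # renamed submodule: would not load
    "lm_head.weight",
]
missing, unexpected = compare_with_checkpoint(model_keys, index_text)
print(missing)     # ['model.layers.0.self_attn.q_proj.weight']
print(unexpected)  # ['model.layers.0.attn.q_proj.weight']
```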

Phase 3 — Use AutoDeploy Canonical Ops (CRITICAL)

Available canonical ops (see tensorrt_llm/_torch/auto_deploy/custom_ops/README.md for full list):

Attention: torch_attention, torch_attention_sdpa, torch_attention_repeat_kv
MLA: torch_mla
RoPE: torch_rope_with_explicit_cos_sin, torch_rope_with_complex_freqs, torch_rope_with_qk_interleaving
MoE: torch_moe, torch_moe_fused, torch_moe_router, torch_moe_dense_mlp
Normalization: torch_rmsnorm, torch_rmsnorm_gated, torch_l2norm
Linear: torch_linear_simple
SSM/Mamba: torch_ssm, torch_causal_conv1d
FLA: torch_gated_delta_rule
Quantization: torch_quant_fp8_linear, torch_quant_nvfp4_linear, etc.
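
Since every canonical op in this list carries the torch_ prefix, a simple text scan can flag non-reference ops in a modeling file. The following is a minimal sketch, not part of AutoDeploy: the helper name is hypothetical, the op name triton_rmsnorm in the sample is invented for illustration, and the argument lists are elided on purpose because op signatures are not covered here.

```python
import re

# Matches references such as torch.ops.auto_deploy.torch_attention
_AD_OP = re.compile(r"torch\.ops\.auto_deploy\.([A-Za-z_][A-Za-z0-9_]*)")


def non_reference_ops(source: str):
    """Return auto_deploy op names in source that lack the torch_ prefix."""
    return sorted({name for name in _AD_OP.findall(source) if not name.startswith("torch_")})


# Sample source text; the second op name is hypothetical.
snippet = """
y = torch.ops.auto_deploy.torch_rmsnorm(...)
z = torch.ops.auto_deploy.triton_rmsnorm(...)
"""
print(non_reference_ops(snippet))  # ['triton_rmsnorm']
```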

Phase 4 — Register

Bottom of model file: AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", ForCausalLM).
Add import + __all__ entry in models/custom/__init__.py.
Prefer reusing the existing config class — if the config can be loaded via AutoConfig.from_pretrained(model_id) (either from the installed transformers or from files in the HF cache downloaded in Phase 0), import it from transformers and use it directly. Do NOT recreate or copy the config class into the modeling file when it is already available. Note: AD's factory already calls AutoConfig.from_pretrained(model_id, trust_remote_code=True) and passes the result to your model, so you rarely need to import the config at all — if you find yourself doing so, sanity-check that it's genuinely needed.
Only if the config is truly not available (not in transformers and not bundled with the checkpoint), define a minimal config class in the modeling file and AutoConfig.register(model_type, ConfigCls, exist_ok=True). A good sanity check: if the E2E test passes without a custom config class, you don't need one — AutoConfig.from_pretrained already picked up the right class.

Phase 5 — Model Input Contract

The custom model's forward signature must follow these rules:

Always input_ids — The top-level model always receives input_ids. A submodule graph may internally receive inputs_embeds (e.g., after the embedding layer), but the exported entry point takes token IDs.
Always position_ids — Vanilla sequential position_ids are always provided. Assert position_ids is not None at the top of the forward method — it is a required input, never optional. Do not include fallback logic to generate position_ids from input_ids (HF models often do this; strip it). If the model uses a non-standard RoPE variant or custom position encoding, the model must compute it internally on top of the provided vanilla position_ids.
Multi-modal inputs — If the model supports vision/audio/etc., those additional inputs are passed during prefill alongside input_ids.
No attention mask, no cache inputs, no HF-runtime features — Do not accept attention_mask, past_key_values, use_cache, or similar HF-runtime arguments. AD manages masking and caching via its own transforms and runtime.
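
The contract above can be checked mechanically against a forward signature. The sketch below uses plain Python classes as stand-ins for nn.Module subclasses; check_forward_signature, GoodModel, and BadModel are hypothetical names introduced only for this illustration.

```python
import inspect

REQUIRED = ("input_ids", "position_ids")
FORBIDDEN = ("attention_mask", "past_key_values", "use_cache")


def check_forward_signature(forward):
    """Return a list of contract violations for a forward callable."""
    params = inspect.signature(forward).parameters
    problems = [f"missing required argument: {name}" for name in REQUIRED if name not in params]
    problems += [f"forbidden argument: {name}" for name in FORBIDDEN if name in params]
    return problems


class GoodModel:
    def forward(self, input_ids, position_ids, inputs_embeds=None, **kwargs):
        # position_ids is required; no fallback that derives it from input_ids.
        assert position_ids is not None, "position_ids is required"
        return input_ids


class BadModel:
    def forward(self, input_ids, attention_mask=None, past_key_values=None):
        return input_ids


print(check_forward_signature(GoodModel.forward))  # []
print(check_forward_signature(BadModel.forward))
```

The check only looks at named parameters, so HF-runtime arguments smuggled through **kwargs would still need a manual review.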

Phase 6 — Hierarchical Tests

If HF modules exist in the installed transformers: import them directly (e.g., from transformers.models.deepseek_v3.modeling_deepseek_v3 import DeepseekV3ForCausalLM). Wrap imports in _get_hf_*_class() try/except helpers that return None on ImportError, and use pytest.skip when None.
If HF modules are NOT in the installed transformers: copy the minimal module definitions from the HF modeling_*.py source into the test file as standalone reference classes. This keeps tests self-contained without requiring a specific transformers version or HF cache at test time. Important: make sure the copy is minimal and strictly faithful to the HF implementation only. Do NOT tweak the functionality of the reference. The same applies to config classes that use trust_remote_code (i.e., not available in transformers): copy a minimal faithful version into the test file. The modeling file should NOT import the config class — AD loads it at runtime via AutoConfig.from_pretrained(..., trust_remote_code=True). The test-only config copy lets you verify config-wrapping behavior (e.g., structure of state_dict).
Weight conversion helpers: Write test-only helpers for any weight format differences between HF and custom (e.g., RoPE de-interleaving, stacked-to-per-expert MoE weights, gate weight key remapping). For full-model tests, prefer using load_state_dict pre-hooks already registered on the custom model.

Numerical comparison: For equivalence tests comparing custom ops against HF reference, use the shared assert_rmse_close utility from _model_test_utils:

from _model_test_utils import assert_rmse_close

Recommended rmse_ratio_tol values for bfloat16:

Identical math (MLP, Norm): use torch.testing.assert_close with tight rtol/atol (1e-3)
MoE block (fused routing): 0.02
Decoder layer / MoE layer / full model: 0.05
Attention: 0.10
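
To give a feel for these tolerances, here is one plausible reading of an RMSE ratio: the RMSE of the difference divided by the RMS of the reference. This is an assumption for illustration only; it is not the implementation of assert_rmse_close, whose signature and exact metric are not shown here. It uses plain Python lists instead of tensors.

```python
import math


def rmse_ratio(actual, expected):
    """RMSE of (actual - expected) divided by the RMS of expected (assumed definition)."""
    if len(actual) != len(expected):
        raise ValueError("length mismatch")
    n = len(expected)
    rmse = math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, expected)) / n)
    ref_rms = math.sqrt(sum(e * e for e in expected) / n)
    return rmse / ref_rms


# Tolerances copied from the recommended values for bfloat16.
RMSE_RATIO_TOL = {"moe_block": 0.02, "layer_or_full_model": 0.05, "attention": 0.10}

expected = [1.0, -2.0, 3.0, -4.0]
actual = [1.01, -2.02, 2.97, -4.04]  # every element is off by 1%
ratio = rmse_ratio(actual, expected)
print(round(ratio, 4))                       # 0.01
print(ratio < RMSE_RATIO_TOL["moe_block"])   # True
```

Under this definition a uniform 1% relative error yields a ratio of about 0.01, which would pass all three thresholds.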

Bottom-up levels (each must pass before next):

Block equivalence — Test MLP, Attention, MoE, Norm individually: same weights + same input → assert_rmse_close (or torch.testing.assert_close for identical-math blocks).
Layer equivalence — Full decoder layer. If model has heterogeneous layers (dense vs MoE, attention vs SSM), test each type separately.
Full model equivalence — End-to-end logits comparison. Use a small config with <10 layers that covers the essence of the architecture (e.g., at least one of each layer type).
Export test — torch_export_to_gm with Dim.DYNAMIC for batch+seq, verify finite output, test a second shape.

Phase 7 — Independent Review (MANDATORY)

Invoke the ad-onboard-reviewer subagent with ONLY the following information:

Model name
Path to the model file created
Path to the test file created

Do NOT include your own assessment of correctness. Do NOT summarize what you did. Let the reviewer read the files and judge independently.

If the reviewer returns FAIL on any item:

Read the reviewer's specific failure reasons and file:line references
Fix each failed item
Invoke the reviewer again with the same minimal inputs
Repeat until you get a full PASS

Do NOT proceed to Phase 8 until the reviewer returns PASS.

Phase 8 — Create or Update Model Registry Entries (Including Family)

Before running the model end-to-end, ensure it and all identified family members from Phase 1 have valid entries in the AutoDeploy model registry at examples/auto_deploy/model_registry/.

For each model (the requested model + any family members identified in Phase 1 Step 2):

Check examples/auto_deploy/model_registry/models.yaml for an existing entry matching the model's HF id.
If the entry is missing, add it with the appropriate yaml_extra list:
- Always include dashboard_default.yaml first.
- Pick world_size_N.yaml based on model size (1 for <2B, 2 for 2-15B, 4 for 20-80B, 8 for 80B+). The world_size determines how many GPUs are needed for the run.
- Add model-specific YAML if the model needs custom settings (e.g., model_kwargs, non-default transforms).
If a model-specific config YAML is needed and doesn't exist, create it under examples/auto_deploy/model_registry/configs/. See existing configs for format examples.
If the entry exists but needs changes (e.g., wrong world_size, missing model-specific config), update it.
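
The yaml_extra selection rules can be sketched as a small helper. The function names are hypothetical. The size guideline leaves 15B to 20B unspecified, so the sketch raises for that range instead of guessing, and it assumes that exactly 80B maps to world size 8.

```python
def pick_world_size(num_params_billion: float) -> int:
    """Map model size (billions of parameters) to a world size per the guideline."""
    if num_params_billion < 2:
        return 1
    if num_params_billion <= 15:
        return 2
    if num_params_billion < 20:
        raise ValueError("15B-20B is not covered by the guideline; choose manually")
    if num_params_billion < 80:
        return 4
    return 8  # assumption: exactly 80B counts as "80B+"


def yaml_extra_for(num_params_billion, model_yaml=None):
    """Build the yaml_extra list: dashboard default first, then world size, then model YAML."""
    extras = ["dashboard_default.yaml", f"world_size_{pick_world_size(num_params_billion)}.yaml"]
    if model_yaml:
        extras.append(model_yaml)
    return extras


print(yaml_extra_for(7))  # ['dashboard_default.yaml', 'world_size_2.yaml']
```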

Family members that share the same architecture should all use the same modeling code. Different sizes only need different world_size_N.yaml entries and maybe different sharding configurations.

See examples/auto_deploy/model_registry/README.md for full documentation on the registry format and best practices.

Phase 9 — AutoDeploy End-to-End Run

⚠️ MANDATORY: You MUST use the standalone config YAML with `--args.yaml-extra` ⚠️

You MUST run the model using the standalone config YAML created in Phase 8. The same YAML will be referenced by the cookbook's trtllm-serve command in Phase 11. The command is:

CUDA_VISIBLE_DEVICES=<SELECTED_GPUS> python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml

If the run FAILS:

Fix the standalone config YAML — update settings in examples/auto_deploy/model_registry/configs/<model>.yaml and re-run.
The standalone config YAML is the source of truth. If it is wrong, fix it. If it is missing settings, add them. The model MUST work via this YAML before you are done.

Run in two steps:

Step 1: Reduced num layers. Run with a reduced number of layers to test the e2e flow for issues and iterate faster. The generation will be bad in step 1 because not all layers are loaded.

Step 2: Full layers. Run with the full number of layers. The generation should be coherent in step 2.

For each step, invoke the ad-run-agent subagent to run the model through AutoDeploy on GPU. Pass it:

Model HF ID: the HuggingFace model-id (or local checkpoint path) used throughout onboarding
Standalone config YAML path: the path to the config YAML under examples/auto_deploy/model_registry/configs/
Description: a short description of the current state, e.g.:
- "first try after onboarding"
- "updated yaml with reduced layers"
- "changed attention backend to torch_mha"
- "fixed weight loading hooks"

The model is run via:

CUDA_VISIBLE_DEVICES=<SELECTED_GPUS> python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml
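
If the command is launched from a script, it can be assembled as below. This is a convenience sketch only: build_ad_command is a hypothetical helper, and the model ID and YAML path are placeholders rather than real entries.

```python
import os
import shlex


def build_ad_command(model_id, config_yaml, gpus):
    """Return (env, argv) for the build_and_run_ad.py invocation shown above."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(str(g) for g in gpus))
    argv = [
        "python",
        "examples/auto_deploy/build_and_run_ad.py",
        "--model", model_id,
        "--args.yaml-extra", config_yaml,
    ]
    return env, argv


env, argv = build_ad_command(
    "org/model",  # placeholder model ID
    "examples/auto_deploy/model_registry/configs/model.yaml",  # placeholder path
    gpus=[0, 1],
)
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], shlex.join(argv))
# To actually launch: subprocess.run(argv, env=env, check=True)
```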

The ad-run-agent will determine the required world_size from the config YAML, check GPU availability via nvidia-smi, select free GPUs, and wait if not enough are available.

The ad-run-agent will build+run the model, check generation quality, archive logs, and update its worklog.

If the run fails or produces bad generation:

Read the ad-run-agent's worklog and log file to understand the error
Fix the issue (model code, standalone config YAML, weight hooks, etc.)
Re-invoke the ad-run-agent with an updated description reflecting the change (e.g., "retry after fixing RoPE scaling in config")
Always re-run with --args.yaml-extra. Fix the standalone config YAML, don't work around it.
Repeat until the run succeeds with meaningful generation

Do NOT proceed to Phase 10 until the step 2 with full layers reports a successful run with coherent generation.

Important: The successful E2E run outputs (prompts and generated text) will be needed for the cookbook notebook in Phase 11 and the summary report in Phase 12. Save them.

Phase 10 — Update Model Support Matrix

After a successful E2E run, update the TensorRT-LLM model support matrix at docs/source/models/supported-models.md to include the newly onboarded model.

Read the current support matrix to understand the format and existing entries.
Add a row to the "Supported Models" table (the first table in the file) with:
- Architecture: The model's architecture class name (e.g., MiniMaxM2ForCausalLM) — use the class name registered in Phase 4.
- Model: The model family/display name (e.g., MiniMax M2/M2.1/M2.7).
- HuggingFace Example: A representative HF model ID (e.g., MiniMaxAI/MiniMax-M2.7).
- Place the new row alphabetically by architecture class name to keep the table sorted.
If the model is AutoDeploy-only (i.e., it does NOT have native PyTorch backend support in tensorrt_llm/_torch/models/), add a footnote indicating AutoDeploy support with a link to the AD config YAML, following the pattern of existing AD-only models (e.g., [^N]: Supported via the [AutoDeploy](../features/auto_deploy/auto-deploy.md) backend. See [AD config](../../../examples/auto_deploy/model_registry/configs/<model>.yaml).).
If the model warrants an entry in the Model-Feature Support Matrix (second table — typically for key/flagship models), add a row there too. For newly onboarded AD models, most advanced features should be marked Untested unless you have verified them. Use existing AD model entries (e.g., Glm4MoeLiteForCausalLM) as a reference for which features to mark as supported vs untested.

Phase 11 — Create AutoDeploy Cookbook

Create an AutoDeploy cookbook notebook for the model, following the pattern of existing cookbooks.

Use examples/auto_deploy/cookbooks/glm_4.7_flash_trtllm_cookbook.ipynb as the template. Copy its structure exactly.
Create the new notebook at examples/auto_deploy/cookbooks/{model_name}_trtllm_cookbook.ipynb, using a snake_case version of the model name (e.g., minimax_m2.7_trtllm_cookbook.ipynb).
Adapt all model-specific content:
- Title and description: update the model name, HF model ID, and description.
- Model Resources: update links to the model's HuggingFace card, blog posts, technical reports, API platform, and community links. Search the web or the model's HF card for relevant URLs.
- Model Highlights: update architecture details (e.g., MoE params, context length, special features like tool calling, interleaved thinking, etc.) from the model card.
- Prerequisites: update VRAM requirements based on model size and precision.
- trtllm-serve command: update the model ID and use --extra_llm_api_options pointing to the standalone AD config YAML under examples/auto_deploy/model_registry/configs/ (e.g., examples/auto_deploy/model_registry/configs/glm-4.7-flash.yaml). This is the same standalone config YAML validated in Phase 9 via build_and_run_ad.py --args.yaml-extra. It is self-contained — it includes all the settings trtllm-serve needs (compile backend, batch size, seq len, transforms, etc.).
- OpenAI client MODEL_ID: update to the correct HF model ID.
- Evaluation Parameters: update recommended inference parameters from the model's documentation/model card.
- Additional Resources: update all links to be model-specific.
Do NOT include cell outputs in the committed notebook — the notebook should be clean with no pre-run outputs, so users run it themselves. (Exception: if the model was already run and outputs were captured during Phase 9, you may include them for reference, but this is optional.)
Verify the notebook is valid JSON — malformed .ipynb files will not render on GitHub or in Jupyter.
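
The last two checks can be done with the standard json module. The sketch below parses the notebook text (which fails loudly on malformed JSON) and clears code-cell outputs; clean_notebook is a hypothetical helper, and the tiny notebook in the demo is made up.

```python
import json


def clean_notebook(text: str) -> str:
    """Parse notebook JSON, drop code-cell outputs, and return the cleaned JSON.

    Raises json.JSONDecodeError if the notebook is malformed.
    """
    nb = json.loads(text)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb, indent=1, ensure_ascii=False) + "\n"


# Minimal made-up notebook with one executed code cell.
raw = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "source": ["print(1)"],
         "execution_count": 3,
         "outputs": [{"output_type": "stream", "name": "stdout", "text": ["1\n"]}]},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
})
cleaned = json.loads(clean_notebook(raw))
print(cleaned["cells"][1]["outputs"])  # []
```

If outputs captured in Phase 9 are kept for reference, only the json.loads validity check applies.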

Phase 12 — Summary Report

⚠️ MANDATORY: You MUST include ALL raw prompts and generated outputs from the final `build_and_run_ad.py --args.yaml-extra` run ⚠️

Print the following to the console (do not write it to a file) after completion:

Model overview + unique features
Tricky parts needing human review
Files created/modified (including any new registry configs)
Test results table (name | validates | PASS/FAIL)
Known limitations
Reviewer result (PASS + how many review iterations it took)
AD end-to-end run result (success/fail, number of iterations, final generation quality)
Registry entry added/updated in models.yaml and any new config YAMLs created
ALL raw prompts and their corresponding generated outputs from the final successful build_and_run_ad.py --args.yaml-extra run. Copy-paste the COMPLETE prompt→output pairs verbatim from the run log. Do NOT summarize, truncate, or paraphrase them. The user needs to see exactly what the model generated to judge quality.
Model support matrix update — confirm the row was added to docs/source/models/supported-models.md and which footnote (if any) was used.
AutoDeploy cookbook created — path to the new notebook file (examples/auto_deploy/cookbooks/<model>_trtllm_cookbook.ipynb).

Phase 13 — Prepare a Pull Request

Include the following in the PR:

ALL raw prompts and their complete generated outputs from the final successful build_and_run_ad.py --args.yaml-extra run. Copy-paste the COMPLETE prompt→output pairs verbatim; do NOT summarize, truncate, or paraphrase. The reviewer needs to see exactly what the model generated.
A reproducible command:

python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml

A detailed pytest command for the unit tests you added so they can be run by the reviewer as well. Make sure you have run this pytest command on the latest commit that you are pushing, and include these results in the PR.

⚠️ MANDATORY: Re-run and re-post logs on EVERY PR update — NO EXCEPTIONS ⚠️

Every single time you push changes to the PR — whether it is a new commit, a rebase, an amendment, a fixup, or any other update — you MUST:

Re-run build_and_run_ad.py --args.yaml-extra using the ad-run-agent subagent, exactly as in Phase 9. The code has changed, so previous run results are stale and invalid.
Re-run the full unit test suite (pytest <test_file> -v) for the model's test file created in Phase 6. Previous test results are stale and invalid after any code change.
Post ALL raw output from both runs as a PR comment:
- The COMPLETE prompt→output pairs from build_and_run_ad.py verbatim — do NOT summarize, truncate, or paraphrase.
- The COMPLETE pytest output verbatim — every test name, every PASSED/FAILED line, every error traceback if any. Do NOT summarize or cherry-pick.

Workflow for every PR update cycle:

Make the requested code changes
Commit the changes
Before pushing, always rebase onto the target branch to check for conflicts: git fetch upstream && git rebase upstream/main. If there are conflicts, resolve them before proceeding. Do NOT push without rebasing first — the branch must be up-to-date with the target branch.
Push (or force-push if rebase rewrote history)
Re-invoke the ad-run-agent to run build_and_run_ad.py --model <MODEL-ID> --args.yaml-extra examples/auto_deploy/model_registry/configs/<model>.yaml on the updated code
Re-run the unit tests: pytest <test_file> -v
Wait for both runs to complete
Post a reply to every PR comment containing:
- A brief description of what changed in this update
- The COMPLETE raw prompts and generated outputs from the build_and_run_ad.py run
- The COMPLETE raw pytest output (full verbatim log)
- The reproducible commands used for both runs
Resume polling for new comments (see below)

⚠️ MANDATORY: Poll PR for new comments every 5 minutes ⚠️

How to poll:

# Fetch all PR comments, sorted newest-first, and check for any posted after your last comment
GH_CONFIG_DIR=<path> gh api "repos/<owner>/<repo>/pulls/<PR_NUMBER>/comments?sort=created&direction=desc&per_page=10"
# Also check issue-level comments (top-level PR comments, not inline review comments)
GH_CONFIG_DIR=<path> gh api "repos/<owner>/<repo>/issues/<PR_NUMBER>/comments?sort=created&direction=desc&per_page=10"
# Also check the PR's review status
GH_CONFIG_DIR=<path> gh pr view <PR_NUMBER> --json reviews,state

Polling loop behavior:

After posting your PR (or posting an update comment), immediately start polling every 5 minutes.
On each poll, check for:
- New review comments (inline or top-level) posted after your last comment's timestamp
- PR approval status — check if the PR has been approved
- Termination signals — any comment clearly indicating the agent's work is done (e.g., "LGTM", "looks good, we're done", "no more changes needed", "agent work complete", or similar)
If new actionable comments are found: stop polling, process the feedback, and execute the full PR update cycle (steps 1–8 above). After posting the update, resume polling.
If the PR is approved or a termination signal is found: stop polling, report to the user that the PR review cycle is complete, and end.
If no new comments are found: sleep 5 minutes and poll again.
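
The decision made on each poll can be sketched as a pure function over the fetched data. Several things here are assumptions: the helper name, that comments carry created_at (ISO 8601 UTC strings in one format, so string comparison orders them) and body fields, that reviews carry a state field with the value "APPROVED", and that approval or a termination phrase takes priority over other new comments. Filtering out the agent's own comments is left out.

```python
POLL_INTERVAL_SECONDS = 5 * 60
TERMINATION_PHRASES = (
    "lgtm",
    "looks good, we're done",
    "no more changes needed",
    "agent work complete",
)


def next_action(comments, reviews, last_posted_at):
    """Return "done", "update", or "sleep" for one poll result."""
    new = [c for c in comments if c["created_at"] > last_posted_at]
    if any(r.get("state") == "APPROVED" for r in reviews):
        return "done"
    if any(p in c["body"].lower() for c in new for p in TERMINATION_PHRASES):
        return "done"
    if new:
        return "update"  # run the full PR update cycle, then resume polling
    return "sleep"       # wait POLL_INTERVAL_SECONDS and poll again


# Made-up poll result for illustration.
comments = [
    {"created_at": "2026-05-25T10:00:00Z", "body": "Please rename this helper."},
    {"created_at": "2026-05-25T09:00:00Z", "body": "Older comment, already handled."},
]
print(next_action(comments, [], "2026-05-25T09:30:00Z"))  # update
```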

Sharding-aware IR model porting

For porting an existing custom model to a sharding-aware _ir.py variant, see the ad-sharding-ir-port skill.

Key Gotchas

Canonical ops first: Always use torch.ops.auto_deploy.torch_* canonical ops whenever one exists for the operation. This is how AD knows what to optimize. Writing manual attention, MoE, RoPE, or normalization in plain PyTorch instead of using the canonical op will prevent AD transforms from working.
No repeat_interleave: AD attention ops handle GQA natively. Never repeat K/V heads manually.
Lean code: Every line should serve prefill export. No optional HF features, no dead code paths, no fallback logic.
Reuse config classes: Import from transformers or load from checkpoint whenever possible. Only bundle a config class if it truly doesn't exist anywhere.
Assert position_ids: Always assert position_ids is not None — it is a required input, never optional.
Self-contained files only: Never import from other AD custom models. Each modeling_{name}.py is a standalone translation from HF source.
RoPE cos/sin: slice ONCE, not per layer. _ad_ prefix for RoPE buffers. RotaryEmbedding.forward(x, position_ids) MUST slice by position_ids once and return pre-sliced (cos, sin). Pass those tensors to all layers. NEVER pass position_ids through to each layer/attention forward to re-index — that is redundant compute that bloats the exported graph. See Phase 2 for the full pattern.
MoE weights: use nn.ModuleList per-expert for checkpoint compatibility. Write test-only state_dict converters for HF stacked format.
noaux_tc routers (DeepSeek-V3 style): use vanilla PyTorch (sigmoid + bias + group topk + normalize + scale). AD transforms can replace with fused trtllm kernels at deployment time.
Vision towers are typically not exported. Keep vision logic in eager PyTorch and export only the text path unless explicitly requested otherwise.
Model code and tests must run on CPU. Use only torch_* prefixed reference ops in AutoDeploy — never triton_*, flashinfer_*, or trtllm_*.
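
The "slice ONCE" rule for RoPE can be illustrated with a toy that uses plain Python lists in place of tensors. Everything in it is a stand-in: the class only mirrors the RotaryEmbedding.forward(x, position_ids) shape of the call, the interleaved pair rotation is just one RoPE layout, applying the rotation directly to the hidden rows is a simplification, and slice_calls exists only to make the single lookup visible.

```python
import math


class RotaryEmbedding:
    """Precomputed cos/sin rows, one per position (stand-in for the _ad_ buffers)."""

    def __init__(self, head_dim, max_positions, base=10000.0):
        inv_freq = [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
        self.cos = [[math.cos(p * f) for f in inv_freq] for p in range(max_positions)]
        self.sin = [[math.sin(p * f) for f in inv_freq] for p in range(max_positions)]
        self.slice_calls = 0

    def forward(self, x, position_ids):
        """Index by position_ids once and return pre-sliced (cos, sin); x is unused in this toy."""
        self.slice_calls += 1
        return [self.cos[p] for p in position_ids], [self.sin[p] for p in position_ids]


def layer_forward(hidden, cos, sin):
    """Toy layer: rotates each (even, odd) pair; receives cos/sin, never position_ids."""
    out = []
    for row, c, s in zip(hidden, cos, sin):
        rotated = []
        for i in range(len(c)):
            x0, x1 = row[2 * i], row[2 * i + 1]
            rotated += [x0 * c[i] - x1 * s[i], x0 * s[i] + x1 * c[i]]
        out.append(rotated)
    return out


def model_forward(hidden, position_ids, rope, num_layers):
    assert position_ids is not None
    cos, sin = rope.forward(hidden, position_ids)  # sliced once here
    for _ in range(num_layers):
        hidden = layer_forward(hidden, cos, sin)   # same pre-sliced values reused
    return hidden


rope = RotaryEmbedding(head_dim=2, max_positions=4)
out = model_forward([[1.0, 2.0]], [0], rope, num_layers=3)
print(out)               # [[1.0, 2.0]] (position 0 is the identity rotation)
print(rope.slice_calls)  # 1
```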
