| name | trtllm-model-onboard-multimodal |
| description | Onboard a HuggingFace multimodal model (vision/audio/video + text) to the TensorRT-LLM PyTorch backend. Use when writing a new `tensorrt_llm/_torch/models/modeling_<vlm>.py` plus its input processor and weight mapper, or extending an existing VLM. Not for AutoDeploy — use `ad-model-onboard` for that path.
|
| license | Apache-2.0 |
| metadata | {"author":"NVIDIA Corporation"} |
TensorRT-LLM Multimodal Model Onboarding (PyTorch backend)
Scope. PyTorch backend only (tensorrt_llm/_torch/) — the default for LLM(..., backend="pytorch"), trtllm-serve, trtllm-bench. Not for AutoDeploy (tensorrt_llm/_torch/auto_deploy/); use ad-model-onboard for that.
Output:
tensorrt_llm/_torch/models/modeling_{name}.py — wrapper class (multimodal encoder + LLM) decorated with @register_auto_model, @register_vision_encoder, @register_input_processor (and @support_multimodal_disaggregated if EPD is supported), plus a BaseMultimodalInputProcessor (+ BaseMultimodalDummyInputsBuilder) subclass.
_torch/models/checkpoints/hf/{name}_weight_mapper.py if HF prefixes need surgery.
- Per-model unit test under
tests/unittest/_torch/modeling/test_modeling_<name>.py (subclass of TestModelingMultimodal); supplemental utility tests under tests/unittest/_torch/multimodal/ if needed; an accuracy test under tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py; support-matrix entry; verified trtllm-serve flow.
System map
Aggregated path (default)
[1] API event loop (server-side, async)
The chat handler wraps each image/video/audio URL part as an async_load_*
coroutine (not yet awaited). apply_chat_template builds the text prompt.
asyncio.gather then decodes all media for one request in parallel.
[2] Input pipeline (asyncio.to_thread, off the event loop)
BaseMultimodalInputProcessor.__call__:
HF AutoProcessor → pixel_values + token_ids
mm-token layout (positions / lengths / special_token_offsets)
(mRoPE) mrope_position_ids + deltas computed on CPU
_postprocess: HF mm token ids → tllm_multimodal_token_id (OOV sentinel)
The framework wrapper around your processor computes blake3 content hashes
for KV-cache reuse.
MultimodalParams.to_handle("multimodal_data") at the end → each tensor in
multimodal_data is replaced by a small dict pointing at its CUDA-IPC / shm
handle, so the broadcast in [3] carries pointers, not megabytes of pixels.
[3] Worker fan-out (TP / PP / CP)
Each worker rebuilds local tensor views via to_tensor("multimodal_data").
multimodal_input (hashes / positions / lengths) is forwarded to the C++
executor to drive KV-cache hash matching.
[4] Per-iteration staging (model engine)
Context: build MultimodalRuntimeData (positions / lengths / chunk bounds)
→ push pixel_values to CUDA pinned + non_blocking, obeying the
model's multimodal_data_device_paths declaration; pad
mrope_position_ids into a preallocated CUDA buffer.
Generation (mRoPE only): strip everything except mrope_position_deltas.
Post-prefill: drop mm_data so it doesn't ride along in decode.
[5] Model.forward(attn_metadata, input_ids, position_ids, multimodal_params=…)
get_multimodal_embeddings: runs encoder.forward only on params whose
multimodal_data["multimodal_embedding"] is empty (chunked-prefill iter
2+ hits the per-request cache; results written back automatically).
find_input_mm_embeds: slices the cached embedding to the current chunk
under chunked prefill / KV-cache reuse.
prepare_mrope_config (mRoPE models): one-shot mrope_rotary_cos_sin per
request from the staged mrope_position_ids buffer.
fuse_input_embeds: text + mm merged via precomputed indices
(with optional extra_embeds for multi-feature encoders).
self.llm.forward(inputs_embeds=..., mrope_config=...) → logits.
Key invariants:
- [1] and [2] both run off the API event loop. [1] fans media decode out across one request's items with
asyncio.gather; [2] is single-threaded per request because the HF processor is request-scoped.
- The producer hands off as handles at the end of [2], so the broadcast in [3] stays small (Contract 3).
- [4] is the only per-iteration GPU staging; H2D is
non_blocking=True from pinned host memory.
- [5] runs on the compute stream and must be sync-free (Contract 1).
EPD-disaggregated path
When @support_multimodal_disaggregated is set and the deployment uses TLLM_MULTIMODAL_DISAGGREGATED=1:
- Encoder worker: runs as a standalone
MultimodalEncoder (mm_encoder_only=True). It executes only the multimodal encoder and ships mm_embeddings (+ mRoPE position ids/deltas) to prefill+decode workers as shared-tensor handles.
- Prefill+decode worker: the model's
__init__ skips constructing self.mm_encoder when _is_disagg() is true; the input processor's attach_multimodal_embeddings() override binds the encoder handles into the request. For context-only requests, the engine re-clones mrope tensors so IPC handles outlive the encoder worker's freed memory — replicate that pattern for any new GPU-resident mm tensors.
Templates to study
modeling_qwen3vl.py, modeling_llava_next.py, and modeling_gemma3vl.py are the canonical references — fully-ported encoder, single-class wrapper, text_config-based LLM resolution. Other examples by modality: modeling_pixtral.py, modeling_phi4mm.py (audio), modeling_mllama.py, modeling_hyperclovax.py, modeling_mistral_large3.py. Pick the closest one (modality + LLM family + RoPE variant). modeling_qwen2vl.py retains an HF-passthrough vision tower for the outdated Qwen2-VL family — read it for context but don't copy that pattern.
Reuse before you write (the most important rule)
Every common block has a TRT-LLM implementation; compose, don't reimplement. Hand-rolled nn.Linear / nn.LayerNorm / nn.MultiheadAttention silently work in fp16/bf16 single-GPU eager and silently break under quantization, TP, attention-backend selection, KV cache, and CUDA graphs. Browse tensorrt_llm/_torch/modules/ before writing a layer; the reference VLMs (modeling_qwen3vl.py, modeling_llava_next.py, modeling_gemma3vl.py) show canonical wiring. Reuse is also where every future perf improvement lands automatically.
Compute modules
Mappings most often missed by adapters:
| Concern | Module | Non-obvious wiring |
|---|
| Linear | _torch.modules.linear.Linear | Pass mapping=model_config.mapping, tensor_parallel_mode=TensorParallelMode.{COLUMN,ROW,NONE}, allreduce_strategy=model_config.allreduce_strategy. Every quant scheme (FP8 / NVFP4 / W4A8 / AWQ / weight-only) is automatic — never substitute nn.Linear. |
| Attention (text and vision) | _torch.modules.attention.Attention (variants: qk_norm_attention.QKNormRoPEAttention for QK-norm + YARN; attention.MLA for DeepSeek-style) | Same module runs the LLM and the multimodal encoder. For the encoder side, build an ad-hoc attn_metadata per forward and pass predefined_attention_mask=PredefinedAttentionMask.FULL (or windowed). Reference: Qwen2_5_VLVisionAttention.prepare_attn_metadata. |
| MLP / Gated MLP | _torch.modules.mlp.MLP, _torch.modules.gated_mlp.GatedMLP, _torch.modules.swiglu.swiglu | GatedMLP covers the SwiGLU pattern (gate + up + silu + down) with fused gate/up weights — don't roll it from two Linears and F.silu. Plain MLP for non-gated cases. Both inherit the same TP / quant story as Linear. Reference: Qwen2_5_VLMLP. |
| RoPE | _torch.modules.rotary_embedding.{RotaryEmbedding, MRotaryEmbedding} | MRotaryEmbedding (mRoPE) is for the LLM side of mRoPE-using VLMs (Qwen-VL family), with mrope_section-aware cos/sin slicing and 3D position_ids. The encoder's internal 2D RoPE uses plain RotaryEmbedding. |
LLM backbone — reuse via AutoModel
The inner LLM is loaded via TRT-LLM's own AutoModelForCausalLM (tensorrt_llm._torch.models.modeling_auto), not transformers.AutoModelForCausalLM. It dispatches on pretrained_config.architectures[0] to whichever class is registered via @register_auto_model. The canonical wiring (using text_config to surface the inner LLM) lives in the Phase 2 template; if the inner LLM doesn't yet have a TRT-LLM modeling file, finish that text-only onboarding first.
Multimodal encoder — port to TRT-LLM modules
This is required, not a preference. Re-implement encoder blocks from _torch.modules.*; the encoder builds its own attn_metadata via prepare_attn_metadata. Reference: Qwen2_5_VisionModel, Qwen2_5_VLVisionAttention, Qwen2_5_VLPatchMerger. Two reasons:
- Performance. HF-eager runs on PyTorch's stock kernels — vanilla SDPA,
nn.LayerNorm, plain nn.Linear — losing TRT-LLM-attention / FlashInfer, fused RMSNorm, FP8/NVFP4/AWQ Linear, TP, and CUDA-graph capture for static-shape paths. For a 0.5–7 B encoder running every prefill, the regression compounds each iteration.
- Version coupling. Every
from transformers.models.<family> import <X> ties the modeling file to a specific transformers release. Upstream HF refactors (renamed classes, signature changes, internal helper migrations) silently break TRT-LLM imports months later, often surfacing only when users upgrade their environment. Porting cuts the dependency. The same applies to importing HF computations / helper functions, not just modules — keep both out of new modeling files.
The lone existing exception is Qwen2VLModel, which keeps Qwen2VisionTransformerPretrainedModel from transformers because Qwen2-VL is an outdated family on life support — not because passthrough is acceptable for new onboarding. Don't copy that pattern. If you genuinely cannot port (e.g. patching an existing legacy model), the HF import must carry a code comment explicitly justifying it; otherwise PR review will bounce.
Weight loading — reuse mappers
If HF prefixes don't match (model.vision_tower.* → mm_encoder.*, fused/un-fused QKV, etc.), inherit from a related mapper rather than ad-hoc translation. Reference: _torch/models/checkpoints/hf/qwen2vl_weight_mapper.py, qwen3vl_weight_mapper.py.
Host memory during init / weight loading. Large VLMs can blow past host RAM if every rank materializes the full state_dict before sharding. Two patterns from modeling_nemotron_nano.py:NemotronH_Nano_VL_V2 (PR #13283):
- Defer multimodal-encoder construction out of
__init__ and into load_weights() when the encoder contains HF submodules whose deterministic init ops (ones_, zeros_, fill_, .detach(), .to(dtype=...)) clash with the LLM's MetaInitMode fast path. Snapshot the multimodal ModelConfig in __init__ (since post_config() overwrites self.model_config.pretrained_config to the LLM-only config), construct the encoder + .to("cuda") inside load_weights(). Otherwise MetaInitMode raises and the entire model falls back to slow CPU init.
- Call
weights.mark_consumed(<prefix>) after each sub-module's load_weights(...) so the mmap-backed shards behind those weights can be released. Without it, peak host memory holds the entire checkpoint; with it, peak holds only the shard you're currently loading. Tag every prefix you've finished — encoder, sound, projector, LLM.
Don't touch
PyExecutor + the C++ core own AttentionMetadata, KV cache, scheduler, sampler, decoder. Your model receives attn_metadata and multimodal_params as inputs and returns logits — never builds request-level metadata. The only attn_metadata you build yourself is the multimodal encoder's own, on the synthetic per-item batch (concatenated patches with per-image seqlens, mel frames, etc.).
Performance contracts
Three rules. Multimodal prefill is long (image/audio tokens balloon sequence length) and media tensors are big (MBs–GBs); the overlap scheduler hides host work behind GPU work only if all three hold.
Contract 1 — Zero CPU-GPU syncs inside forward
A single sync inside forward collapses overlap, and per-iteration GPU work is long for VLMs, so stalls compound.
Banned in forward and anything it calls:
.item(), .tolist(), int(t), bool(t), float(t) on GPU tensors
t.cpu(), t.to("cpu"), any device-crossing read
- Python
if/while on tensor values (shape is fine; values are not)
torch.nonzero, single-arg torch.where(condition) (index form; documented sync hazard in filter_mm_token_from_input_ids when run on GPU input_ids), torch.unique, masked_select
torch.tensor([...], device="cuda") from a Python list (hidden H2D)
- HF runtime branches (
if pixel_values is None: ...) that change tensor shapes
Three-arg torch.where(cond, x, y) is fine when cond is built only on-device (no scalar readback). fuse_input_embeds: kwargs text_token_indices + mm_token_indices together ⇒ skip internal filter_*. trtllm-serve usually supplies both via model_engine.py (CPU-side index build → inputs → fuse_input_embeds(..., **kwargs)). Pure-text batches have no MM inputs; bare unit tests / direct calls may omit indices ⇒ in-model filter_* runs.
Patterns:
- Static graph for mixed batches. Don't add
if has_mm: branches. find_input_mm_embeds returns input unchanged when runtime is None; fuse_input_embeds returns (input_ids, None) when mm_embeds == [] — preserve that contract.
- mRoPE: compute once per request, never per layer. The pipeline (input processor → engine →
prepare_mrope_config) is laid out in the system map; the constraint here is that per-layer attention must read pre-sliced (cos, sin) — never recompute mrope inside the decoder loop.
Audit: grep for the banned constructs; run one prefill iteration with torch.cuda.set_sync_debug_mode("warn") and confirm zero warnings from your model.
Contract 2 — Preprocessing on CPU, async, server-side
CPU-bound work (decode / resize / normalize / mel-spectrogram / frame extraction) must not compete with GPU work, block the request loop, or serialize across requests.
- HF AutoProcessor + image_processor + tokenizer run inside
BaseMultimodalInputProcessor.__call__ — not in the model worker.
- URL/bytes media goes through
async_load_image / async_load_video / async_load_audio (all wrap blocking decode in asyncio.to_thread). Never call PIL.Image.open(...).load() / cv2.VideoCapture / soundfile.read synchronously on the request hot path.
- Pin host tensors before H2D with
prefer_pinned() (False under Confidential Compute (CC), True otherwise). The engine pins multimodal_data automatically via to_device(..., pin_memory=prefer_pinned()).
- Declare
multimodal_data_device_paths on the model — list of dotted paths (e.g. ["image.pixel_values", "image.image_grid_thw", "video.pixel_values_videos", "video.video_grid_thw", "multimodal_embedding"]) telling the engine which fields go to CUDA. Anything not listed stays on CPU.
- Optional (refactor pending):
get_text_with_mm_placeholders + expand_prompt_token_ids_for_mm enable the tokenized+MM fast path (tokenized_multimodal_process), skipping redundant detokenization. A cleaner alternative is being designed — skip unless you have a specific need.
- Forward
mm_processor_kwargs from inputs.get("mm_processor_kwargs", {}) to the HF processor (callers tune things like video sample rate via this).
Contract 3 — Large media via shared tensors, never raw pickle
A 1024×1024 fp32 patch tensor is ~12 MB; a video clip can be hundreds of MB. Naive pickle through MPI broadcast turns the leader into the IPC bottleneck.
- Always use
MultimodalParams.to_handle/to_tensor. to_handle swaps each tensor inside multimodal_data for a small dict — {method_key, tensor_size, storage_handle, ...} — that points at the same memory: a CUDA-IPC handle for GPU tensors (REBUILD_CUDA) or a POSIX-shm handle for CPU tensors (REBUILD_CPU). The dict is a few hundred bytes regardless of the original tensor size. Consumers call to_tensor to rebuild local tensor views from the handle. See _torch/shared_tensor/.
- Where it crosses ranks: the executor broadcasts
py_multimodal_data via dist.broadcast / tp_cp_broadcast / PP send-recv. Payload size = the literal byte size of whatever's in py_multimodal_data — confirm every tensor inside has been swapped for its handle dict (i.e. to_handle ran) before this point.
- Strip after prefill.
_strip_py_multimodal_data_post_prefill clears everything except mrope_config.mrope_position_deltas. If your model needs to retain something across decode, update strip_mm_data_for_generation explicitly.
- EPD disagg. Embeddings still cross workers as shared tensors, not bytes — see the EPD-disaggregated path section above for the encoder/prefill-worker split.
- Hashes are small; broadcast eagerly.
MultimodalInput.multimodal_hashes (blake3) drives KV-cache reuse — never substitute raw pixels for them.
Audit: payload size in NVTX broadcast_requests / tp_broadcast_requests ranges should be < 1 MB per rank per request. More means a broadcast leaked raw tensors.
Contract 4 — Batch the multimodal encoder across requests
get_multimodal_embeddings hands the encoder a list of MultimodalParams covering every uncached request in the current batch. The encoder must consume that list as a single batched forward pass — concatenate every request's pixel_values / image_grid_thw / mel frames into one tensor, build one ad-hoc attn_metadata whose seq_lens carries per-image boundaries, and run the encoder blocks once. Looping for p in mm_params: encoder.forward(p) loses kernel-launch coalescing and serializes N requests' worth of encoder work.
Pattern (Qwen2.5-VL). Qwen2_5_VisionModel concatenates every request's pixel_values into one [total_patches, ...] tensor and builds attn_metadata with batch_size=1 and seq_lens=[img1_patches, img2_patches, ...]. The TRT-LLM Attention module respects seq_lens so cross-image attention doesn't bleed. The patch merger / projector at the end then splits the result back per-request via torch.split over the same lengths (this is what _cache_multimodal_embeddings expects too).
Audit. Under load with several multimodal requests in one batch, the encoder kernels in nsys should appear as one wide block per iteration, not N narrow blocks. A fan of N narrow blocks means the encoder is being looped per request instead of batched — one of the easiest VLM perf regressions to introduce while refactoring.
Phases
Phase 0 — Gather resources
huggingface-cli download {org}/{model} --exclude "*.safetensors" "*.bin" "*.pt" "*.gguf"
Confirm preprocessor_config.json and chat_template.json are pulled. Verify AutoProcessor.from_pretrained(model_path) loads. Estimate LLM + multimodal encoder params for VRAM sanity (multimodal encoders are often 0.5–7 B on top of the LLM).
Phase 1 — Survey existing coverage
Read config.json's architectures and model_type. If a _torch/models/modeling_*.py already claims that architecture via @register_auto_model, extend rather than create new. Identify the closest existing multimodal model and note which TRT-LLM modules it reuses.
Phase 2 — Model wrapper
Create tensorrt_llm/_torch/models/modeling_{name}.py. The default pattern below mirrors modeling_llava_next.py and modeling_gemma3vl.py — a single wrapper class that composes a multimodal encoder + an LLM resolved through AutoModelForCausalLM.from_config(text_config). The *ModelBase + *Model Base/non-Base split in modeling_qwen2vl.py and modeling_qwen3vl.py is an implementation detail for sharing one wrapper between two variants of the same family (Qwen2-VL ↔ Qwen2.5-VL; Qwen3-VL ↔ Qwen3-VL-MoE) — keep the wrapper a single class unless you have the same multi-variant need.
class {Name}VisionModel(nn.Module):
"""Multimodal encoder. Composes _torch.modules.{Attention,Linear,RMSNorm,GatedMLP,RotaryEmbedding}."""
def forward(self, multimodal_params: List[MultimodalParams]) -> torch.Tensor:
...
class {Name}Model(PreTrainedModel):
config_class = {Name}Config
def __init__(self, model_config: ModelConfig[PretrainedConfig], *args, **kwargs):
config = model_config.pretrained_config
super().__init__(config)
if hasattr(self, "llm"):
return
if not _is_disagg():
self.mm_encoder = {Name}VisionModel(model_config)
else:
self.mm_encoder = None
llm_model_config = copy.deepcopy(model_config)
llm_model_config.pretrained_config = model_config.pretrained_config.text_config
self.llm = AutoModelForCausalLM.from_config(llm_model_config)
self.model_config = model_config
self.post_config()
def post_config(self):
self.config = self.llm.config
self.model_config.pretrained_config = self.llm.config
@property
def vocab_size_padded(self) -> int:
return self.llm.vocab_size_padded
def infer_max_seq_len(self) -> int:
return self.llm.infer_max_seq_len()
@torch.inference_mode()
def forward(
self,
attn_metadata: AttentionMetadata,
input_ids: Optional[torch.IntTensor] = None,
position_ids: Optional[torch.IntTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
return_context_logits: bool = False,
**kwargs,
) -> torch.Tensor:
num_context_requests = attn_metadata.num_contexts
multimodal_params = kwargs.get("multimodal_params", [])
mm_embeds = []
if len(multimodal_params) > 0 and not _is_disagg():
mm_embeds = get_multimodal_embeddings(
encoder_forward_fn=self.mm_encoder.forward,
multimodal_params=multimodal_params[:num_context_requests],
)
mm_embeds = find_input_mm_embeds(
mm_embeds, multimodal_params[:num_context_requests])
input_ids, inputs_embeds = fuse_input_embeds(
self.llm.model.embed_tokens, input_ids, mm_embeds, **kwargs)
return self.llm.forward(
attn_metadata=attn_metadata, input_ids=input_ids,
position_ids=position_ids, inputs_embeds=inputs_embeds,
return_context_logits=return_context_logits)
@property
def multimodal_data_device_paths(self) -> List[str]:
return ["image.pixel_values", "image.image_grid_thw", "multimodal_embedding"]
Required (every multimodal model):
forward takes multimodal_params via **kwargs. Never add pixel_values / image_grid_thw / attention_mask as direct args — they live in multimodal_params.multimodal_data.
- Encoder output length must match the input processor's MM placeholder count.
mm_encoder.forward must return a single tensor whose first dimension equals the total number of MM tokens (excluding special tokens) the input processor placed in prompt_token_ids. If lengths don't agree — or if the encoder returns a list with more than one element — get_multimodal_embeddings silently skips caching the embedding back into multimodal_data, and chunked prefill re-runs the encoder from scratch on every chunk.
Family-specific extras (apply only when relevant):
- mRoPE (Qwen-VL family): add
init_mrope_embedding(model_config) in __init__ to preallocate self.mrope_position_ids_padding_cuda, plus prepare_mrope_config(multimodal_params, num_context_requests) returning mrope_rotary_cos_sin. Pass through to self.llm.forward(..., mrope_config=...). Reference: Qwen3VLModelBase.prepare_mrope_config.
- Deepstack features (Qwen3-VL): split encoder output into
mm_embed + deepstack_embeds, call fuse_input_embeds(..., extra_embeds=deepstack_embeds), forward deepstack_embeds= into the LLM.
- HF wrapper without a clean
text_config: Qwen2-VL's Qwen2VLModelBase rewrites architectures to surface the inner LLM. Fall back to that pattern only when the multimodal HF config does not expose a text_config sub-config.
- Inner LLM that doesn't match HF's
text_config schema (Qwen3.5-MoE-VL → Qwen3Next). When the VLM's HF text_config schema differs from the TRT-LLM runtime model you want to reuse, write a config normalizer (e.g. _normalize_qwen35_moe_vl_config) that maps HF aliases to the runtime's expected names (mRoPE keys, intermediate_size aliases, quantization-exclude module paths). Wire it via lazy import from pyexecutor.config_utils.load_pretrained_config — the Mistral and Qwen3_5 branches are templates. Two gotchas: transformers 5.x's rope_scaling is a property aliasing rope_parameters — setting either silently overwrites the other, so the normalizer should mutate rope_parameters directly if the HF code still reads from it. And for VLMs, the normalizer must run on the composite config (with text_config / vision_config), not flattened away.
- Thin wrapper for runtime reuse. Even when the LM class body is identical to the runtime's existing class, still create a
@register_auto_model("YourArch")-decorated thin subclass — that's how weight-mapper dispatch picks the family-specific mapper. You can't stack two @register_auto_model decorators on a single shared class.
Phase 3 — Input processor + dummy builder
Subclass both BaseMultimodalInputProcessor (drives every real request) and BaseMultimodalDummyInputsBuilder (drives engine warmup / profiling — the base shrinks dummy image resolution until the synthetic prompt fits input_seq_len). Colocate in the modeling file. Reference: Qwen3VLInputProcessorBase.
__call__(inputs, sampling_params) does:
- Pull
text_prompt, mm_data, mm_processor_kwargs from inputs.
_preprocess(...) — HF processor produces pixel_values / pixel_values_videos / *_grid_thw / input_ids.
- Build
multimodal_data keyed by modality: {"image": {"pixel_values": ..., "image_grid_thw": ...}, "video": {...}}.
- Compute
mrope_config on CPU (.to("cpu").clone()) into multimodal_data["mrope_config"]. Required even on text-only Qwen-VL prompts — no branch.
_postprocess(input_ids) rewrites HF's image_token_id / video_token_id to tllm_multimodal_token_id = vocab_size + 1 (the OOV sentinel). Skip when mm_data is empty.
- Return
(prompt_token_ids_list, {"multimodal_data": multimodal_data}).
Optional overrides (refactor pending; skip unless needed): get_text_with_mm_placeholders(mm_counts) + expand_prompt_token_ids_for_mm(prompt_token_ids, num_mm_tokens, ...) enable the tokenized fast path. A cleaner replacement is being designed.
EPD override (if @support_multimodal_disaggregated): attach_multimodal_embeddings(inputs, multimodal_embedding, sampling_params) consumes encoder outputs in the prefill+decode worker.
Decorator stack — bottom-up application; register_vision_encoder requires register_auto_model to have run first:
@support_multimodal_disaggregated
@register_vision_encoder({Name}VisionModel, vlm_base_model=HFVisionTransformerClass)
@register_auto_model("{ArchName}ForConditionalGeneration")
@register_input_processor(
{Name}InputProcessor, model_type="{model_type}",
placeholder_metadata=MultimodalPlaceholderMetadata(
placeholder_map={"image": "<|vision_start|><|image_pad|><|vision_end|>", ...},
placeholder_placement=MultimodalPlaceholderPlacement.BEFORE_TEXT,
placeholders_separator="",
content_format=ContentFormat.STRING,
),
)
class {Name}Model(PreTrainedModel): ...
Phase 4 — Weight loading
def load_weights(self, weights, weight_mapper):
if not _is_disagg():
self.mm_encoder.load_weights(weights)
if hasattr(weights, "mark_consumed"):
weights.mark_consumed("vision_model")
weight_mapper = {Name}HfWeightMapper()
weight_mapper.init_model_and_config(self.llm, self.model_config)
filtered = {k: v for k, v in weights.items() if not k.startswith("model.visual.")}
self.llm.load_weights(filtered, weight_mapper)
if hasattr(weights, "mark_consumed"):
weights.mark_consumed("language_model")
Inherit from a related mapper for prefix surgery — don't write a one-off translator.
Phase 5 — Tests
Per-model unit test (the main one) at tests/unittest/_torch/modeling/test_modeling_<name>.py. Subclass TestModelingMultimodal from tests/unittest/_torch/modeling/test_modeling_multimodal.py (an abstract unittest.TestCase) and implement six abstract methods: get_model_config, get_trtllm_model_class, get_hf_model_class, get_weight_mapper_class, get_model_type, get_model_config_class. The base class drives a MultimodalScenario-parameterized run (modality ∈ image / multiple_image / video / text / mixture_text_image / audio, with optional use_cuda_graph / chunked_prefill / kv_cache_reuse) — comparing TRT-LLM logits to HF reference, exercising the KV cache manager, attn metadata, mrope, fusion path, and CUDA graph capture in one harness. Override get_scenarios to declare which combinations apply to your model. Reference: test_modeling_qwen3vl.py, test_modeling_qwen2_5vl.py, test_modeling_nemotron_nano_v2_vl.py. Test data lives under ${LLM_MODELS_ROOT}/multimodals/test_data/.
Hybrid linear-attention models. Override _dummy_request_kwargs to return {"use_mrope": True} if the model uses mRoPE (allocates the 3-D position-id buffer at dummy-request time). The base class's init_kv_cache_manager already dispatches on is_qwen3_hybrid / is_nemotron_hybrid to build CppMambaHybridCacheManager — don't override unless you need a different concrete manager. Use PyKvCacheConfig from llmapi.llm_args (Pydantic), not the C++ bindings KvCacheConfig — CppMambaHybridCacheManager.__init__ reads mamba_state_cache_interval which only exists on the Pydantic side. CUDA-graph capture in the harness doesn't currently address the Mamba SSM state buffer — keep use_cuda_graph=False in get_scenarios for hybrid models until that's wired through; production CUDA-graph support is independent and unaffected.
Synthetic-config shape couplings. head_dim × partial_rotary_factor / 2 == sum(mrope_section) — head_dim can't be shrunk independently. If the test loads the real tokenizer via _name_or_path, vocab_size must equal the real tokenizer's vocab — otherwise chat-template specials at ids >= your synthetic vocab_size get misclassified as mm tokens by fuse_input_embeds's OOV filter (manifests as "found N image tokens but received M image embeddings", off by exactly the number of chat-template specials). Vision deepstack indices [i, j, k] require depth > k — the HF processor reserves placeholder tokens for deepstack outputs regardless of whether the encoder is configured to emit them.
Two-config Approach B for tests. If you've added a config normalizer (Phase 2), keep self.hf_config raw and route a deepcopy + normalize only through create_trtllm_model. Reusing one normalized config for both HF and TRT-LLM construction trips the transformers 5.x property aliasing and silently corrupts HF-side schema (rope_scaling ↔ rope_parameters).
Tolerance band. Default get_tolerance returns 0.4 / 0.4, calibrated to pass for the existing VLM tests but wide enough to mask argmax-changing bugs. After your test passes, dial it tighter — keep atol = 0.4 to absorb single-logit tail outliers seen on multiple_image / video scenarios; tighten rtol toward 0.1 to gate bulk-of-logits relative agreement. Don't drop rtol below 0.05 without cross-SKU validation.
Supplemental utility tests (only if your model exercises new logic in shared utilities) under tests/unittest/_torch/multimodal/: test_fuse_input_embeds.py, test_multimodal_runtime.py, test_find_num_image_tokens.py, test_external_embedding.py, test_share_multiparams.py, test_mm_encoder_standalone.py. Extend the existing tests rather than creating new files when the coverage is generic.
Accuracy test at tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py — subclass LlmapiAccuracyTestHarness, set MODEL_NAME / MODEL_PATH / MAX_NUM_TOKENS=16384, run MMMU (or ChartQA/ScienceQA). Reference: TestQwen2_5_VL_7B. Wire into tests/integration/test_lists/test-db/l0_<gpu>.yml.
- First-run reference capture. Set
TRTLLM_ACCURACY_NO_REFERENCE=1 for the first local run; the harness synthesizes a baseline reference (0 for higher-is-better metrics like MMMU), runs end-to-end, and prints the achieved accuracy. Paste the printed value verbatim into tests/integration/defs/accuracy/references/mmmu.yaml — that's the measured reference; the threshold derives from it via sigma / alpha / beta.
quant_algo assertion in test_fp8_prequantized must match what the checkpoint actually advertises. Flat per-tensor FP8 is QuantAlgo.FP8; block-scaled FP8 (DeepSeek-V3 / Qwen3.5 style) is QuantAlgo.FP8_BLOCK_SCALES. Same applies to NVFP4 variants. Easy to copy from a peer model and assert the wrong one.
Be parsimonious. The cartesian product modality × use_cuda_graph × chunked_prefill × kv_cache_reuse explodes fast. In get_scenarios(), pick the smallest set covering this model's distinctive paths — e.g. one image, one mixture_text_image, plus one chunked-prefill / one cuda-graph entry only if the model claims those features. One accuracy benchmark per model (MMMU for image VLMs); add another only for capabilities the first doesn't exercise (audio, video, very long context).
Phase 6 — Docs + serve verification
docs/source/models/supported-models.md:
- Supported Models table: row alphabetical by architecture class.
- Multimodal Feature Support Matrix (PyTorch Backend): row with columns Overlap Scheduler / CUDA Graph / Chunked Prefill / Torch Sampler / TLLM C++ Sampler / KV Cache Reuse / Logits Post Processor / EPD Disaggregated Serving / Modality (L+I+V+A). Mark
Yes only what you've verified.
First line of defense — quickstart smoke test. Before bringing up a server, run the bundled quickstart against your model:
python examples/llm-api/quickstart_multimodal.py \
--model_dir <hf_model_id> --modality image \
--media <url-or-path>
It exercises setup_llm + default_multimodal_input_loader + the chat template + LLM.generate end-to-end with a couple of bundled prompts. Cheaper than spinning up trtllm-serve and fails fast on input-processor / encoder / fusion bugs. Run for every modality your model supports (--modality image|video|audio|image_audio|...).
Then aggregated serving:
trtllm-serve <hf_model_id> --backend pytorch --max_num_tokens 16384 --port 8000
Send a chat completion with a real image; confirm coherent output. (TODO: 2ez4bz to provide ready-to-use curl examples.)
Chunked-prefill cache verification (mandatory). Re-run with a deliberately small --max_num_tokens to force the prefill of one image to span multiple chunks, then grep the server log for these two lines:
Multimodal hashing failed: → the input processor's hash path fell back; KV-cache reuse across requests with the same image is broken (Contract 3 hash invariant).
Multimodal runtime data missing or incomplete, will not cache embeddings. → the encoder-output cache is being skipped, so the encoder is being re-run on every chunked-prefill iteration of the same request (Phase 2 Required: encoder output length must match MM placeholder count).
A clean serving log shows neither line. If either appears, fix it before declaring the model done — these are silent perf cliffs, not crashes.
For EPD: run MultimodalEncoder and LLM as separate process groups; verify embeddings cross via disaggregated_params.multimodal_embedding_handles.
Phase 7 — Pull request
Follow CONTRIBUTING.md. Title [JIRA/NVBUG/None][type] description, git commit -s. Body: one full multimodal prompt → output verbatim, reproduction commands, pytest output verbatim. Trigger CI via /bot run.
Pre-PR checklist
Architecture & registration
Module reuse
Input processor
Performance contracts
Tests & docs