adding-model-support
// Guide for adding support for new LLM or VLM models in Megatron-Bridge. Covers bridge, provider, recipe, tests, docs, and examples.
| name | adding-model-support |
| description | Guide for adding support for new LLM or VLM models in Megatron-Bridge. Covers bridge, provider, recipe, tests, docs, and examples. |
| when_to_use | User asks to add, onboard, or integrate a new model family; 'add Qwen4 support', 'onboard Llama 5', 'create a bridge for X', 'write a recipe for Y'. |
Ask the user for the HuggingFace model link (e.g. https://huggingface.co/Qwen/Qwen3.5-VL-27B).
If the model is not public, ask the user to provide the config.json file directly.
Read the model's config.json from HuggingFace (or from the user-provided file). Key fields to extract:
- model_type: used for @register_bridge(model_type=...)
- architectures: the HF model class name (used for source=... in registration)
- tie_word_embeddings: critical for weight tying
- num_hidden_layers, hidden_size, intermediate_size, num_attention_heads, num_key_value_heads, vocab_size, max_position_embeddings, rope_theta, etc.
- num_local_experts, num_experts_per_tok, moe_intermediate_size (MoE models)
- q_lora_rank, kv_lora_rank, qk_nope_head_dim, qk_rope_head_dim (MLA models)

If there are config fields you don't recognize from previously supported models (check CONFIG_MAPPING in model_bridge.py and existing bridges), this likely indicates a new architectural block (e.g., a novel attention variant, custom normalization, or a new layer type). Ask the user to provide the HuggingFace modeling_*.py implementation of that block so you can understand the computation and create the correct Megatron-side mapping or custom module.
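The field extraction above can be sketched with the standard library alone. This is a minimal illustration, not part of Megatron-Bridge: `KNOWN_FIELDS` simply repeats the names listed above (it is not the real CONFIG_MAPPING), `summarize_hf_config` is a hypothetical helper, and the example values are made up.

```python
import json

# Names copied from the list above; this is NOT the real CONFIG_MAPPING.
KNOWN_FIELDS = {
    "model_type", "architectures", "tie_word_embeddings",
    "num_hidden_layers", "hidden_size", "intermediate_size",
    "num_attention_heads", "num_key_value_heads", "vocab_size",
    "max_position_embeddings", "rope_theta",
    "num_local_experts", "num_experts_per_tok", "moe_intermediate_size",
    "q_lora_rank", "kv_lora_rank", "qk_nope_head_dim", "qk_rope_head_dim",
}


def summarize_hf_config(config: dict) -> dict:
    """Split a parsed config.json into listed fields and everything else."""
    known = {key: config[key] for key in sorted(KNOWN_FIELDS) if key in config}
    unrecognized = sorted(key for key in config if key not in KNOWN_FIELDS)
    return {"known": known, "unrecognized": unrecognized}


# Hypothetical config contents, parsed the same way a config.json file would be.
example = json.loads(
    '{"model_type": "qwen2", "hidden_size": 896, '
    '"tie_word_embeddings": true, "sliding_window": 4096}'
)
summary = summarize_hf_config(example)
print(summary["known"])
print(summary["unrecognized"])
```

Keys reported as unrecognized are only candidates for review: many will be routine bookkeeping fields, and the real check is against CONFIG_MAPPING and the existing bridges.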
VLM (Vision-Language Model) if config.json contains:
- text_config AND vision_config sub-configs

LLM (Text-only) if:
- No text_config / vision_config sub-configs

This distinction affects:
- Where config fields are read from (text_config vs top-level for VLMs)

Inspect the HF checkpoint's model.safetensors (or model.safetensors.index.json) for quantized
weight dtypes such as float8_e4m3fn (FP8) or uint8/uint4 with accompanying *_scale_inv or
*_scale tensors. Common signs:
- config.json mentions quantization_config or dtype fields like "torch_dtype": "float8_e4m3fn"
- weight_scale_inv keys alongside the main weight keys

Why this matters: The bridge's import_ckpt path does not automatically dequantize; it
loads raw quantized values as-is. This produces a silently broken model (random-level loss, huge
grad norms) instead of raising an error.
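The signs listed above can be checked mechanically before conversion. The sketch below assumes the usual layout of model.safetensors.index.json (a `weight_map` dict from tensor name to shard file); `find_quantization_signs` is a hypothetical helper, and the suffix match is a heuristic that may flag unrelated tensors whose names happen to end in `_scale`.

```python
SCALE_SUFFIXES = ("_scale_inv", "_scale")


def find_quantization_signs(config: dict, index: dict) -> list:
    """Collect human-readable hints that a checkpoint stores quantized weights."""
    signs = []
    if "quantization_config" in config:
        signs.append("config.json has quantization_config")
    dtype = str(config.get("torch_dtype", ""))
    if "float8" in dtype:
        signs.append(f"torch_dtype is {dtype}")
    # Scale tensors sit next to the main weights in the shard index.
    scale_keys = [k for k in index.get("weight_map", {}) if k.endswith(SCALE_SUFFIXES)]
    if scale_keys:
        signs.append(f"{len(scale_keys)} scale tensor(s), e.g. {scale_keys[0]}")
    return signs


# Hypothetical inputs standing in for the parsed JSON files.
fp8_config = {"torch_dtype": "float8_e4m3fn", "quantization_config": {"quant_method": "fp8"}}
fp8_index = {"weight_map": {"layer.weight": "model-00001.safetensors",
                            "layer.weight_scale_inv": "model-00001.safetensors"}}
for sign in find_quantization_signs(fp8_config, fp8_index):
    print(sign)
```

An empty result does not prove the checkpoint is unquantized; it only means none of these particular hints were found.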
Fix: Dequantize before conversion. Two approaches:
1. Standalone script (recommended for user-facing models): Write a
   dequant_fp8_for_bridge.py in the model's examples folder.
   Reference: examples/models/mistral/ministral3/dequant_fp8_for_bridge.py.
   The pattern is: w_bf16 = fp8_weight.to(bfloat16) * weight_scale_inv.
2. In-bridge hook: Override maybe_modify_loaded_hf_weight() in the bridge class to
   dequantize on the fly during import:
def maybe_modify_loaded_hf_weight(self, hf_param, hf_state_dict):
    weight = hf_state_dict[hf_param]
    scale_key = hf_param + "_scale_inv"
    if weight.dtype == torch.float8_e4m3fn and scale_key in hf_state_dict:
        return weight.to(torch.bfloat16) * hf_state_dict[scale_key].to(torch.bfloat16)
    return weight
Always add a sanity check in the verification workflow (e.g., print std of a weight tensor —
quantized models typically have std ≈ 13 before dequantization vs std ≈ 0.006 after).
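A minimal sketch of that sanity check, written against plain Python floats so it runs anywhere. The 1.0 threshold is an assumption picked to sit between the two magnitudes quoted above (≈ 13 and ≈ 0.006), `looks_undequantized` is a hypothetical helper, and the sample values are invented; with real tensors the same comparison would be made on the tensor's standard deviation.

```python
import statistics


def looks_undequantized(values, max_expected_std=1.0):
    """Return True when the spread of weight values is suspiciously large.

    max_expected_std is an assumed threshold, not a value from Megatron-Bridge.
    """
    return statistics.pstdev(values) > max_expected_std


# Invented samples: raw quantized values vs. values after dequantization.
raw_values = [-20.0, 5.0, 13.0, -8.0, 0.5]
dequantized_values = [0.004, -0.007, 0.001, -0.003, 0.006]

print("raw looks undequantized:", looks_undequantized(raw_values))
print("dequantized looks undequantized:", looks_undequantized(dequantized_values))
```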
LLM — Reference: Qwen2 (src/megatron/bridge/models/qwen/qwen2_bridge.py)
src/megatron/bridge/models/<model>/
├── __init__.py
├── <model>_bridge.py # Config + weight mappings (no provider file needed)
└── modeling_<model>/ # (optional) Custom nn.Module implementations if needed
    └── ...
VLM — Reference: Qwen3.5-VL (src/megatron/bridge/models/qwen_vl/)
src/megatron/bridge/models/<model>/
├── __init__.py
├── <model>_bridge.py # Config + weight mappings
├── <model>_provider.py # Only for VLMs that need custom provide()
└── modeling_<model>/ # If using Megatron vision encoder
    ├── __init__.py
    └── model.py # Combines vision + language
OR with HF vision encoder (Reference: Gemma3-VL):
src/megatron/bridge/models/<model>/
├── __init__.py
├── <model>_bridge.py
├── <model>_provider.py # Only for VLMs that need custom provide()
└── modeling_<model>.py # HF vision + Megatron language wrapper
Model-specific modeling code: If the model requires custom nn.Module implementations
(e.g. a custom RoPE variant, non-standard transformer config, custom thinker/talker
architecture), place them in a modeling_<model>/ directory or a single modeling_<model>.py
file inside the model family folder. Use a directory when there are multiple files (model,
transformer config, custom ops); use a single file when one module suffices. Never put
model-specific modeling code in shared directories or as loose files in the bridge family
directory — keep them namespaced under the modeling_<model> prefix.
LLM:
- Bridge file only, implementing provider_bridge() and mapping_registry().
The bridge calls super().provider_bridge() to get a GPTModelProvider from CONFIG_MAPPING,
then sets model-specific attributes on it. Do not create a provider file — the stock
provider returned by super().provider_bridge() is usually sufficient for LLMs
(e.g., GPTModelProvider, or another base provider selected via PROVIDER_CLASS).
Do not add size-specific provider classes whose names combine
ModelProvider with a model-size suffix. Examples of forbidden suffixes
include 7B, 200M, and A3B. Model size and architecture fields should
come from the Hugging Face config through AutoBridge /
MegatronModelBridge config mapping. If a recipe needs a fixed
architecture, configure the base provider inside the recipe function instead
of exporting a provider subclass.

VLM:
- Only VLMs that need a custom provide() to instantiate a
  combined vision+language model need a provider subclass. The bridge manually calls
  hf_config_to_provider_kwargs(text_config) and instantiates the custom provider.

For detailed patterns, see:

tie_word_embeddings for VLMs

For VLMs, tie_word_embeddings lives on the top-level HF config, NOT on text_config. Always read from the parent config:
provider.share_embeddings_and_output_weights = getattr(hf_config, "tie_word_embeddings", False)
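The pitfall behind the line above can be reproduced without any HF dependency. In this sketch `SimpleNamespace` objects stand in for a VLM config and its sub-configs (an assumption for illustration only), and `read_tie_word_embeddings` is a hypothetical helper. Reading the flag from `text_config` silently falls back to the default instead of failing.

```python
from types import SimpleNamespace


def read_tie_word_embeddings(hf_config) -> bool:
    """Read the flag from the parent (top-level) config, as recommended above."""
    return bool(getattr(hf_config, "tie_word_embeddings", False))


# Stand-in for a VLM config: the flag exists only at the top level.
hf_config = SimpleNamespace(
    tie_word_embeddings=True,
    text_config=SimpleNamespace(num_hidden_layers=2, hidden_size=64),
    vision_config=SimpleNamespace(hidden_size=32),
)

correct = read_tie_word_embeddings(hf_config)
# Wrong source: text_config has no such field here, so the default wins.
wrong = read_tie_word_embeddings(hf_config.text_config)
print("from top-level:", correct, "| from text_config:", wrong)
```

With a real HF config the sub-config may carry its own default for this field, so the wrong read can return a plausible but incorrect value rather than an obviously missing one.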
When reading HF config for VLMs, check whether each field is in:
- hf_config (top-level): e.g. tie_word_embeddings, image_token_id, video_token_id
- hf_config.text_config: e.g. num_hidden_layers, hidden_size, etc.
- hf_config.vision_config: e.g. vision encoder dimensions

When a new model introduces custom or non-standard layers (novel attention variants, custom
normalization, fused expert layouts, MTP heads, etc.), keep all model-specific logic inside
the model family directory. Do not modify shared files in src/megatron/bridge/models/conversion/
(e.g. param_mapping.py, model_bridge.py, quant_mapping.py) unless the change is genuinely
reusable across multiple model families.
Principle: The bridge and provider files for a model family are your primary extension surface. Shared conversion infrastructure provides hooks and base classes — subclass them locally rather than adding conditionals to shared code.
If the model has a layer whose weight layout doesn't match any existing mapping class, create a
private mapping class in the bridge file or a <model>_mappings.py file in the family directory.
Example — GLM's fused expert down-projection disables grouped-export transpose:
# src/megatron/bridge/models/glm/glm_moe_mappings.py
class GLMExpertDownProjMapping(FusedExpertMapping):
    def __init__(self, megatron_param, hf_param, permute_dims=None):
        super().__init__(megatron_param, hf_param, permute_dims, transpose_on_export=False)
Example — Nemotron-H's MTP layers flatten indices during resolve:
# Inside nemotron_h_bridge.py (private to the module)
class _MTPFlatteningMapping(MegatronParamMapping):
    def resolve(self, captures):
        return AutoMapping(self._flatten(captures), ...)
Example — MiniMax-M2's non-standard QK norm layout:
# Inside minimax_m2_bridge.py (private to the module)
class _FullDimQKNormMapping(MegatronParamMapping):
    def hf_to_megatron(self, hf_weights):
        # Custom scatter logic for full-dim QK norm
        ...

    def megatron_to_hf(self, megatron_weights):
        # Custom gather logic
        ...
MegatronModelBridge provides several override hooks — use them instead of modifying the base class:
| Hook | When to use |
|---|---|
| mapping_registry() | Define all weight name mappings (abstract, always overridden) |
| provider_bridge() | Configure the provider with model-specific flags (call super() then setattr) |
| maybe_modify_loaded_hf_weight() | Dequantize, rename, or reshape HF weights before conversion |
| maybe_modify_converted_hf_weight() | Synthesize extra HF keys on export (e.g. inv_freq) |
| megatron_to_hf_config() | Build HF config.json for export |
| hf_config_to_provider_kwargs() | Override CONFIG_MAPPING behavior for specific fields |
Accessing HF config in mapping_registry(): The bridge instance has self.hf_config
available during conversion — it is set automatically by the dispatch system before
mapping_registry() is called. Use it when your mapping registry needs config-dependent
logic (e.g. dynamic MTP layer count, number of experts):
def mapping_registry(self) -> MegatronMappingRegistry:
    hf_config = getattr(self, "hf_config", None)
    num_mtp_layers = getattr(hf_config, "num_nextn_predict_layers", 0) if hf_config else 0
    ...
Do not override build_conversion_tasks() to stash self._hf_config — that pattern is
deprecated.
Most models do not need a provider file — the stock provider (e.g., GPTModelProvider, or
another base selected via PROVIDER_CLASS) is usually sufficient for LLMs. Only create a provider subclass when a VLM needs custom provide() logic to instantiate
a combined vision+language model:
# src/megatron/bridge/models/<model>/<model>_provider.py
class MyVLModelProvider(GPTModelProvider):
    image_token_id: int = 0

    # Signature elided in the original; match the base class's provide() in real code.
    def provide(self, *args, **kwargs):
        # Custom model construction combining vision encoder + language decoder
        ...
The bridge then references it via PROVIDER_CLASS = MyVLModelProvider or instantiates it directly
in provider_bridge().
Modify param_mapping.py or model_bridge.py only when the pattern is reusable by 2+ model
families. Examples of justified shared changes:
- FusedExpertMapping / FusedGatedExpertMapping: used by GLM, DeepSeek, OLMoE, etc.
- RMSNorm2ZeroCenteredRMSNormMapping: used by Gemma, Nemotron, etc.
- CONFIG_MAPPING entries: when a standard HF config key maps to a standard provider attribute

If you're tempted to add a model-specific if model_type == "..." branch in shared code, or
pattern-matching on specific weight names in shared conversion logic, that's a signal to use a
local subclass or hook override instead.
If the model introduces a new computational block that differs from standard attention or MLP
(e.g., Gated DeltaNet / GDN linear attention, Multi-Token Prediction / MTP heads, Mamba SSM layers),
update the FLOPs calculator in src/megatron/bridge/training/utils/flop_utils.py so that
training throughput metrics (TFLOPs/GPU) are accurate.
When to update: Any time the new block has different FLOPs-per-token than standard self-attention or standard MLP. Common cases:
- Replace the standard O(s²) attention term with the
  block's actual operation count

How to update:
1. Read the transformer_flops() function in flop_utils.py to understand the structure.
2. Add handling gated on the new block's config fields (e.g. experimental_attention_variant,
   mtp_num_layers). Follow the existing MoE pattern for config validation: raise on invalid
   types, assert list lengths, and use direct attribute access instead of getattr with fallback
   defaults so that misconfigurations fail explicitly.
3. Add unit tests in tests/unit_tests/training/utils/test_flop_utils.py that verify:
   - Changing the layer pattern (e.g. linear_attention_freq) changes FLOPs

Reference PR: #2925 (GDN FLOPs calculator) adds GDN support with both the calculator code and comprehensive tests.
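The validation style described above (raise on invalid types, assert list lengths, direct attribute access) might look like the following sketch. It is not the code in flop_utils.py. The meaning given to `linear_attention_freq` here is an assumption: an int N means every N-th layer keeps full attention, and a list is a per-layer 0/1 pattern. `linear_flops_per_layer` is a hypothetical caller-supplied count, and the full-attention term covers only the forward QKᵀ and attention-times-V matmuls for one sequence.

```python
from types import SimpleNamespace


def linear_attention_mask(cfg) -> list:
    """Per-layer 0/1 list; 1 marks a linear-attention layer (assumed semantics)."""
    # Direct attribute access on purpose: a missing field raises AttributeError.
    freq = cfg.linear_attention_freq
    num_layers = cfg.num_layers
    if isinstance(freq, bool) or not isinstance(freq, (int, list)):
        raise TypeError(f"linear_attention_freq must be int or list, got {type(freq).__name__}")
    if isinstance(freq, int):
        if freq < 1:
            raise ValueError("linear_attention_freq must be >= 1")
        # Assumption: every freq-th layer keeps full attention, the rest are linear.
        return [0 if (i + 1) % freq == 0 else 1 for i in range(num_layers)]
    assert len(freq) == num_layers, "pattern length must equal num_layers"
    return list(freq)


def attention_core_flops(cfg, linear_flops_per_layer: int) -> int:
    """Sum attention-core FLOPs over layers for one sequence, forward pass only."""
    s, h = cfg.seq_length, cfg.hidden_size
    full = 4 * s * s * h  # QK^T (2*s*s*h) plus attention-times-V (2*s*s*h)
    mask = linear_attention_mask(cfg)
    return sum(linear_flops_per_layer if is_linear else full for is_linear in mask)


# Hypothetical configs that differ only in the layer pattern.
cfg_a = SimpleNamespace(num_layers=8, hidden_size=16, seq_length=32, linear_attention_freq=4)
cfg_b = SimpleNamespace(num_layers=8, hidden_size=16, seq_length=32, linear_attention_freq=2)
flops_a = attention_core_flops(cfg_a, linear_flops_per_layer=1000)
flops_b = attention_core_flops(cfg_b, linear_flops_per_layer=1000)
print(flops_a, flops_b)
```

The final comparison mirrors the kind of unit test the section asks for: changing only the layer pattern should change the reported FLOPs.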
Recipes provide pre-configured training settings for each model size.
LLM recipes: src/megatron/bridge/recipes/<family>/<model>.py
VLM recipes: src/megatron/bridge/recipes/<family>/<model>.py
Each recipe file defines functions for each model size + training mode:
- <model>_<size>_sft_config(): Full supervised fine-tuning
- <model>_<size>_peft_config(): LoRA/DoRA parameter-efficient fine-tuning
- <model>_<size>_pretrain_config(): Pretraining (LLM only, usually)

For detailed recipe patterns, see @skills/adding-model-support/recipe-patterns.md.
Recipes are the right API surface for model-size presets. Do not create or
export size-specific provider subclasses for recipes; either call
AutoBridge.from_hf_pretrained(...).to_megatron_provider(load_weights=False) to
derive the provider from HF config, or instantiate the base provider class with
explicit architecture fields inside the recipe function.
- __init__.py: import and add to __all__
- src/megatron/bridge/recipes/__init__.py: wildcard import
- train_any_basic.py: add to config_map, docstring, and --model choices

tests/unit_tests/models/<model>/
├── __init__.py
├── test_<model>_bridge.py # Mock HF config → verify provider mapping
└── test_<model>_provider.py # (optional) Only if custom provider subclass exists
tests/functional_tests/models/<model>/
├── __init__.py
├── test_<model>_conversion.py # Toy model HF↔Megatron roundtrip
└── test_<model>_provider.py # compare_provider_configs (optional)
For detailed test patterns, see @skills/adding-model-support/tests-and-examples.md.
Model examples: examples/models/<family>/<model>/
examples/models/<family>/<model>/
├── README.md
├── conversion.sh # HF↔Megatron conversion commands (real model)
├── inference.sh # Generation commands (real model, reasonable output)
├── slurm_sft.sh # SFT training on SLURM
└── slurm_peft.sh # PEFT training on SLURM
Key deliverable requirement: conversion.sh and inference.sh must target a real published model (e.g. Qwen/Qwen3-8B, not a toy). The inference script must produce reasonable output — for LLMs a coherent text continuation, for VLMs a plausible image description. This is the acceptance bar: conversion runs cleanly and generation makes sense.
Add a model page at docs/models/<type>/<model>.md covering:
After implementing bridge support, prompt the user to run these commands on the cluster:
uv run python -c "
from megatron.bridge import AutoBridge
bridge = AutoBridge.from_hf_pretrained('<org>/<model>')
provider = bridge.to_megatron_provider()
provider.tensor_model_parallel_size = 1
provider.pipeline_model_parallel_size = 1
provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)
bridge.load_hf_weights(model)
for i, (name, tensor) in enumerate(bridge.export_hf_weights(model, cpu=True)):
    print(name, tuple(tensor.shape))
    if i > 10: break
"
uv run python examples/conversion/convert_checkpoints.py import \
--hf-model <org>/<model> \
--megatron-path /workspace/<model> \
--torch-dtype bfloat16
uv run python examples/conversion/convert_checkpoints.py export \
--hf-model <org>/<model> \
--megatron-path /workspace/<model>/iter_0000000 \
--hf-path /workspace/<model>-hf-export
For LLMs:
uv run python examples/conversion/hf_to_megatron_generate_text.py \
--hf_model_path <org>/<model> --prompt "Hello"
For VLMs:
uv run python examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path <org>/<model> \
--image_path "https://example.com/image.jpeg" \
--prompt "Describe this image."
uv run python -m pytest tests/unit_tests/models/<model>/ -v
uv run python -m pytest tests/functional_tests/models/<model>/ -v --run-gpu
User wants to add a model
│
├─ Has HF link? ─── No ──→ Ask for link (or config.json if private)
│
├─ Has text_config + vision_config? ─── Yes ──→ VLM path
│ ├─ Has Megatron vision encoder? ──→ Megatron encoder (Qwen3.5 pattern)
│ └─ No Megatron encoder ──→ HF encoder (Gemma3 pattern)
│
└─ No vision config ──→ LLM path (bridge only, no provider file)
├─ Standard GPT-style? ──→ Bridge with stock mappings
└─ Custom layers? ──→ Bridge + local mapping subclasses / hook overrides
├─ Custom weight layout? ──→ Local mapping subclass in family dir
└─ Custom import/export? ──→ Override bridge hooks (maybe_modify_*)
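The decision tree above can also be read as a small function. This is only an illustrative sketch: `choose_path` and its two boolean arguments are hypothetical, and in practice those answers come from inspecting the model rather than from flags.

```python
def choose_path(config: dict,
                has_megatron_vision_encoder: bool = False,
                custom_layers: bool = False) -> str:
    """Map a parsed config.json plus two manual answers onto a branch of the tree."""
    if "text_config" in config and "vision_config" in config:
        if has_megatron_vision_encoder:
            return "VLM: Megatron encoder (Qwen3.5 pattern)"
        return "VLM: HF encoder (Gemma3 pattern)"
    if custom_layers:
        return "LLM: bridge + local mapping subclasses / hook overrides"
    return "LLM: bridge with stock mappings"


# Hypothetical configs: one with both sub-configs, one without.
vlm_choice = choose_path({"text_config": {}, "vision_config": {}})
llm_choice = choose_path({"model_type": "qwen2"}, custom_layers=True)
print(vlm_choice)
print(llm_choice)
```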