nemo-mbridge-perf-moe-vlm-training
Practical guidance for training MoE VLMs in Megatron Bridge. Compares FSDP and 3D-parallel approaches, using rounded lessons from Qwen3-VL, Qwen3-Next, and other multimodal experiments.
| name | nemo-mbridge-perf-moe-vlm-training |
| description | Practical guidance for training MoE VLMs in Megatron Bridge. Compares FSDP and 3D-parallel approaches, using rounded lessons from Qwen3-VL, Qwen3-Next, and other multimodal experiments. |
| license | Apache-2.0 |
| when_to_use | Training MoE VLMs, or investigating a commit that caused MoE VLM training failure or OOM; 'MoE VLM', 'multimodal MoE', 'Qwen3-VL training', 'FSDP vs 3D-parallel for VLM', 'MoE vision language model'. |
Stable docs: @docs/training/moe-optimization.md
Card: @skills/nemo-mbridge-perf-moe-vlm-training/card.yaml
| Approach | Strength | Best fit |
|---|---|---|
| FSDP | Simplest path to a working multimodal run | first bring-up, memory-first tuning, awkward PP boundaries |
| 3D parallel | Higher ceiling after tuning | stable models with a clean PP layout and time for deeper sweeps |
For MoE VLMs, the practical workflow is usually:
The main patterns were consistent across the tracker:
Mock-data VLM runs are not trustworthy performance proxies. In the experiments, image-free mock runs looked closer to "roughly twice as fast" than "slightly optimistic" when compared with real multimodal input.
Use real or realistic image payloads before drawing any conclusion about VLM throughput.
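One way to make the mock-versus-real gap explicit is to compute how optimistic the mock run is before trusting it. The sketch below is a minimal illustration: the function names, the 10% tolerance, and the step times are hypothetical and not taken from the experiments.

```python
def mock_optimism_factor(mock_step_s: float, real_step_s: float) -> float:
    """Return how many times faster the mock-data run looks than the real-data run."""
    if mock_step_s <= 0 or real_step_s <= 0:
        raise ValueError("step times must be positive")
    return real_step_s / mock_step_s


def is_trustworthy_proxy(mock_step_s: float, real_step_s: float,
                         tolerance: float = 0.10) -> bool:
    """Treat the mock run as a usable proxy only if it is within `tolerance` of real data."""
    return abs(mock_optimism_factor(mock_step_s, real_step_s) - 1.0) <= tolerance


# Illustrative numbers only: a mock step of 1.0 s against a real-data step of 2.0 s.
factor = mock_optimism_factor(1.0, 2.0)
print(factor)                          # 2.0
print(is_trustworthy_proxy(1.0, 2.0))  # False
```

A factor near 2 matches the "roughly twice as fast" pattern described above and means the mock number should not be reported as VLM throughput.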
The smaller Qwen3.5-style multimodal experiments reinforce the same lessons:
- Freeze the vision stack when appropriate: if the work is decoder-focused, freezing the vision side often gives a small but real throughput gain and reduces memory pressure.
- Sweep MBS aggressively: VLMs are more MBS-sensitive than text-only MoE runs because the vision path changes the compute-to-overhead balance.
- Prefer selective recompute once the model fits: full recompute is a useful bring-up tool, but selective recompute is usually the better steady state.
- Match CUDA-graph scope to the workload: `attn moe_router moe_preprocess` is the safer MoE default, while narrower scopes can still be useful for controlled experiments.
- Use ETP only when EP alone is insufficient: it can unlock a layout, but it also introduces more communication and more tuning surface.
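The MBS sweep can be enumerated up front rather than guessed. The sketch below assumes the usual relation global batch size = MBS x data-parallel size x number of microbatches; the function name and the example sizes are hypothetical.

```python
def mbs_sweep_candidates(global_batch_size: int, data_parallel_size: int,
                         max_mbs: int) -> list[tuple[int, int]]:
    """Return (mbs, num_microbatches) pairs with GBS = MBS * DP * num_microbatches."""
    if global_batch_size % data_parallel_size != 0:
        raise ValueError("global batch size must be divisible by data-parallel size")
    per_rank = global_batch_size // data_parallel_size
    # Keep only micro-batch sizes that divide the per-rank batch evenly.
    return [(mbs, per_rank // mbs)
            for mbs in range(1, max_mbs + 1)
            if per_rank % mbs == 0]


# Illustrative sizes: GBS=512 on 64 data-parallel ranks, MBS capped at 8.
print(mbs_sweep_candidates(512, 64, 8))
# [(1, 8), (2, 4), (4, 2), (8, 1)]
```

Each pair keeps the global batch size fixed, so step-time differences across the sweep reflect the MBS choice rather than a change in work per step.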
- TP=1 CP=1 PP=1
- EP sized to the expert topology, often large
- Dispatcher: HybridEP on GB200-class systems
- Recompute: start with full, then relax toward selective recompute

- TP=1 CP=1 PP=1 or modest PP
- EP and ETP sized to the expert topology
- Dispatcher: HybridEP
- CUDA Graph: start narrow, then widen only after the real-data path is stable
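Before launching either starting layout, it can help to check that the parallel sizes actually tile the job. The sketch below is a standalone check, not a Megatron Bridge API: the `Layout` class and its field names are hypothetical, the world size of 64 is illustrative, and the divisibility rules are an assumption that the dense side uses TP x CP x PP and the expert side uses ETP x EP x PP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical container for parallel sizes; not a Megatron Bridge class."""
    world_size: int
    tp: int = 1
    cp: int = 1
    pp: int = 1
    ep: int = 1
    etp: int = 1

    def dense_dp(self) -> int:
        """Data-parallel size left for the dense (attention) layers."""
        denom = self.tp * self.cp * self.pp
        if self.world_size % denom != 0:
            raise ValueError(f"world_size {self.world_size} not divisible by TP*CP*PP={denom}")
        return self.world_size // denom

    def expert_dp(self) -> int:
        """Data-parallel size left for the expert layers."""
        denom = self.etp * self.ep * self.pp
        if self.world_size % denom != 0:
            raise ValueError(f"world_size {self.world_size} not divisible by ETP*EP*PP={denom}")
        return self.world_size // denom


# Pure-EP starting point: TP=1 CP=1 PP=1 with a large EP.
pure_ep = Layout(world_size=64, ep=64)
print(pure_ep.dense_dp(), pure_ep.expert_dp())    # 64 1

# Modest PP with EP and ETP.
with_etp = Layout(world_size=64, pp=2, ep=16, etp=2)
print(with_etp.dense_dp(), with_etp.expert_dp())  # 32 1
```

Failing fast on a bad layout is cheaper than discovering it during process-group initialization on a full allocation.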
| Feature | FSDP | 3D parallel |
|---|---|---|
| HybridEP on GB200 | strong default | strong default once topology is stable |
| CUDA graphs | useful after bring-up | useful, but more scope-sensitive |
| Freeze vision | natural fit | possible, but less often used as the headline perf path |
| Selective recompute | recommended | recommended |
- Mock multimodal data is misleading: it can make the decoder look much healthier than the real end-to-end VLM path.
- The vision encoder can dominate unexpectedly: profile encoder, projector, and decoder separately before attributing everything to the dispatcher.
- Do not compare FSDP and 3D-parallel runs with different effective work: normalize by useful tokens and workload shape, not only by step time.
- ETP is not free: use it as a fit or topology tool, not as the default.
- Recompute and CUDA-graph choices are coupled: the setting that gets the model to fit is often not the setting that gives the best steady-state speed.
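Normalizing by useful tokens can be as simple as dividing non-padding tokens per step by step time and GPU count. The sketch below uses a hypothetical function name and made-up numbers, purely to show how a run with the shorter step time can still be the slower one.

```python
def useful_tokens_per_sec_per_gpu(useful_tokens_per_step: int,
                                  step_time_s: float,
                                  num_gpus: int) -> float:
    """Throughput normalized by useful (non-padding) tokens, per GPU."""
    if step_time_s <= 0 or num_gpus <= 0:
        raise ValueError("step_time_s and num_gpus must be positive")
    return useful_tokens_per_step / (step_time_s * num_gpus)


# Illustrative numbers only, not measurements.
run_a = useful_tokens_per_sec_per_gpu(524_288, step_time_s=4.0, num_gpus=64)
run_b = useful_tokens_per_sec_per_gpu(262_144, step_time_s=2.5, num_gpus=64)

print(f"run A: {run_a:.1f} tokens/s/GPU")  # run A: 2048.0 tokens/s/GPU
print(f"run B: {run_b:.1f} tokens/s/GPU")  # run B: 1638.4 tokens/s/GPU
```

Run B finishes each step sooner but processes half the useful tokens, so comparing step time alone would pick the wrong configuration.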