with one click
nemo-mbridge-perf-moe-long-context
Long-context MoE training guidance for Megatron Bridge. Covers CP sizing, selective recompute, dispatcher choices, and practical patterns from DSV3, Qwen3, and Qwen3-Next long-context experiments.
Menu
Long-context MoE training guidance for Megatron Bridge. Covers CP sizing, selective recompute, dispatcher choices, and practical patterns from DSV3, Qwen3, and Qwen3-Next long-context experiments.
Structured single-agent code review workflow for PRs, commits, and local diffs. Use when asked to review code, understand a PR, rubber duck a change, prepare GitHub review comments, compare a change against Megatron Bridge conventions, or produce high-signal findings without subagents or tmux.
Choose the right MoE token dispatcher (`alltoall`, DeepEP, or HybridEP) for the hardware, EP degree, and optimization stage. Summarizes patterns from DSV3, Qwen3, Qwen3-Next, and VLM bring-up work.
Representative MoE training playbooks by hardware platform and model family. Summarizes rounded throughput bands, parallelism patterns, and common tuning stacks.
Systematic workflow for MoE training optimization in Megatron Bridge, based on the Megatron-Core MoE paper. Covers the Three Walls framework, parallel folding, recompute strategy, dispatcher choice, and CUDA-graph bring-up.
Practical guidance for training MoE VLMs in Megatron Bridge. Compares FSDP and 3D-parallel approaches, using rounded lessons from Qwen3-VL, Qwen3-Next, and other multimodal experiments.
Operational guide for choosing and combining parallelism strategies in Megatron Bridge, including sizing rules, hardware topology mapping, and combined parallelism configuration.
| name | nemo-mbridge-perf-moe-long-context |
| description | Long-context MoE training guidance for Megatron Bridge. Covers CP sizing, selective recompute, dispatcher choices, and practical patterns from DSV3, Qwen3, and Qwen3-Next long-context experiments. |
| license | Apache-2.0 |
| when_to_use | Training MoE at long sequence lengths, or investigating a commit that caused long-context MoE OOM or degraded throughput; 'long context MoE', '128k tokens', 'CP sizing for long sequences', 'selective recompute long context', 'MoE long-context OOM'. |
Stable docs: @docs/training/moe-optimization.md Card: @skills/nemo-mbridge-perf-moe-long-context/card.yaml
Once sequence length moves well past the 4K-class regime, attention memory and activation residency become the dominant constraints. For MoE models, that usually means you need some combination of:
The DSV3 long-context runs show a stable pattern:
In other words, long context does not immediately collapse utilization if the layout is chosen well, but it does consume the DP budget very quickly.
Qwen3-Next behaves more like a memory-sensitive medium-scale model:
Qwen3 235B shows that long context can still be efficient on NVL72 systems when TP, CP, and HybridEP are coordinated. The best 128K-class configurations are not just "fit-only" recipes; they can remain highly efficient if routing, parallelism, and recompute are balanced.
Start from a 4K shard target: a good first guess is
CP ~= seq_len / 4096, then round to a practical power-of-two layout.
Keep DP alive if possible: long-context scaling becomes brittle once CP, EP, TP, and PP together squeeze DP down to the floor.
Prefer selective recompute: recompute modules such as up_proj, norm,
moe, moe_act, or mlp before reaching for full recompute.
Avoid SDPA-heavy recompute at very long context: recomputing attention internals can add a lot of work for less memory benefit than recomputing smaller MoE and MLP-side modules.
Use TP as another lever on NVL72 systems: GB200 and GB300 runs can sometimes trade some CP for TP while still staying efficient.
Assume GBS will need to shrink: as CP rises and DP falls, you may need to reduce global batch size or accept higher GA.
TP=1 CP=32 EP=32 PP=8 VPP=4
Precision: FP8-class
Dispatcher: DeepEP
Recompute: up_proj, norm, moe, mlp
Extra memory help: optimizer CPU offload
TP=1 CP=64 EP=32 PP=8 EDP=2 VPP=4
Precision: FP8-class
Dispatcher: DeepEP
Recompute: up_proj, norm, moe, mlp
Extra memory help: optimizer CPU offload
TP=4 CP=4 EP=32 PP=4 VPP=12
Precision: BF16 or MXFP8
Dispatcher: HybridEP
Recompute: moe_act, norm
CUDA Graph: attn + moe_router + moe_preprocess
For long-context MoE training:
Useful references:
CP does not replace EP or PP: it adds another dimension; it does not make the others disappear.
A good 4K baseline can still be a bad long-context baseline: routing mode, recompute choice, and offload strategy often need to change.
GPU-count feasibility becomes the real constraint: very long context can look fine in a single recipe, then become impossible once EP and PP are added honestly across the full model.
CUDA graphs need static shapes: variable-length batches and opportunistic padding strategies can silently break the path.
Container and kernel support matters more at 128K+: long-context paths tend to rely on newer kernels and bug fixes than short-context bring-up does.