nemo-mbridge-perf-moe-hardware-configs
Representative MoE training playbooks by hardware platform and model family. Summarizes rounded throughput bands, parallelism patterns, and common tuning stacks.
| Field | Value |
|---|---|
| name | nemo-mbridge-perf-moe-hardware-configs |
| description | Representative MoE training playbooks by hardware platform and model family. Summarizes rounded throughput bands, parallelism patterns, and common tuning stacks. |
| license | Apache-2.0 |
| when_to_use | Hardware-specific MoE playbooks or throughput estimates; 'MoE on H100', 'GB200 config', 'expected throughput', 'MoE hardware playbook', 'parallelism for B200'. |
Stable docs: @docs/training/moe-optimization.md

Card: @skills/nemo-mbridge-perf-moe-hardware-configs/card.yaml
| Platform | Typical MoE strategy | What usually matters most |
|---|---|---|
| H100 | DeepEP + stronger PP + moderate TP | communication overlap and PP efficiency |
| B200 | DeepEP + MXFP8 + careful PP layout | container quality and tuned comm settings |
| GB200 | HybridEP + partial CUDA graphs + CPU cleanup | host overhead, topology-aware dispatch, memory headroom |
| GB300 | HybridEP + newer FP8 and kernel stack | same GB200 playbook, usually with a higher ceiling |
For hardware playbook questions, answer from these canonical rows before adding throughput caveats:
| Workload | Hardware | Dispatcher | Layout |
|---|---|---|---|
| DSV3 | H100 | DeepEP | TP=2, EP=64, PP=8, VPP=4 |
| DSV3 | GB200/GB300 | HybridEP | TP=1, EP=64, PP=4, VPP=4 |
| Qwen3 235B | H100 | DeepEP | TP=2, EP=32, PP=8, VPP=4 |
| Qwen3 235B | GB200 | HybridEP | TP=1 or 2, EP=32-64, PP=4, VPP=unspecified |
For Qwen3 235B on GB200, explicitly say VPP=unspecified; do not invent or
extrapolate VPP=12 unless a measured row provides it. Include TE-scoped CUDA
graph scopes (attn, moe_router, moe_preprocess),
CUDA_DEVICE_MAX_CONNECTIONS selection,
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, NCCL_GRAPH_REGISTER=0,
GB200/GB300 CPU-side tuning, and the warning not to cargo-cult tracker rows.
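
One way to keep answers anchored to the canonical rows is to store them as data and render `VPP=unspecified` whenever no measured value exists. The sketch below is illustrative only: the `Layout` type, the `CANONICAL` dictionary, and the `describe` helper are hypothetical names, not part of Megatron Bridge. The GB200/GB300 row for DSV3 is entered once per platform.

```python
from typing import NamedTuple, Optional


class Layout(NamedTuple):
    dispatcher: str
    tp: str              # kept as text because some rows give a range
    ep: str
    pp: int
    vpp: Optional[int]   # None means no measured value is available


# Values copied from the canonical rows; keys are (workload, hardware).
CANONICAL = {
    ("DSV3", "H100"): Layout("DeepEP", "2", "64", 8, 4),
    ("DSV3", "GB200"): Layout("HybridEP", "1", "64", 4, 4),
    ("DSV3", "GB300"): Layout("HybridEP", "1", "64", 4, 4),
    ("Qwen3 235B", "H100"): Layout("DeepEP", "2", "32", 8, 4),
    ("Qwen3 235B", "GB200"): Layout("HybridEP", "1 or 2", "32-64", 4, None),
}


def describe(workload: str, hardware: str) -> str:
    """Render a canonical row, never inventing a VPP value."""
    row = CANONICAL.get((workload, hardware))
    if row is None:
        return f"{workload} on {hardware}: no canonical row"
    vpp = "unspecified" if row.vpp is None else str(row.vpp)
    return (
        f"{workload} on {hardware}: {row.dispatcher}, "
        f"TP={row.tp}, EP={row.ep}, PP={row.pp}, VPP={vpp}"
    )


print(describe("Qwen3 235B", "GB200"))
```

Storing the missing VPP as `None` rather than a guessed number makes it hard to extrapolate a value by accident.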
The throughput bands below are intentionally rounded so the document stays durable as the tracker moves. Treat them as planning ranges, not exact promises.
| Workload family | Hardware | Typical band | Representative shape |
|---|---|---|---|
| DSV3, large-scale | H100 | low-to-mid hundreds TFLOPS/GPU, high-teens MFU | TP2, EP64, PP8, DeepEP |
| DSV3, large-scale | B200 | high-hundreds TFLOPS/GPU, mid-teens MFU | TP1, EP32, PP8, DeepEP |
| DSV3, large-scale | GB200 | around 1K TFLOPS/GPU, low-20s MFU | TP1, EP64, PP4, HybridEP |
| DSV3, large-scale | GB300 | above the GB200 band, often mid-20s MFU | TP1, EP64, PP4, HybridEP |
| Qwen3 235B | H100 | low-300s TFLOPS/GPU, around 30% MFU | TP2, EP32, PP8, DeepEP |
| Qwen3 235B | GB200 | high-hundreds TFLOPS/GPU in tuned runs | TP1 or TP2, EP32-64, PP4, HybridEP |
| Qwen3 30B | H100 | low-200s TFLOPS/GPU | TP1, EP8, PP1, DeepEP |
| Qwen3-Next 80B | GB200 | low-300s TFLOPS/GPU in BF16-class runs | TP1, EP32, PP2, HybridEP |
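
The TFLOPS/GPU and MFU columns are tied together by a simple ratio: MFU is achieved throughput divided by the hardware peak for the precision in use. The sketch below checks that relationship for the Qwen3 235B on H100 row. The peak figure is an assumption (roughly the dense BF16 peak commonly quoted for H100 SXM), and the `mfu` helper is a hypothetical name.

```python
# Assumption: approximate dense BF16 peak for an H100 SXM GPU, in TFLOPS.
H100_BF16_PEAK_TFLOPS = 989.0


def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs utilization as a fraction of the chosen peak."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return achieved_tflops / peak_tflops


# A run in the low-300s TFLOPS/GPU lands near 30% MFU against this peak.
print(f"{mfu(300.0, H100_BF16_PEAK_TFLOPS):.1%}")
```

Because the denominator changes with the precision mode, the same absolute throughput can map to very different MFU values.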
DSV3 on H100:
Dispatcher: DeepEP
TP=2 EP=64 PP=8 VPP=4
Routing: force balance
Recompute: light-to-moderate selective recompute
Priority: overlap communication and keep PP efficient
DSV3 on B200:
Dispatcher: DeepEP
TP=1 EP=32 PP=8 VPP=2 or similar
Precision: MXFP8-class
Recompute: selective recompute around MLA up-projection and MLP-side modules
Priority: container quality, PP layout, and DeepEP SMS tuning
DSV3 on GB200/GB300:
Dispatcher: HybridEP
TP=1 EP=64 PP=4 VPP=4
Precision: MXFP8-class
CUDA Graph: attn + moe_router + moe_preprocess
Priority: HybridEP, CPU optimization, and graph-friendly static shapes
Qwen3 235B on H100:
Dispatcher: DeepEP
TP=2 EP=32 PP=8 VPP=4
Recompute: norm and activation-side selective recompute
Priority: communication overlap and router-path cleanup
Qwen3 235B on GB200:
Dispatcher: HybridEP
TP=1 or 2 EP=32 to 64 PP=4 VPP=unspecified unless measured
CUDA Graph: attn + moe_router + moe_preprocess
Recompute: moe_act, mlp, or norm depending on memory pressure
Priority: balance throughput against memory headroom
Qwen3-Next 80B on GB200:
Dispatcher: HybridEP
TP=1 EP=32 PP=2 VPP around 4
CUDA Graph: attn + moe_router + moe_preprocess
Priority: pipeline layout and grouped GEMM quality
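
The layouts above imply a minimum GPU count. The sketch below estimates it under a stated assumption: Megatron-Core style MoE parallel folding, where the ranks in each pipeline stage must factor both as TP × CP × DP (attention) and as ETP × EP × EDP (experts). The `min_world_size` helper is a hypothetical name, and the results are lower bounds under that assumption, not the scale of any tracked run.

```python
from math import lcm


def min_world_size(tp: int, ep: int, pp: int, cp: int = 1, etp: int = 1) -> int:
    """Smallest GPU count where both factorizations give integer DP and EDP."""
    # Ranks per pipeline stage must be a multiple of both group sizes.
    per_stage = lcm(tp * cp, etp * ep)
    return per_stage * pp


print(min_world_size(tp=2, ep=64, pp=8))  # DSV3 on H100 shape
print(min_world_size(tp=1, ep=64, pp=4))  # DSV3 HybridEP shape
print(min_world_size(tp=1, ep=32, pp=2))  # Qwen3-Next 80B shape
```

Halving PP from 8 to 4 at the same EP halves this lower bound, which is one reason PP shape and dispatcher choice are usually tuned together.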
Legend: E = embedding, t = transformer, m = MTP, L = loss, | = stage boundary.

The biggest platform difference is usually not just the dispatcher. It is the combination of dispatcher, PP shape, and whether VPP keeps each stage balanced.
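
To check whether a layout keeps stages balanced, the legend symbols can be counted per stage. The sketch below is a minimal parser that handles only the five plain symbols from the legend. It does not handle any repetition shorthand a real layout string may use, and the example string is hypothetical.

```python
from collections import Counter

# Symbol meanings taken from the legend; "|" separates stages.
SYMBOLS = {"E": "embedding", "t": "transformer", "m": "MTP", "L": "loss"}


def parse_layout(layout: str) -> list[Counter]:
    """Split a layout string into per-stage counts of each layer kind."""
    stages = []
    for chunk in layout.split("|"):
        unknown = set(chunk) - set(SYMBOLS)
        if unknown:
            raise ValueError(f"unknown layout symbols: {sorted(unknown)}")
        stages.append(Counter(SYMBOLS[c] for c in chunk))
    return stages


def transformer_counts(layout: str) -> list[int]:
    """Transformer layers per stage, a rough proxy for stage balance."""
    return [stage["transformer"] for stage in parse_layout(layout)]


# Hypothetical 4-stage layout: lighter first and last stages offset
# the extra embedding, MTP, and loss work placed there.
print(transformer_counts("Ett|ttt|ttt|ttmL"))
```

A large spread in these counts, without a matching embedding or loss load to explain it, is a hint that the split is costing pipeline efficiency.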
| Memory pressure | Starting point |
|---|---|
| low | none or a very narrow selective set |
| moderate | moe_act, mlp, norm, or similar selective modules |
| high | model-specific up-projection plus selective MoE and MLP modules |
| extreme or long-context | full recompute only if the selective path still does not fit |
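
The table reads as an escalation ladder: start at the lowest rung that fits and move up only when memory runs out. A minimal sketch of that ordering follows. The `RECOMPUTE_LADDER` and `escalate` names are hypothetical, and the values are the table's descriptive text rather than real config keys.

```python
# Rungs copied from the table, ordered from least to most recompute.
RECOMPUTE_LADDER = {
    "low": "none or a very narrow selective set",
    "moderate": "moe_act, mlp, norm, or similar selective modules",
    "high": "model-specific up-projection plus selective MoE and MLP modules",
    "extreme": "full recompute only if the selective path still does not fit",
}


def escalate(pressure: str) -> str:
    """Return the next rung to try, or the last rung if already there."""
    rungs = list(RECOMPUTE_LADDER)
    i = rungs.index(pressure)  # raises ValueError for an unknown label
    return rungs[min(i + 1, len(rungs) - 1)]


print(escalate("moderate"), "->", RECOMPUTE_LADDER[escalate("moderate")])
```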
CUDA_DEVICE_MAX_CONNECTIONS=1
CUDA_DEVICE_MAX_CONNECTIONS=32 # common when EP overlap and CUDA graphs are combined
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NCCL_GRAPH_REGISTER=0
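
These settings can be assembled programmatically before launching a job. The sketch below assumes a simple rule drawn from the values listed above: use 32 connections when EP overlap and CUDA graphs are combined, and 1 otherwise. The `moe_env` and `launch_env` helpers are hypothetical names, and the rule is a starting point rather than a measured recommendation.

```python
import os
from typing import Mapping, Optional


def moe_env(ep_overlap: bool, cuda_graphs: bool) -> dict[str, str]:
    """Environment variables from the list above, as a dict."""
    # Assumption: 32 only when EP overlap and CUDA graphs are both on.
    max_conn = "32" if (ep_overlap and cuda_graphs) else "1"
    return {
        "CUDA_DEVICE_MAX_CONNECTIONS": max_conn,
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        "NCCL_GRAPH_REGISTER": "0",
    }


def launch_env(
    ep_overlap: bool,
    cuda_graphs: bool,
    base: Optional[Mapping[str, str]] = None,
) -> dict[str, str]:
    """Copy the base environment and overlay the MoE settings."""
    env = dict(os.environ if base is None else base)
    env.update(moe_env(ep_overlap, cuda_graphs))
    return env


print(moe_env(ep_overlap=True, cuda_graphs=True))
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` so the settings apply only to the launched job.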
On GB200 and GB300, CPU affinity and general host-overhead cleanup can move the needle almost as much as a dispatcher swap. Treat them as first-class tuning work, not as afterthoughts.
Do not cargo-cult a tracker row: the winning config usually depends on routing mode, container, and PP layout as much as on hardware name.
Container quality matters: large regressions can come from the software stack rather than the model recipe.
VPP must be intentional: a bad VPP split can erase the gain from a better dispatcher.
Compare absolute throughput, not only MFU: MFU can mislead when switching between BF16, FP8, and other precision modes.
Force-balance routing is the safer benchmark default: keep routing mode fixed when comparing hardware or dispatcher stacks.
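
The last two points can be enforced mechanically when comparing benchmark runs. The sketch below refuses to compare runs with different routing modes and reports the absolute throughput ratio, flagging precision changes. The `Run` record, the `compare` helper, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    tflops_per_gpu: float
    routing: str    # e.g. "force_balance"
    precision: str  # e.g. "bf16" or "fp8"


def compare(baseline: Run, candidate: Run) -> str:
    """Compare absolute throughput, but only under a fixed routing mode."""
    if baseline.routing != candidate.routing:
        raise ValueError("routing modes differ; rerun with a fixed routing mode")
    ratio = candidate.tflops_per_gpu / baseline.tflops_per_gpu
    note = ""
    if baseline.precision != candidate.precision:
        note = " (precision differs; MFU not comparable)"
    return f"{candidate.name} vs {baseline.name}: {ratio:.2f}x absolute throughput{note}"


# Hypothetical numbers for illustration only.
a = Run("deepep-bf16", 200.0, "force_balance", "bf16")
b = Run("hybridep-fp8", 300.0, "force_balance", "fp8")
print(compare(a, b))
```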