| name | parallelism-strategies |
| description | Operational guide for choosing and combining parallelism strategies (TP/PP/DP/CP/SP/EP) for the SkyRL Megatron backend. Use when sizing parallelism for a new model, debugging OOM/throughput on a given cluster topology, or extending an existing recipe to a new GPU count. Includes model-size sizing rules, hardware topology mapping, sequence-length thresholds, MoE-specific patterns, and pitfalls. |
Parallelism Strategy Selection
Source. Adapted from NVIDIA Megatron-Bridge docs:
https://docs.nvidia.com/nemo/megatron-bridge/latest/skills/perf-techniques/parallelism-strategies/SKILL.html
Re-fetch from upstream when bumping the megatron-bridge pin in pyproject.toml.
SkyRL adaptation. Upstream uses cfg.model.<field>. In SkyRL these fields are surfaced through MegatronConfig (skyrl/train/config.py) and set on the CLI, e.g. trainer.megatron.tensor_model_parallel_size=... for SFT and trainer.policy.megatron_config.<field>=... for RL. Field names are otherwise identical.
Scope. Megatron backend only. FSDP and JAX backends do not use TP/PP/EP — see .claude/docs/backends/fsdp.md and .claude/docs/backends/jax.md.
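As a rough illustration of the field mapping in the SkyRL adaptation note, the sketch below renders a set of parallelism fields as key=value overrides under the two prefixes that note mentions. The helper name megatron_overrides is hypothetical and not part of SkyRL, and the RL prefix shown is only the policy one named above.

```python
def megatron_overrides(fields: dict, mode: str = "rl") -> list:
    """Render parallelism fields as key=value CLI overrides.

    The prefixes are taken from the SkyRL adaptation note; the helper
    itself is a hypothetical convenience, not a SkyRL API.
    """
    prefixes = {
        "sft": "trainer.megatron",
        "rl": "trainer.policy.megatron_config",
    }
    prefix = prefixes[mode]
    return [f"{prefix}.{name}={value}" for name, value in fields.items()]


overrides = megatron_overrides(
    {"tensor_model_parallel_size": 4, "pipeline_model_parallel_size": 2},
    mode="rl",
)
print(" ".join(overrides))
```

The resulting strings would be appended to whatever launch command the recipe already uses.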
Decision by Model Size
Dense models
| Model size | GPUs | Recommended starting point |
|---|---|---|
| < 1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |
MoE models
MoE parallelism differs from dense. Because only a fraction of parameters are active per token, TP can often stay at 1 or 2 — the active parameter shard already fits on a single GPU. EP is the primary scaling dimension, with PP handling cross-node layer distribution.
| Model (total / active) | TP | PP | EP | Notes |
|---|---|---|---|---|
| OLMoE 7B / 1B | 1 | 1 | 8 | EP only, fits single node |
| Moonlight 16B / 3B | 2 | 1 | 8 | small TP for shared layers |
| DeepSeek-V2 236B / 21B | 1 | 4 | 32 | no TP at all |
| GLM-4.5 Air 106B / 12B | 1 | 4 | 8 | no TP at all |
| Qwen3 30B-A3B | 4 | 2 | 4 | |
| GLM-4.5 355B / 32B | 2 | 8 | 16 | |
| Qwen3 235B-A22B | 4 | 16 | 8 | CP=2 for pretrain |
| DeepSeek-V3 671B / 37B | 2 | 16 | 64 | TP=2, not 8 |
| Kimi-K2 1T | 2 | 16 | 32 | |
Key patterns:
- TP is sized by active params, not total params. A 671B MoE with 37B active needs far less TP than a 70B dense model.
- EP scales with expert count. Common: EP = num_experts, or EP = num_experts / experts_per_gpu.
- PP handles depth. Large MoE models use PP=8-16 across nodes.
- ETP (expert tensor parallelism) is rarely used. Llama 4 is an exception (ETP=4).
These are starting points, not hard rules. Always profile the first iteration to verify memory and communication.
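The EP sizing rule from the key patterns can be sketched as a small helper. The function name and the 64-expert example model are hypothetical; the only rule encoded is EP = num_experts / experts_per_gpu, with the added assumption that experts should divide evenly across EP ranks.

```python
def expert_parallel_size(num_experts: int, experts_per_gpu: int = 1) -> int:
    """Return EP = num_experts / experts_per_gpu.

    Assumes experts are spread evenly, so num_experts must be a multiple
    of experts_per_gpu. With experts_per_gpu=1 this gives EP = num_experts.
    """
    if experts_per_gpu < 1 or num_experts % experts_per_gpu != 0:
        raise ValueError(
            f"{num_experts} experts cannot be split evenly into groups of {experts_per_gpu}"
        )
    return num_experts // experts_per_gpu


# Hypothetical 64-expert model, 8 experts hosted per GPU.
ep = expert_parallel_size(num_experts=64, experts_per_gpu=8)
print(f"expert_model_parallel_size = {ep}")
```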
Decision by Hardware Topology
Single node with NVLink:
cfg.model.tensor_model_parallel_size = 8
Multiple nodes with InfiniBand/RoCE:
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = N
Limited network (Ethernet):
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = M
Stable rule: keep TP within a single NVLink domain. Use PP or DP for cross-node scaling. TP across nodes is almost always a performance loss.
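The stable rule can be checked before launch. The sketch below assumes that one NVLink domain equals one node of gpus_per_node GPUs and that TP groups are formed from consecutive ranks; both are assumptions to verify against the actual cluster and launcher. The function name is hypothetical.

```python
def tp_group_spans_nodes(tp: int, gpus_per_node: int) -> bool:
    """Return True if some TP group would cross a node boundary.

    Assumes TP groups are consecutive ranks [k*tp, (k+1)*tp) and that
    ranks fill one node before moving to the next.
    """
    if tp < 1 or gpus_per_node < 1:
        raise ValueError("tp and gpus_per_node must be positive")
    # A group fits inside a node only if TP is no larger than the node
    # and node size is a multiple of TP (otherwise one group straddles).
    return tp > gpus_per_node or gpus_per_node % tp != 0


for tp in (4, 8, 16):
    verdict = "crosses nodes" if tp_group_spans_nodes(tp, gpus_per_node=8) else "stays in node"
    print(f"TP={tp}: {verdict}")
```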
Decision by Sequence Length
| Sequence length | Recommendation |
|---|---|
| < 2K | standard TP + PP + DP |
| 2K-8K | add SP (sequence_parallel=True) |
| 8K-32K | add CP=2 |
| 32K+ | add CP=4-8; consider a2a+p2p communication (hierarchical CP) for large CP |
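The thresholds in the table can be encoded as a lookup, shown here as a minimal sketch. It assumes K means 1024 tokens and that a boundary value (for example exactly 8K) falls into the longer-sequence row; the function name is hypothetical.

```python
def sequence_length_recommendation(seq_length: int) -> str:
    """Map a sequence length to the starting recommendation from the table.

    Assumes K = 1024 tokens and assigns boundary values to the longer bucket.
    """
    k = 1024
    if seq_length < 2 * k:
        return "standard TP + PP + DP"
    if seq_length < 8 * k:
        return "add SP (sequence_parallel=True)"
    if seq_length < 32 * k:
        return "add CP=2"
    return "add CP=4-8"


for length in (1024, 4096, 16384, 65536):
    print(length, "->", sequence_length_recommendation(length))
```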
Combined Parallelism
3D parallelism (TP + PP + DP):
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = 4
cfg.model.sequence_parallel = True
4D parallelism (TP + PP + CP + DP):
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = 8
cfg.model.context_parallel_size = 2
cfg.model.sequence_parallel = True
MoE with EP + PP (e.g. DeepSeek-V2 236B on 128 GPUs):
cfg.model.tensor_model_parallel_size = 1
cfg.model.pipeline_model_parallel_size = 4
cfg.model.expert_model_parallel_size = 32
cfg.model.sequence_parallel = False
MoE with small TP + PP + EP (e.g. DeepSeek-V3 671B on 256 GPUs):
cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 16
cfg.model.expert_model_parallel_size = 64
cfg.model.sequence_parallel = True
DP size is always implicit:
data_parallel_size = world_size / (TP * PP * CP)
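Because DP is derived rather than set, it helps to compute it explicitly and fail early when the world size is not divisible by TP * PP * CP. The sketch below applies the formula above to the GPU counts used in this guide; the function name is hypothetical.

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Return world_size / (TP * PP * CP), rejecting non-divisible layouts."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


# Layouts taken from the examples in this guide.
print(data_parallel_size(128, tp=1, pp=4))   # MoE with EP + PP example
print(data_parallel_size(256, tp=2, pp=16))  # MoE with small TP + PP + EP example
print(data_parallel_size(64, tp=4, pp=4))    # dense TP=4, PP=4 layout
```

Note that EP does not appear in the formula: this sketch only covers the TP/PP/CP product stated above.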
Memory Estimation
Without parallelism (70B model, FP16):
parameters: 140 GB
gradients: 140 GB
optimizer states: 280 GB (Adam, two moments at 2 bytes per parameter)
activations: 48 GB (batch=1, seq=4K)
total: 608 GB
With TP=4, PP=4, DP=4 (64 GPUs):
parameters: 8.75 GB per GPU
gradients: 8.75 GB per GPU
optimizer states: 17.50 GB per GPU
activations: 3.00 GB per GPU
total: ~38 GB per GPU
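The arithmetic behind both breakdowns can be reproduced with a short sketch. It follows the simplified model used above: 2 bytes per parameter, gradients the same size as parameters, optimizer states twice that, and every term divided by TP * PP (DP replicates the model, so it does not reduce these numbers here). The function name is hypothetical, and real runs will differ with FP32 master weights, a distributed optimizer, or activation recomputation.

```python
def memory_breakdown_gb(params_billion, tp=1, pp=1, activations_gb=0.0, bytes_per_param=2):
    """Per-GPU memory in GB under the simplified model from this section."""
    shards = tp * pp  # DP is deliberately excluded: it replicates, not shards
    params = params_billion * bytes_per_param / shards
    breakdown = {
        "parameters": params,
        "gradients": params,             # same size and precision as parameters
        "optimizer_states": 2 * params,  # two Adam moments per parameter
        "activations": activations_gb / shards,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


unsharded = memory_breakdown_gb(70, activations_gb=48)
sharded = memory_breakdown_gb(70, tp=4, pp=4, activations_gb=48)
for name in unsharded:
    print(f"{name}: {unsharded[name]:.2f} GB -> {sharded[name]:.2f} GB per GPU")
```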
Code Anchors
Parallelism dimensions set in the model provider:
model_config = GPTModelProvider(
    tensor_model_parallel_size=2,
)
DP size:
data_parallel_size = world_size / (tensor_model_parallel_size * pipeline_model_parallel_size * context_parallel_size)
Megatron-Bridge wires parallelism into process groups via:
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=model_config.tensor_model_parallel_size,
    pipeline_model_parallel_size=model_config.pipeline_model_parallel_size,
    context_parallel_size=model_config.context_parallel_size,
    hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
    expert_model_parallel_size=model_config.expert_model_parallel_size,
    ...
)
Pitfalls
- TP across nodes destroys throughput. Always keep TP within a single NVLink domain.
- PP without interleaving has large pipeline bubbles. Use virtual_pipeline_model_parallel_size when possible.
- SP requires tensor_model_parallel_size > 1. Enabling SP alone without TP is a config error. (SkyRL note: SP is auto-enabled when TP > 1; there is no separate config knob. See .claude/docs/backends/megatron.md.)
- CP requires seq_length % (2 * context_parallel_size) == 0.
- EP is only for MoE models. Setting expert_model_parallel_size on a dense model is a no-op or error.
- The model-size-to-parallelism table is a starting heuristic. Always profile the first iteration to check memory and communication.
- CUDA_DEVICE_MAX_CONNECTIONS and related env vars interact with overlap settings.
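Several of the pitfalls above are mechanical and can be caught before a job is submitted. The following is a minimal pre-flight sketch, not Megatron's own validation: ParallelConfig is a hypothetical container whose field names mirror the config fields used in this guide, gpus_per_node stands in for the NVLink domain size, and num_moe_experts=None is assumed to mean a dense model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParallelConfig:
    """Hypothetical container; not a Megatron or SkyRL class."""
    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    context_parallel_size: int = 1
    expert_model_parallel_size: int = 1
    sequence_parallel: bool = False
    seq_length: int = 4096
    num_moe_experts: Optional[int] = None  # None is assumed to mean dense
    gpus_per_node: int = 8                 # stand-in for the NVLink domain size


def find_pitfalls(cfg: ParallelConfig) -> List[str]:
    """Return human-readable warnings for the pitfalls listed above."""
    problems = []
    if cfg.tensor_model_parallel_size > cfg.gpus_per_node:
        problems.append("TP exceeds gpus_per_node, so TP groups cross nodes")
    if cfg.sequence_parallel and cfg.tensor_model_parallel_size <= 1:
        problems.append("SP requires tensor_model_parallel_size > 1")
    cp = cfg.context_parallel_size
    if cp > 1 and cfg.seq_length % (2 * cp) != 0:
        problems.append("CP requires seq_length % (2 * context_parallel_size) == 0")
    if cfg.expert_model_parallel_size > 1 and cfg.num_moe_experts is None:
        problems.append("EP set on a dense model")
    return problems


bad = ParallelConfig(
    tensor_model_parallel_size=1,
    context_parallel_size=4,
    expert_model_parallel_size=8,
    sequence_parallel=True,
    seq_length=4100,
)
for problem in find_pitfalls(bad):
    print("WARNING:", problem)
```

A check like this does not replace profiling the first iteration; it only filters out configurations that are known to be invalid or slow.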