mit einem Klick
distributed-training
// Guide for selecting and configuring distributed training strategies in NeMo AutoModel, including FSDP2, Megatron FSDP, DDP, and parallelism settings.
// Guide for selecting and configuring distributed training strategies in NeMo AutoModel, including FSDP2, Megatron FSDP, DDP, and parallelism settings.
Maintain the NeMo AutoModel Fern docs site under fern/ — add, update, move, or remove pages; manage redirects, slugs, navigation, and version aliases; run validation and previews.
Guide for onboarding new model families into NeMo AutoModel, including architecture discovery, implementation patterns, registration, and validation.
Dev environment setup for NeMo AutoModel — container-based development, uv package management, installation options, environment variables, and common build pitfalls.
CI/CD reference for NeMo AutoModel — pipeline structure, commit and PR workflow, CI failure investigation, and common failure patterns.
Configure NeMo AutoModel job launches for interactive runs, Slurm clusters, and SkyPilot cloud execution.
Code style and quality rules for NeMo AutoModel — ruff configuration, naming conventions, type hints, docstrings, copyright headers, and the code review checklist.
| name | distributed-training |
| description | Guide for selecting and configuring distributed training strategies in NeMo AutoModel, including FSDP2, Megatron FSDP, DDP, and parallelism settings. |
| when_to_use | Adding or modifying distributed training strategies (FSDP2, HSDP, DDP), debugging multi-GPU or multi-node failures, configuring context or tensor parallelism, or tuning sharding settings. |
NeMo AutoModel uses PyTorch-native distributed training.
All parallelism is orchestrated through a single MeshContext object that
holds device meshes, strategy configs, and axis names.
Three strategies are available, selected via the distributed.strategy YAML key:
| Strategy | YAML value | Best for |
|---|---|---|
| FSDP2 | fsdp2 | General use, recommended default. Supports TP, PP, CP, EP, HSDP. |
| MegatronFSDP | megatron_fsdp | NVIDIA Megatron-style FSDP. No PP, no EP, no sequence_parallel. |
| DDP | ddp | Simple data parallelism only. No TP, PP, CP, or EP. |
Decision tree:
fsdp2 (default). Use ddp only if you need the simplest possible setup.fsdp2 with appropriate TP/PP sizing.fsdp2 with ep_size > 1 (creates a separate moe_mesh).fsdp2 with PP + TP.cp_size > 1).The distributed section in the recipe YAML maps directly to
parse_distributed_section() in recipes/_dist_setup.py:
distributed:
strategy: fsdp2 # fsdp2 | megatron_fsdp | ddp
dp_size: none # auto-calculated from world_size / (tp * pp * cp)
dp_replicate_size: none # FSDP2-only, for HSDP
tp_size: 1
pp_size: 1
cp_size: 1
ep_size: 1
# Strategy-specific flags (forwarded to the strategy dataclass):
sequence_parallel: false
activation_checkpointing: false
defer_fsdp_grad_sync: true # FSDP2 only
# Sub-configs (optional):
pipeline:
pp_schedule: 1f1b
pp_microbatch_size: 1
# ... see PipelineConfig fields
moe:
reshard_after_forward: false
# ... see MoEParallelizerConfig fields
The dp_size is always inferred:
dp_size = world_size / (tp_size * pp_size * cp_size)
YAML distributed section
-> parse_distributed_section() [recipes/_dist_setup.py]
-> setup_distributed() [recipes/_dist_setup.py]
-> create_device_mesh() [components/distributed/device_mesh.py]
-> MeshContext(...) [components/distributed/mesh.py]
-> instantiate_infrastructure() [_transformers/infrastructure.py]
-> _instantiate_distributed() -> FSDP2Manager / MegatronFSDPManager / DDPManager
-> _instantiate_pipeline() -> AutoPipeline (if pp_size > 1)
-> parallelize_fn -> MoE parallelizer (if ep_size > 1) or PP wrapper
-> apply_model_infrastructure() [_transformers/infrastructure.py]
-> _shard_pp() or _shard_ep_fsdp() (applies sharding to the model)
distributed:
strategy: fsdp2
tp_size: 1
cp_size: 1
This auto-calculates dp_size = world_size and applies fully_shard() per
transformer block via DTensor-based sharding.
Keep TP within a single NVLink domain (typically one node):
distributed:
strategy: fsdp2
tp_size: 4 # 2, 4, or 8 -- must divide GPUs per node
sequence_parallel: true
The TP plan is auto-selected based on the model type. Pass a custom plan via the Python API if needed:
config = FSDP2Config(sequence_parallel=True, tp_plan=my_custom_plan)
distributed:
strategy: fsdp2
pp_size: 2
pipeline:
pp_schedule: interleaved1f1b # 1f1b, gpipe, interleaved_1f1b, etc.
pp_microbatch_size: 4
scale_grads_in_schedule: false
The model must have a _pp_plan attribute (set on the HF model class) for
AutoPipeline to know how to split layers across stages. Models without
_pp_plan are not compatible with PP.
Intra-node full sharding + inter-node replication via a 2D DeviceMesh:
distributed:
strategy: fsdp2
dp_replicate_size: 2 # must divide dp_size
Constraint: dp_replicate_size < dp_size (pure replication with no sharding
is not supported by FSDP2).
Trades compute for memory by recomputing activations during backward:
distributed:
activation_checkpointing: true
This is forwarded to the strategy config for non-EP models, or read from
MeshContext.activation_checkpointing for EP models.
FSDP2 defers gradient sync to the final micro-batch by default for communication overlap:
distributed:
defer_fsdp_grad_sync: true # default
FSDP2Config defaults to bfloat16 for all three precision knobs via
MixedPrecisionPolicy(param_dtype=bf16, reduce_dtype=bf16, output_dtype=bf16, cast_forward_inputs=True). Override via the Python API:
from torch.distributed.fsdp import MixedPrecisionPolicy
config = FSDP2Config(
mp_policy=MixedPrecisionPolicy(param_dtype=torch.float16, reduce_dtype=torch.float32),
)
_pp_plan (a dict mapping module FQNs to stages).pp_size > 1 in the distributed section.pipeline sub-config with schedule and microbatch size.Defined in PipelineConfig.pp_schedule:
1f1b (one-forward-one-backward, default)gpipeinterleaved_1f1b / interleaved1f1blooped_bfsdfsv_schedulezero_bubbledistributed:
strategy: fsdp2
pp_size: 2
pipeline:
pp_schedule: interleaved1f1b
pp_microbatch_size: 4
scale_grads_in_schedule: false
checkpoint:
model_save_format: safetensors
save_consolidated: true
AutoPipeline.build() calls pipeline_model() which splits the model into
stages using the model's _pp_plan, creates PipelineStage objects, and
builds the schedule. During training, schedule.step() drives forward and
backward through the pipeline.
Use CP for long sequences (8K+). CP shards Q/K/V on the sequence dimension as DTensors.
distributed:
strategy: fsdp2
cp_size: 2 # or 4, 8
is_causal=True is set via
forward pre-hooks registered by attach_context_parallel_hooks().apply_model_infrastructure() calls
attach_context_parallel_hooks() on each model part (for non-TE models).make_cp_batch_and_ctx() creates a CP context
manager that shards the batch along the sequence dimension and sets up
context_parallel() from torch.distributed.tensor.experimental.make_cp_batch_for_te() uses THD format and
TE's thd_get_partitioned_indices for sharding.CP works with packed sequences. The packed_sequence_size must be divisible
by cp_size. When using TE, chunks are sharded per-chunk via
_shard_thd_chunk_for_te().
Packing multiple sequences into a single training sample for efficiency.
packed_sequence:
packed_sequence_size: 4096 # 0 = disabled
step_scheduler:
local_batch_size: 1 # must be 1 for packed sequences
When packed_sequence_size > 0, the dataset collator packs sequences up to
that length. local_batch_size must be 1 because each "sample" is already a
packed batch.
When using CP with packed sequences, packed_sequence_size must be evenly
divisible by cp_size.
Set ep_size > 1 to distribute experts across GPUs. This creates a separate
moe_mesh alongside the main device_mesh:
distributed:
strategy: fsdp2
ep_size: 8
activation_checkpointing: true
The moe_mesh shape is (pp_size, ep_shard_size, ep_size) with dimension
names ("pp", "ep_shard", "ep").
Constraint: dp_cp_size (= dp_size * cp_size) must be divisible by
ep_size.
distributed:
strategy: fsdp2
ep_size: 8
activation_checkpointing: true
moe:
reshard_after_forward: false
ignore_router_for_ac: false
wrap_outer_model: true
The moe sub-section maps to MoEParallelizerConfig and is only
instantiated when ep_size > 1.
distributed:
strategy: fsdp2
tp_size: 1
cp_size: 1
pp_size: 1
ep_size: 8
sequence_parallel: false
activation_checkpointing: true
Despite its name, megatron_fsdp does not support expert parallelism
(ep_size > 1), pipeline parallelism (pp_size > 1), or
sequence_parallel. Use fsdp2 for these features.
| Model size | TP | PP | CP | Strategy |
|---|---|---|---|---|
| < 3B | 1 | 1 | 1 | FSDP2 (DP only) |
| 3-13B | 2-4 | 1 | 1 | FSDP2 + TP |
| 13-70B | 4-8 | 2-4 | 1 | FSDP2 + TP + PP |
| 70B+ | 8 | 4-8 | 1 | FSDP2 + TP + PP |
| Any + long seq (8K+) | as above | as above | 2-8 | add CP |
MoE models need less TP than dense models of similar total parameter count because only a fraction of parameters are active per token. EP is the primary scaling dimension:
| Model | TP | PP | EP | Notes |
|---|---|---|---|---|
| Small MoE (<10B total) | 1 | 1 | 8 | EP only |
| Medium MoE (10-30B total) | 1-2 | 1 | 8 | small TP for shared layers |
| Large MoE (100B+ total) | 1-2 | 4+ | 8-64 | PP for depth, EP for experts |
When not using YAML recipes, configure distributed training via Python:
from nemo_automodel.components.distributed.config import FSDP2Config
from nemo_automodel.components.distributed.device_mesh import create_device_mesh
from nemo_automodel.components.distributed.mesh import MeshContext
from nemo_automodel._transformers.infrastructure import instantiate_infrastructure
# 1. Create strategy config
config = FSDP2Config(sequence_parallel=True, activation_checkpointing=True)
# 2. Create device mesh
device_mesh, moe_mesh = create_device_mesh(
config, tp_size=2, pp_size=1, cp_size=1, ep_size=1, world_size=8,
)
# 3. Build MeshContext
mesh = MeshContext.from_meshes(
device_mesh, moe_mesh, strategy_config=config, activation_checkpointing=True,
)
# 4. Instantiate infrastructure
model_wrapper, autopipeline, parallelize_fn, qat_quantizer = instantiate_infrastructure(
distributed_config=config, mesh=mesh,
)
Or pass directly to from_pretrained:
model = NeMoAutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-1B",
distributed_config=FSDP2Config(activation_checkpointing=True),
tp_size=2,
)
Strategy config dataclasses:
components/distributed/config.py
FSDP2Config -- sequence_parallel, tp_plan, mp_policy, offload_policy,
activation_checkpointing, defer_fsdp_grad_sync
MegatronFSDPConfig -- zero_dp_strategy, overlap_grad_reduce, overlap_param_gather, etc.
DDPConfig -- activation_checkpointing only
MeshContext (single source of truth for parallelism):
components/distributed/mesh.py
MeshContext -- strategy_config, device_mesh, moe_mesh, pipeline_config, moe_config
Properties: tp_size, pp_size, cp_size, ep_size, dp_size, dp_replicate_size
STRATEGY_MAP -- {"fsdp2": FSDP2Config, "megatron_fsdp": MegatronFSDPConfig, "ddp": DDPConfig}
MeshAxisName -- PP, DP, DP_REPLICATE, DP_SHARD, DP_SHARD_CP, DP_CP, CP, TP, EP, EP_SHARD
Device mesh creation:
components/distributed/device_mesh.py
create_device_mesh() -- routes to FSDP2/MegatronFSDP/DDP mesh creation
_create_fsdp2_device_mesh() -- shape (pp, dp_replicate, dp_shard, cp, tp) + flattened submeshes
_create_moe_mesh() -- shape (pp, ep_shard, ep)
Distributed managers:
components/distributed/fsdp2.py -- FSDP2Manager.parallelize()
components/distributed/megatron_fsdp.py -- MegatronFSDPManager.parallelize()
components/distributed/ddp.py -- DDPManager
Pipeline parallelism:
components/distributed/pipelining/config.py -- PipelineConfig dataclass
components/distributed/pipelining/autopipeline.py -- AutoPipeline orchestrator
components/distributed/pipelining/functional.py -- pipeline_model(), schedule creation
components/distributed/pipelining/hf_utils.py -- HF model validation for PP
Context parallelism:
components/distributed/cp_utils.py
make_cp_batch_and_ctx() -- creates CP context manager + shards batch
create_context_parallel_ctx() -- wraps torch.distributed.tensor.experimental.context_parallel
attach_context_parallel_hooks() -- strips attention_mask, sets is_causal=True
make_cp_batch_for_te() -- TE-specific CP batch sharding (THD format)
Infrastructure orchestration:
_transformers/infrastructure.py
instantiate_infrastructure() -- config objects -> runtime objects
apply_model_infrastructure() -- applies sharding, PEFT, checkpoints to model
_shard_pp() -- pipeline parallel path
_shard_ep_fsdp() -- EP + FSDP path (non-PP)
YAML parsing:
recipes/_dist_setup.py
parse_distributed_section() -- YAML dict -> typed configs + sizes
setup_distributed() -- full entry-point: parse + create meshes + MeshContext
MoE config:
components/moe/config.py
MoEParallelizerConfig -- reshard_after_forward, ignore_router_for_ac, wrap_outer_model, etc.
MoEConfig -- n_routed_experts, n_activated_experts, score_func, etc.
TP across nodes destroys throughput. Always keep TP within a single NVLink domain. Use PP or DP for cross-node scaling.
PP requires _pp_plan on the model class. Not all HF models have this.
Check validate_hf_model_for_pipeline_support() before enabling PP.
PP bubbles reduce GPU utilization. Use interleaved schedules
(interleaved_1f1b) and smaller microbatches to reduce bubble time.
FSDP2 requires DTensor-aware state dict saving. Use safetensors with
save_consolidated: true for checkpoint compatibility.
CP requires compatible attention. SDPA (Flash Attention or Efficient
Attention) or TE attention only. SDPBackend.MATH is not compatible with
DTensor.
MoE EP size must evenly divide dp_size * cp_size. The device mesh
creation asserts dp_cp_size % ep_size == 0.
MegatronFSDP is more limited than FSDP2. It does not support PP
(pp_size > 1), EP (ep_size > 1), or sequence_parallel. The
MeshContext validation raises on these combinations.
DDP supports nothing beyond data parallelism. No TP, PP, CP, EP, or HSDP. Validation raises on any of these.
Activation checkpointing increases compute. It saves memory by recomputing activations during backward, but adds ~30% compute overhead.
Mixed precision policy must match model expectations. The default
bfloat16 policy works for most models. FP16 models may need a custom
MixedPrecisionPolicy.
packed_sequence_size must be divisible by cp_size when using CP
with packed sequences.
dp_replicate_size is FSDP2-only. Passing it with megatron_fsdp
or ddp raises a ValueError.
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc-per-node=2 \
recipes/llm_finetune/finetune.py \
--config examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc-per-node=4 \
recipes/llm_finetune/finetune.py \
--config examples/llm_finetune/llama3_1/llama3_1_8b_hellaswag_pp.yaml
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc-per-node=2 \
recipes/llm_finetune/finetune.py \
--config examples/llm_finetune/llama3_2/llama3_2_1b_squad_megatron_fsdp.yaml
torchrun --nproc-per-node=8 \
recipes/llm_finetune/finetune.py \
--config examples/llm_finetune/qwen/qwen3_moe_30b_te_deepep.yaml
Success criteria: