Jeden Skill in Manus ausführen
mit einem Klick

Jeden Skill in Manus mit einem Klick ausführen

$pwd:

distributed-training

Name: Distributed Training
Author: NVIDIA-NeMo

// Guide for selecting and configuring distributed training strategies in NeMo AutoModel, including FSDP2, Megatron FSDP, DDP, and parallelism settings.

In Manus ausführen

$ git log --oneline --stat

stars:521

forks:160

updated:4. Mai 2026 um 19:34

SKILL.md

readonly

related-skills.json

gleiches Repository

fern-docs.md

from "NVIDIA-NeMo/Automodel"

Maintain the NeMo AutoModel Fern docs site under fern/ — add, update, move, or remove pages; manage redirects, slugs, navigation, and version aliases; run validation and previews.

2026-05-19521

model-onboarding.md

from "NVIDIA-NeMo/Automodel"

Guide for onboarding new model families into NeMo AutoModel, including architecture discovery, implementation patterns, registration, and validation.

2026-05-15521

build-and-dependency.md

from "NVIDIA-NeMo/Automodel"

Dev environment setup for NeMo AutoModel — container-based development, uv package management, installation options, environment variables, and common build pitfalls.

2026-05-04521

cicd.md

from "NVIDIA-NeMo/Automodel"

CI/CD reference for NeMo AutoModel — pipeline structure, commit and PR workflow, CI failure investigation, and common failure patterns.

2026-05-04521

launcher-config.md

from "NVIDIA-NeMo/Automodel"

Configure NeMo AutoModel job launches for interactive runs, Slurm clusters, and SkyPilot cloud execution.

2026-05-04521

linting-and-formatting.md

from "NVIDIA-NeMo/Automodel"

Code style and quality rules for NeMo AutoModel — ruff configuration, naming conventions, type hints, docstrings, copyright headers, and the code review checklist.

2026-05-04521

package.json

"author": "NVIDIA-NeMo"

"repository": "NVIDIA-NeMo/Automodel"

GitHub-Repository öffnen Creator-Repositorys ansehen

$ install --global

$ download --local

In Manus ausführen

$ useful --forSOC

SoftwareentwicklerInformatik- und Mathematikberufe15-1252L4

name	distributed-training
description	Guide for selecting and configuring distributed training strategies in NeMo AutoModel, including FSDP2, Megatron FSDP, DDP, and parallelism settings.
when_to_use	Adding or modifying distributed training strategies (FSDP2, HSDP, DDP), debugging multi-GPU or multi-node failures, configuring context or tensor parallelism, or tuning sharding settings.

Distributed Training in NeMo AutoModel

NeMo AutoModel uses PyTorch-native distributed training. All parallelism is orchestrated through a single MeshContext object that holds device meshes, strategy configs, and axis names.

Strategy Selection

Three strategies are available, selected via the distributed.strategy YAML key:

Strategy	YAML value	Best for
FSDP2	`fsdp2`	General use, recommended default. Supports TP, PP, CP, EP, HSDP.
MegatronFSDP	`megatron_fsdp`	NVIDIA Megatron-style FSDP. No PP, no EP, no sequence_parallel.
DDP	`ddp`	Simple data parallelism only. No TP, PP, CP, or EP.

Decision tree:

Single GPU: no distributed config needed (FSDP2Manager skips parallelization when world_size=1).
Multi-GPU single node: fsdp2 (default). Use ddp only if you need the simplest possible setup.
Multi-node: fsdp2 with appropriate TP/PP sizing.
MoE models with expert parallelism: fsdp2 with ep_size > 1 (creates a separate moe_mesh).
Large models (70B+): fsdp2 with PP + TP.
Long sequences (8K+): add CP (cp_size > 1).

YAML Config Structure

The distributed section in the recipe YAML maps directly to parse_distributed_section() in recipes/_dist_setup.py:

distributed:
  strategy: fsdp2           # fsdp2 | megatron_fsdp | ddp
  dp_size: none             # auto-calculated from world_size / (tp * pp * cp)
  dp_replicate_size: none   # FSDP2-only, for HSDP
  tp_size: 1
  pp_size: 1
  cp_size: 1
  ep_size: 1

  # Strategy-specific flags (forwarded to the strategy dataclass):
  sequence_parallel: false
  activation_checkpointing: false
  defer_fsdp_grad_sync: true   # FSDP2 only

  # Sub-configs (optional):
  pipeline:
    pp_schedule: 1f1b
    pp_microbatch_size: 1
    # ... see PipelineConfig fields

  moe:
    reshard_after_forward: false
    # ... see MoEParallelizerConfig fields

The dp_size is always inferred:

dp_size = world_size / (tp_size * pp_size * cp_size)

Infrastructure Flow

YAML distributed section
    -> parse_distributed_section()          [recipes/_dist_setup.py]
    -> setup_distributed()                  [recipes/_dist_setup.py]
        -> create_device_mesh()             [components/distributed/device_mesh.py]
        -> MeshContext(...)                  [components/distributed/mesh.py]
    -> instantiate_infrastructure()         [_transformers/infrastructure.py]
        -> _instantiate_distributed()       -> FSDP2Manager / MegatronFSDPManager / DDPManager
        -> _instantiate_pipeline()          -> AutoPipeline (if pp_size > 1)
        -> parallelize_fn                   -> MoE parallelizer (if ep_size > 1) or PP wrapper
    -> apply_model_infrastructure()         [_transformers/infrastructure.py]
        -> _shard_pp() or _shard_ep_fsdp()  (applies sharding to the model)

FSDP2 Configuration

Basic FSDP2 (data parallelism only)

distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1

This auto-calculates dp_size = world_size and applies fully_shard() per transformer block via DTensor-based sharding.

FSDP2 with Tensor Parallelism

Keep TP within a single NVLink domain (typically one node):

distributed:
  strategy: fsdp2
  tp_size: 4        # 2, 4, or 8 -- must divide GPUs per node
  sequence_parallel: true

The TP plan is auto-selected based on the model type. Pass a custom plan via the Python API if needed:

config = FSDP2Config(sequence_parallel=True, tp_plan=my_custom_plan)

FSDP2 with Pipeline Parallelism

distributed:
  strategy: fsdp2
  pp_size: 2
  pipeline:
    pp_schedule: interleaved1f1b   # 1f1b, gpipe, interleaved_1f1b, etc.
    pp_microbatch_size: 4
    scale_grads_in_schedule: false

The model must have a _pp_plan attribute (set on the HF model class) for AutoPipeline to know how to split layers across stages. Models without _pp_plan are not compatible with PP.

FSDP2 with HSDP (Hybrid Sharded Data Parallel)

Intra-node full sharding + inter-node replication via a 2D DeviceMesh:

distributed:
  strategy: fsdp2
  dp_replicate_size: 2   # must divide dp_size

Constraint: dp_replicate_size < dp_size (pure replication with no sharding is not supported by FSDP2).

Activation Checkpointing

Trades compute for memory by recomputing activations during backward:

distributed:
  activation_checkpointing: true

This is forwarded to the strategy config for non-EP models, or read from MeshContext.activation_checkpointing for EP models.

Gradient Sync Deferral

FSDP2 defers gradient sync to the final micro-batch by default for communication overlap:

distributed:
  defer_fsdp_grad_sync: true   # default

Mixed Precision

FSDP2Config defaults to bfloat16 for all three precision knobs via MixedPrecisionPolicy(param_dtype=bf16, reduce_dtype=bf16, output_dtype=bf16, cast_forward_inputs=True). Override via the Python API:

from torch.distributed.fsdp import MixedPrecisionPolicy
config = FSDP2Config(
    mp_policy=MixedPrecisionPolicy(param_dtype=torch.float16, reduce_dtype=torch.float32),
)

Pipeline Parallelism

Requirements

Model class must define _pp_plan (a dict mapping module FQNs to stages).
pp_size > 1 in the distributed section.
A pipeline sub-config with schedule and microbatch size.

Supported schedules

Defined in PipelineConfig.pp_schedule:

1f1b (one-forward-one-backward, default)
gpipe
interleaved_1f1b / interleaved1f1b
looped_bfs
dfs
v_schedule
zero_bubble

Example (8B model on 8 GPUs, PP=2 + DP=4)

distributed:
  strategy: fsdp2
  pp_size: 2

  pipeline:
    pp_schedule: interleaved1f1b
    pp_microbatch_size: 4
    scale_grads_in_schedule: false

checkpoint:
  model_save_format: safetensors
  save_consolidated: true

How it works

AutoPipeline.build() calls pipeline_model() which splits the model into stages using the model's _pp_plan, creates PipelineStage objects, and builds the schedule. During training, schedule.step() drives forward and backward through the pipeline.

Context Parallelism

Use CP for long sequences (8K+). CP shards Q/K/V on the sequence dimension as DTensors.

Config

distributed:
  strategy: fsdp2
  cp_size: 2   # or 4, 8

Requirements

SDPA (Flash Attention or Efficient Attention backend) or Transformer Engine attention. SDPBackend.MATH is not compatible with DTensor.
Attention masks are automatically stripped; is_causal=True is set via forward pre-hooks registered by attach_context_parallel_hooks().

How it works

After model sharding, apply_model_infrastructure() calls attach_context_parallel_hooks() on each model part (for non-TE models).
At each training step, make_cp_batch_and_ctx() creates a CP context manager that shards the batch along the sequence dimension and sets up context_parallel() from torch.distributed.tensor.experimental.
For TE attention models, make_cp_batch_for_te() uses THD format and TE's thd_get_partitioned_indices for sharding.

CP with Sequence Packing

CP works with packed sequences. The packed_sequence_size must be divisible by cp_size. When using TE, chunks are sharded per-chunk via _shard_thd_chunk_for_te().

Sequence Packing

Packing multiple sequences into a single training sample for efficiency.

Config

packed_sequence:
  packed_sequence_size: 4096   # 0 = disabled

step_scheduler:
  local_batch_size: 1          # must be 1 for packed sequences

When packed_sequence_size > 0, the dataset collator packs sequences up to that length. local_batch_size must be 1 because each "sample" is already a packed batch.

CP compatibility

When using CP with packed sequences, packed_sequence_size must be evenly divisible by cp_size.

MoE Distributed Training

Expert Parallelism

Set ep_size > 1 to distribute experts across GPUs. This creates a separate moe_mesh alongside the main device_mesh:

distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true

The moe_mesh shape is (pp_size, ep_shard_size, ep_size) with dimension names ("pp", "ep_shard", "ep").

Constraint: dp_cp_size (= dp_size * cp_size) must be divisible by ep_size.

MoE sub-config

distributed:
  strategy: fsdp2
  ep_size: 8
  activation_checkpointing: true

  moe:
    reshard_after_forward: false
    ignore_router_for_ac: false
    wrap_outer_model: true

The moe sub-section maps to MoEParallelizerConfig and is only instantiated when ep_size > 1.

Full MoE example (Qwen3-30B-A3B on 8 GPUs)

distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1
  pp_size: 1
  ep_size: 8
  sequence_parallel: false
  activation_checkpointing: true

MegatronFSDP limitations

Despite its name, megatron_fsdp does not support expert parallelism (ep_size > 1), pipeline parallelism (pp_size > 1), or sequence_parallel. Use fsdp2 for these features.

Parallelism Sizing Guidelines

Dense models

Model size	TP	PP	CP	Strategy
< 3B	1	1	1	FSDP2 (DP only)
3-13B	2-4	1	1	FSDP2 + TP
13-70B	4-8	2-4	1	FSDP2 + TP + PP
70B+	8	4-8	1	FSDP2 + TP + PP
Any + long seq (8K+)	as above	as above	2-8	add CP

MoE models

MoE models need less TP than dense models of similar total parameter count because only a fraction of parameters are active per token. EP is the primary scaling dimension:

Model	TP	PP	EP	Notes
Small MoE (<10B total)	1	1	8	EP only
Medium MoE (10-30B total)	1-2	1	8	small TP for shared layers
Large MoE (100B+ total)	1-2	4+	8-64	PP for depth, EP for experts

Hardware topology rules

TP must stay within a single NVLink domain (one node, typically 8 GPUs).
Use PP or DP for cross-node scaling.
TP across InfiniBand degrades throughput severely.

Programmatic API (from_pretrained / from_config)

When not using YAML recipes, configure distributed training via Python:

from nemo_automodel.components.distributed.config import FSDP2Config
from nemo_automodel.components.distributed.device_mesh import create_device_mesh
from nemo_automodel.components.distributed.mesh import MeshContext
from nemo_automodel._transformers.infrastructure import instantiate_infrastructure

# 1. Create strategy config
config = FSDP2Config(sequence_parallel=True, activation_checkpointing=True)

# 2. Create device mesh
device_mesh, moe_mesh = create_device_mesh(
    config, tp_size=2, pp_size=1, cp_size=1, ep_size=1, world_size=8,
)

# 3. Build MeshContext
mesh = MeshContext.from_meshes(
    device_mesh, moe_mesh, strategy_config=config, activation_checkpointing=True,
)

# 4. Instantiate infrastructure
model_wrapper, autopipeline, parallelize_fn, qat_quantizer = instantiate_infrastructure(
    distributed_config=config, mesh=mesh,
)

Or pass directly to from_pretrained:

model = NeMoAutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    distributed_config=FSDP2Config(activation_checkpointing=True),
    tp_size=2,
)

Code Anchors

Strategy config dataclasses:

components/distributed/config.py
    FSDP2Config       -- sequence_parallel, tp_plan, mp_policy, offload_policy,
                         activation_checkpointing, defer_fsdp_grad_sync
    MegatronFSDPConfig -- zero_dp_strategy, overlap_grad_reduce, overlap_param_gather, etc.
    DDPConfig          -- activation_checkpointing only

MeshContext (single source of truth for parallelism):

components/distributed/mesh.py
    MeshContext  -- strategy_config, device_mesh, moe_mesh, pipeline_config, moe_config
                    Properties: tp_size, pp_size, cp_size, ep_size, dp_size, dp_replicate_size
    STRATEGY_MAP -- {"fsdp2": FSDP2Config, "megatron_fsdp": MegatronFSDPConfig, "ddp": DDPConfig}
    MeshAxisName -- PP, DP, DP_REPLICATE, DP_SHARD, DP_SHARD_CP, DP_CP, CP, TP, EP, EP_SHARD

Device mesh creation:

components/distributed/device_mesh.py
    create_device_mesh()          -- routes to FSDP2/MegatronFSDP/DDP mesh creation
    _create_fsdp2_device_mesh()   -- shape (pp, dp_replicate, dp_shard, cp, tp) + flattened submeshes
    _create_moe_mesh()            -- shape (pp, ep_shard, ep)

Distributed managers:

components/distributed/fsdp2.py          -- FSDP2Manager.parallelize()
components/distributed/megatron_fsdp.py  -- MegatronFSDPManager.parallelize()
components/distributed/ddp.py            -- DDPManager

Pipeline parallelism:

components/distributed/pipelining/config.py        -- PipelineConfig dataclass
components/distributed/pipelining/autopipeline.py  -- AutoPipeline orchestrator
components/distributed/pipelining/functional.py    -- pipeline_model(), schedule creation
components/distributed/pipelining/hf_utils.py      -- HF model validation for PP

Context parallelism:

components/distributed/cp_utils.py
    make_cp_batch_and_ctx()            -- creates CP context manager + shards batch
    create_context_parallel_ctx()      -- wraps torch.distributed.tensor.experimental.context_parallel
    attach_context_parallel_hooks()    -- strips attention_mask, sets is_causal=True
    make_cp_batch_for_te()             -- TE-specific CP batch sharding (THD format)

Infrastructure orchestration:

_transformers/infrastructure.py
    instantiate_infrastructure()    -- config objects -> runtime objects
    apply_model_infrastructure()    -- applies sharding, PEFT, checkpoints to model
    _shard_pp()                     -- pipeline parallel path
    _shard_ep_fsdp()                -- EP + FSDP path (non-PP)

YAML parsing:

recipes/_dist_setup.py
    parse_distributed_section()  -- YAML dict -> typed configs + sizes
    setup_distributed()          -- full entry-point: parse + create meshes + MeshContext

MoE config:

components/moe/config.py
    MoEParallelizerConfig  -- reshard_after_forward, ignore_router_for_ac, wrap_outer_model, etc.
    MoEConfig              -- n_routed_experts, n_activated_experts, score_func, etc.

Pitfalls

TP across nodes destroys throughput. Always keep TP within a single NVLink domain. Use PP or DP for cross-node scaling.
PP requires _pp_plan on the model class. Not all HF models have this. Check validate_hf_model_for_pipeline_support() before enabling PP.
PP bubbles reduce GPU utilization. Use interleaved schedules (interleaved_1f1b) and smaller microbatches to reduce bubble time.
FSDP2 requires DTensor-aware state dict saving. Use safetensors with save_consolidated: true for checkpoint compatibility.
CP requires compatible attention. SDPA (Flash Attention or Efficient Attention) or TE attention only. SDPBackend.MATH is not compatible with DTensor.
MoE EP size must evenly divide dp_size * cp_size. The device mesh creation asserts dp_cp_size % ep_size == 0.
MegatronFSDP is more limited than FSDP2. It does not support PP (pp_size > 1), EP (ep_size > 1), or sequence_parallel. The MeshContext validation raises on these combinations.
DDP supports nothing beyond data parallelism. No TP, PP, CP, EP, or HSDP. Validation raises on any of these.
Activation checkpointing increases compute. It saves memory by recomputing activations during backward, but adds ~30% compute overhead.
Mixed precision policy must match model expectations. The default bfloat16 policy works for most models. FP16 models may need a custom MixedPrecisionPolicy.
packed_sequence_size must be divisible by cp_size when using CP with packed sequences.
dp_replicate_size is FSDP2-only. Passing it with megatron_fsdp or ddp raises a ValueError.

Verification

Basic FSDP2 (2 GPUs)

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc-per-node=2 \
  recipes/llm_finetune/finetune.py \
  --config examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml

Pipeline parallelism (4 GPUs, PP=2)

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc-per-node=4 \
  recipes/llm_finetune/finetune.py \
  --config examples/llm_finetune/llama3_1/llama3_1_8b_hellaswag_pp.yaml

MegatronFSDP (2 GPUs)

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc-per-node=2 \
  recipes/llm_finetune/finetune.py \
  --config examples/llm_finetune/llama3_2/llama3_2_1b_squad_megatron_fsdp.yaml

MoE with EP (8 GPUs)

torchrun --nproc-per-node=8 \
  recipes/llm_finetune/finetune.py \
  --config examples/llm_finetune/qwen/qwen3_moe_30b_te_deepep.yaml

Success criteria:

Exit code 0
Finite loss values in logs (not NaN or Inf)
No NCCL timeout errors
Correct parallelism layout printed in logs (TP/PP/DP/EP sizes match config)

distributed-training

Mehr aus diesem Repository

Mehr aus diesem Repository

Distributed Training in NeMo AutoModel

Strategy Selection

YAML Config Structure

Infrastructure Flow

FSDP2 Configuration

Basic FSDP2 (data parallelism only)

FSDP2 with Tensor Parallelism

FSDP2 with Pipeline Parallelism

FSDP2 with HSDP (Hybrid Sharded Data Parallel)

Activation Checkpointing

Gradient Sync Deferral

Mixed Precision

Pipeline Parallelism

Requirements

Supported schedules

Example (8B model on 8 GPUs, PP=2 + DP=4)

How it works

Context Parallelism

Config

Requirements

How it works

CP with Sequence Packing

Sequence Packing

Config

CP compatibility

MoE Distributed Training

Expert Parallelism

MoE sub-config

Full MoE example (Qwen3-30B-A3B on 8 GPUs)

MegatronFSDP limitations

Parallelism Sizing Guidelines

Dense models

MoE models

Hardware topology rules

Programmatic API (from_pretrained / from_config)

Code Anchors

Pitfalls

Verification

Basic FSDP2 (2 GPUs)

Pipeline parallelism (4 GPUs, PP=2)

MegatronFSDP (2 GPUs)

MoE with EP (8 GPUs)

Distributed Training in NeMo AutoModel

Strategy Selection

YAML Config Structure

Infrastructure Flow

FSDP2 Configuration

Basic FSDP2 (data parallelism only)

FSDP2 with Tensor Parallelism

FSDP2 with Pipeline Parallelism

FSDP2 with HSDP (Hybrid Sharded Data Parallel)

Activation Checkpointing

Gradient Sync Deferral

Mixed Precision

Pipeline Parallelism

Requirements

Supported schedules

Example (8B model on 8 GPUs, PP=2 + DP=4)

How it works

Context Parallelism

Config

Requirements

How it works

CP with Sequence Packing

Sequence Packing

Config

CP compatibility

MoE Distributed Training

Expert Parallelism

MoE sub-config

Full MoE example (Qwen3-30B-A3B on 8 GPUs)

MegatronFSDP limitations

Parallelism Sizing Guidelines

Dense models

MoE models

Hardware topology rules

Programmatic API (from_pretrained / from_config)