perf-hierarchical-context-parallel

Name: Perf Hierarchical Context Parallel
Author: NVIDIA-NeMo

Operational guide for enabling hierarchical context parallelism in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.

stars: 653

forks: 324

updated: May 20, 2026, 21:51

SKILL.md


name	perf-hierarchical-context-parallel
description	Operational guide for enabling hierarchical context parallelism in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.
when_to_use	Scaling context parallelism beyond KV heads, or investigating a commit that changed CP config and caused OOM or a regression; 'hierarchical_context_parallel_sizes', 'a2a+p2p', 'hierarchical CP', 'CP beyond KV heads', 'multi-level CP'.

Hierarchical Context Parallel Skill

This skill covers hierarchical context parallelism: nested context-parallel process groups used by cp_comm_type="a2a+p2p" and configured with hierarchical_context_parallel_sizes.

For what hierarchical CP is, when to use it, and the decision tree (a2a+p2p vs pure a2a vs p2p), see:

@docs/training/hierarchical-context-parallel.md
@skills/perf-hierarchical-context-parallel/card.yaml

Enablement

Minimal Bridge override:

cfg.model.context_parallel_size = 4
cfg.model.cp_comm_type = "a2a+p2p"
cfg.model.hierarchical_context_parallel_sizes = [2, 2]
cfg.dist.use_decentralized_pg = False

Required constraints:

prod(hierarchical_context_parallel_sizes) == context_parallel_size
seq_length % (2 * context_parallel_size) == 0
Transformer Engine >= 1.12.0
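
The two arithmetic constraints above can be checked before a job is launched. What follows is a minimal sketch, not Bridge code: `check_hcp_constraints` is a hypothetical helper introduced here for illustration, and it does not cover the Transformer Engine version requirement, which still has to be verified in the target environment.

```python
from math import prod


def check_hcp_constraints(cp_size, hcp_sizes, seq_length):
    """Return a list of violated constraints (an empty list means OK).

    Hypothetical helper for illustration; not part of Megatron-Bridge.
    """
    problems = []
    # prod(hierarchical_context_parallel_sizes) == context_parallel_size
    if prod(hcp_sizes) != cp_size:
        problems.append(
            f"prod({hcp_sizes}) = {prod(hcp_sizes)} != context_parallel_size = {cp_size}"
        )
    # seq_length % (2 * context_parallel_size) == 0
    if seq_length % (2 * cp_size) != 0:
        problems.append(
            f"seq_length = {seq_length} is not divisible by 2 * {cp_size}"
        )
    return problems


print(check_hcp_constraints(4, [2, 2], 8192))  # prints []
print(check_hcp_constraints(4, [2, 4], 8192))  # reports the product mismatch
```

Returning a list instead of raising lets a launcher script report every violated constraint at once.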

Code Anchors

Upstream config and validation:

context_parallel_size: int = 1
"""Splits network input along sequence dimension across GPU ranks."""

hierarchical_context_parallel_sizes: Optional[list[int]] = None
"""Degrees of the hierarchical context parallelism. Users should provide a list to specify 
   the sizes for different levels. Taking the a2a+p2p cp comm type as example, it contains
   groups of two levels, so the first value of the list indicates the group size of the a2a
   communication type, and the second value indicates the group size of the p2p communication
   type.
"""

if args.hierarchical_context_parallel_sizes:
    from numpy import prod
    assert args.context_parallel_size == prod(args.hierarchical_context_parallel_sizes)
if "a2a+p2p" in args.cp_comm_type:
    assert args.hierarchical_context_parallel_sizes is not None, \
    "--hierarchical-context-parallel-sizes must be set when a2a+p2p is used in cp comm"

Bridge MPU path:

parallel_state.initialize_model_parallel(
    ...
    context_parallel_size=model_config.context_parallel_size,
    hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
    ...
)
...
return ProcessGroupCollection.use_mpu_process_groups()

Bridge decentralized-PG path:

pg_collection = ProcessGroupCollection(
    ...
    cp=cp_pg,
    tp_cp=tp_cp_pg,
    hcp=None,
    ep=ep_pg,
    ...
)

Implementation Map

Config definition

hierarchical_context_parallel_sizes is declared in ModelParallelConfig:

# 3rdparty/Megatron-LM/megatron/core/model_parallel_config.py
hierarchical_context_parallel_sizes: Optional[list[int]] = None
# For a2a+p2p, first value = a2a group size, second value = p2p group size.
# Product must equal context_parallel_size.

cp_comm_type is declared in TransformerConfig:

# 3rdparty/Megatron-LM/megatron/core/transformer/transformer_config.py
cp_comm_type: Optional[Union[str, List[str]]] = None
# Can be per-layer (List[str]) or uniform (str).
# Values: "p2p", "all_gather", "a2a", "a2a+p2p"

Validation (MCore)

TransformerConfig.__post_init__ enforces two rules: a2a+p2p requires hierarchical_context_parallel_sizes to be set, and the product of those sizes must equal context_parallel_size.
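
As a rough illustration of those checks, the sketch below mirrors the two assertions quoted under Code Anchors and handles both the uniform (str) and per-layer (list) forms of cp_comm_type. The function names are hypothetical, and raising ValueError instead of asserting is a choice made here, not the upstream behavior.

```python
from math import prod


def uses_a2a_p2p(cp_comm_type):
    """True if any layer is configured with the a2a+p2p comm type."""
    if cp_comm_type is None:
        return False
    if isinstance(cp_comm_type, str):
        return cp_comm_type == "a2a+p2p"
    return "a2a+p2p" in cp_comm_type  # per-layer list


def validate_cp_config(cp_size, cp_comm_type, hcp_sizes):
    """Hypothetical stand-in for the MCore validation; raises on bad config."""
    if hcp_sizes is not None and prod(hcp_sizes) != cp_size:
        raise ValueError(
            "prod(hierarchical_context_parallel_sizes) must equal context_parallel_size"
        )
    if uses_a2a_p2p(cp_comm_type) and hcp_sizes is None:
        raise ValueError(
            "hierarchical_context_parallel_sizes must be set when a2a+p2p is used"
        )


validate_cp_config(4, "a2a+p2p", [2, 2])         # passes
validate_cp_config(4, ["p2p", "a2a+p2p"], [2, 2])  # per-layer form, passes
validate_cp_config(4, "p2p", None)               # flat CP, passes
```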

Process group creation

When HCP sizes are provided, parallel_state.initialize_model_parallel creates the hierarchical CP sub-groups via create_hierarchical_groups. Bridge currently gets those groups through the MPU-backed ProcessGroupCollection.
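
To make the nesting concrete, the sketch below partitions the ranks of one CP group into per-level sub-groups. It assumes that the first level groups adjacent ranks and that each later level strides by the product of the earlier sizes; the actual rank ordering inside create_hierarchical_groups may differ, and `hierarchical_rank_groups` is a hypothetical name. It computes rank lists only and creates no process groups.

```python
from math import prod


def hierarchical_rank_groups(ranks, sizes):
    """Split one CP group's ranks into sub-groups, one list of groups per level.

    Assumed layout (illustrative): level i uses stride prod(sizes[:i]).
    """
    assert len(ranks) == prod(sizes), "sizes must multiply to the CP group size"
    levels = []
    for level, size in enumerate(sizes):
        stride = prod(sizes[:level])      # ranks skipped between members
        outer = prod(sizes[level + 1:])   # blocks above this level
        groups = []
        for block in range(outer):
            for offset in range(stride):
                groups.append(
                    [ranks[(block * size + i) * stride + offset] for i in range(size)]
                )
        levels.append(groups)
    return levels


a2a_groups, p2p_groups = hierarchical_rank_groups([0, 1, 2, 3], [2, 2])
print(a2a_groups)  # [[0, 1], [2, 3]]
print(p2p_groups)  # [[0, 2], [1, 3]]
```

Under this assumed layout, with sizes [2, 2] every rank belongs to exactly one a2a group and one p2p group, which is why the product of the sizes has to equal context_parallel_size.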

TE integration

TEDotProductAttention passes the hierarchical groups to Transformer Engine when a2a+p2p is used. Requires Transformer Engine >= 1.12.0.

Pitfalls

Bridge HCP is MPU-only today: If use_decentralized_pg=True, Bridge initializes flat CP groups and leaves HCP unset.
No checked-in Bridge recipe currently exercises HCP directly.
Single-GPU load helpers clear hierarchical_context_parallel_sizes.
Silent broken training on old stacks: Current MCore asserts if you use a2a+p2p without setting hierarchical_context_parallel_sizes. Older versions silently disabled CP communication instead, so each rank attended only to its local chunk. The result was artificially high throughput with broken gradients.
Product must match: prod(hierarchical_context_parallel_sizes) must exactly equal context_parallel_size. A mismatch triggers an assertion.
Verify in logs: Look for the process group initialization output. You should see HIERARCHICAL_CONTEXT_PARALLEL_GROUPS being created. If you only see CONTEXT_PARALLEL_GROUP, HCP is not active.

Verification

No dedicated Bridge end-to-end test exists yet for HCP (see @skills/perf-hierarchical-context-parallel/card.yaml follow_up_validation). Use the existing unit tests and log inspection instead.

Run the decentralized-PG unit test to confirm the flat-CP behavior is preserved:

uv run python -m pytest tests/unit_tests/training/test_decentralized_pg.py -q

For a manual smoke check, launch a 4-GPU run with a small recipe and cp_comm_type=a2a+p2p plus hierarchical_context_parallel_sizes=[2,2]:

CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \
  scripts/training/run_recipe.py \
  --recipe llama32_1b_pretrain_config \
  model.context_parallel_size=4 \
  model.cp_comm_type=a2a+p2p \
  "model.hierarchical_context_parallel_sizes=[2,2]" \
  train.train_iters=2

Success criteria:

Logs show HIERARCHICAL_CONTEXT_PARALLEL_GROUPS being created
Training completes at least one step without error
Failure signal: if the logs show only CONTEXT_PARALLEL_GROUP, HCP is not active
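
The log inspection can be scripted. The sketch below is a hypothetical helper that relies only on the two marker strings named above; the sample log lines are invented for the example and are not real Megatron output.

```python
HCP_MARKER = "HIERARCHICAL_CONTEXT_PARALLEL_GROUPS"
FLAT_MARKER = "CONTEXT_PARALLEL_GROUP"


def hcp_status(log_text):
    """Classify a training log by which CP group markers it contains."""
    # Check the HCP marker first: the flat marker is a substring of it.
    if HCP_MARKER in log_text:
        return "hcp-active"
    if FLAT_MARKER in log_text:
        return "flat-cp-only"
    return "no-cp-groups-logged"


# Invented sample lines, for illustration only.
print(hcp_status("creating HIERARCHICAL_CONTEXT_PARALLEL_GROUPS"))  # hcp-active
print(hcp_status("creating CONTEXT_PARALLEL_GROUP"))                # flat-cp-only
```

In practice the text would come from the captured stdout of the smoke run, for example by reading the log file written by the launcher.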

related-skills.json

same repository

adding-model-support.md

from "NVIDIA-NeMo/Megatron-Bridge"

Guide for adding support for new LLM or VLM models in Megatron-Bridge. Covers bridge, provider, recipe, tests, docs, and examples.

2026-05-22 · stars: 653

perf-sequence-packing.md

from "NVIDIA-NeMo/Megatron-Bridge"

Validate and use packed sequences and long-context training in Megatron-Bridge, distinguishing offline packed SFT for LLMs from in-batch packing for VLMs, and applying the right CP constraints.

2026-05-22 · stars: 653

perf-parallelism-strategies.md

from "NVIDIA-NeMo/Megatron-Bridge"

Operational guide for choosing and combining parallelism strategies in Megatron Bridge, including sizing rules, hardware topology mapping, and combined parallelism configuration.

2026-05-20 · stars: 653

resiliency.md

from "NVIDIA-NeMo/Megatron-Bridge"

Resiliency features in Megatron Bridge including fault tolerance, straggler detection, in-process restart, preemption, and re-run state machine.

2026-05-20 · stars: 653

multi-node-slurm.md

from "NVIDIA-NeMo/Megatron-Bridge"

Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation.

2026-05-18 · stars: 653

nemo-rl-e2e-testing.md

from "NVIDIA-NeMo/Megatron-Bridge"

External NeMo-RL end-to-end validation workflow for Megatron-Bridge model/provider changes, including downstream compatibility checks, external RL lifecycle behavior, Megatron policy setup, HF import/export, checkpoint/resume, non-colocated vLLM refit, delta weight transfer, optional LoRA/generation variants, and questions such as "does this model work in NeMo-RL", "run NeMo-RL e2e", or "external RL loop validation". Covers running NeMo-RL Megatron policy jobs from a Bridge checkout, choosing GRPO/SFT/checkpoint/non-colocated refit variants, setting PYTHONPATH so NeMo-RL imports the local Bridge tree, and reporting pass/fail evidence.

2026-05-18 · stars: 653

package.json

"author": "NVIDIA-NeMo"

"repository": "NVIDIA-NeMo/Megatron-Bridge"


