Name: Perf Sequence Packing
Author: NVIDIA-NeMo

// Validate and use packed sequences and long-context training in Megatron-Bridge, distinguishing offline packed SFT for LLMs from in-batch packing for VLMs, and applying the right CP constraints.

stars:653

forks:324

updated: 22 May 2026, 04:38

name	perf-sequence-packing
description	Validate and use packed sequences and long-context training in Megatron-Bridge, distinguishing offline packed SFT for LLMs from in-batch packing for VLMs, and applying the right CP constraints.
when_to_use	Enabling sequence packing or long-context SFT, or investigating a commit that broke sequence packing or changed packing behavior; 'packed sequences', 'sequence packing', 'PackedSequenceSpecs', 'pack_sequences_in_batch', 'CP with packing'.

Sequence Packing Skill

For stable background and recommendation level, see:

@docs/training/packed-sequences.md
@skills/perf-sequence-packing/card.yaml

Enablement

Offline packed SFT for LLM finetuning:

from megatron.bridge.data.datasets.packed_sequence import PackedSequenceSpecs

cfg.train.micro_batch_size = 1
cfg.dataset.seq_length = 4096
cfg.model.seq_length = 4096
cfg.dataset.dataset_kwargs = {"pad_to_max_length": True}
cfg.dataset.packed_sequence_specs = PackedSequenceSpecs(
    packed_sequence_size=4096,
    pad_seq_to_mult=1,
)

If context parallelism (CP) is enabled:

cfg.model.context_parallel_size = 2
cfg.model.calculate_per_token_loss = True
cfg.ddp.average_in_collective = False
cfg.dataset.packed_sequence_specs.pad_seq_to_mult = cfg.model.context_parallel_size * 2

# If sequence_parallel is also enabled, use lcm(2 * CP, CP * TP), where CP is
# the context-parallel size and TP is the tensor-parallel size:
# import math
# cfg.dataset.packed_sequence_specs.pad_seq_to_mult = math.lcm(2 * CP, CP * TP)
# See src/megatron/bridge/training/vlm_step.py for reference logic.
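
The padding rule above can be captured in a small helper. This is a minimal sketch, not Megatron-Bridge code: `packed_pad_multiple` and its parameter names are hypothetical, and it only covers the CP > 1 cases stated in this skill.

```python
import math


def packed_pad_multiple(cp_size: int, tp_size: int = 1, sequence_parallel: bool = False) -> int:
    """Return a candidate pad_seq_to_mult for offline packed SFT with CP enabled.

    Hypothetical helper: it encodes only the two rules quoted in this skill.
    """
    if cp_size < 2:
        raise ValueError("this sketch only covers context_parallel_size > 1")
    base = 2 * cp_size  # CP alone: pad to a multiple of 2 * CP
    if sequence_parallel:
        # CP plus sequence parallelism: lcm(2 * CP, CP * TP)
        return math.lcm(base, cp_size * tp_size)
    return base


print(packed_pad_multiple(2))                                     # 4
print(packed_pad_multiple(2, tp_size=4, sequence_parallel=True))  # 8
print(packed_pad_multiple(4, tp_size=3, sequence_parallel=True))  # 24
```

`math.lcm` requires Python 3.9 or newer. Treat the result as a value to cross-check against the reference logic in vlm_step.py rather than as an authoritative answer.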

If CUDA graphs are enabled for this packed path:

cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True
cfg.dataset.dataset_kwargs["pad_to_max_length"] = True

Note: pad_cu_seqlens = True also requires a metadata JSON file alongside the packed dataset (asserted in src/megatron/bridge/data/datasets/sft.py). Custom packed datasets that omit the metadata file will hit an assertion at dataset initialization.

In-batch packing for VLM finetuning:

cfg.dataset.pack_sequences_in_batch = True
cfg.train.micro_batch_size = 2

Long-context baseline:

cfg.model.seq_length = 16384
cfg.dataset.seq_length = 16384
cfg.model.context_parallel_size = 2

Code Anchors

LLM packed SFT config surface:

if packed_sequence:
    dataset_kwargs = {"pad_to_max_length": True}
    packed_sequence_specs = PackedSequenceSpecs(packed_sequence_size=seq_length, pad_seq_to_mult=pad_seq_to_mult)
else:
    dataset_kwargs = {}
    packed_sequence_specs = None

Bridge validation:

if self.model.context_parallel_size > 1:
    assert self.model.seq_length % (self.model.context_parallel_size * 2) == 0, ...
    if isinstance(self.dataset, FinetuningDatasetConfig):
        assert self.model.calculate_per_token_loss, ...
        assert not self.ddp.average_in_collective, ...
...
if ... packed_sequence_size > 0 and self.train.micro_batch_size > 1:
    raise ValueError(...)
...
if getattr(self.dataset, "pack_sequences_in_batch", False) and self.train.micro_batch_size == 1:
    raise ValueError(...)

VLM in-batch runtime:

if enable_packing:
    ...
    ) = pack_batch_sequences(
        ...
        pad_token_id=0,
        pad_to_multiple_of=cp_size * 2 if cp_size > 1 else 1,
    )
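
To make the `pad_to_multiple_of` expression concrete, the following sketch rounds a length up to the multiple that the call above would request. `padded_length` is a hypothetical helper and does not reproduce `pack_batch_sequences`; the skill does not say whether the real function pads each sequence or the packed total, so the sketch just takes a single length.

```python
def padded_length(length: int, cp_size: int) -> int:
    """Round `length` up to the multiple implied by the CP size."""
    # Same expression as the call site: 2 * CP when CP is enabled, else no padding.
    multiple = cp_size * 2 if cp_size > 1 else 1
    # Ceiling division, then scale back up to the next multiple.
    return -(-length // multiple) * multiple


print(padded_length(1001, cp_size=1))  # 1001
print(padded_length(1001, cp_size=2))  # 1004
print(padded_length(1000, cp_size=4))  # 1000
```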

Packed THD runtime constraint:

if cu_seqlens.dim() > 1 and cu_seqlens.size(0) != 1:
    raise ValueError("Packed THD batches expect micro-batch size 1 for context-parallel slicing (THD layout)")
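
The check above inspects the shape of `cu_seqlens`. As background, here is a sketch of what cumulative sequence lengths conventionally encode in a packed layout. It assumes the common convention of a leading zero followed by running totals; `build_cu_seqlens` is a hypothetical pure-Python helper, not the Megatron-Bridge implementation.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens: list[int]) -> list[int]:
    """Return cumulative boundaries for sequences packed back to back."""
    return list(accumulate(seq_lens, initial=0))


cu = build_cu_seqlens([5, 3, 8])
print(cu)  # [0, 5, 8, 16]

# Sequence i occupies tokens cu[i]:cu[i + 1] of the packed buffer.
spans = list(zip(cu[:-1], cu[1:]))
print(spans)  # [(0, 5), (5, 8), (8, 16)]
```

Under this convention a single packed micro-batch needs only one such boundary vector, which is consistent with the micro-batch size 1 requirement in the error message.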

Pitfalls

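Offline packed SFT and VLM in-batch packing are different features with opposite micro-batch rules: offline packed SFT requires micro_batch_size == 1, while in-batch packing requires micro_batch_size > 1.
When CP is enabled, packed sequence lengths must be divisible by 2 * context_parallel_size.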
For finetuning with CP, calculate_per_token_loss=True and ddp.average_in_collective=False are required.
pad_cu_seqlens=True also requires pad_to_max_length=True.
Packing support is model-family-specific. Qwen3-Next, GLM-4.5, and Qwen3.5-VL contain explicit opt-outs in different paths.
MTP finetuning is documented as incompatible with packed sequences.
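
The configuration-level pitfalls above can be checked before launching a job. The sketch below is a hypothetical pre-flight check, not the Bridge validator: `PackingSettings`, its field names, and its defaults are invented for illustration, and it assumes a finetuning dataset (the case in which the per-token-loss rules apply).

```python
from dataclasses import dataclass


@dataclass
class PackingSettings:
    """Hypothetical flat view of the settings this skill discusses."""
    seq_length: int
    micro_batch_size: int
    context_parallel_size: int = 1
    offline_packed: bool = False           # packed_sequence_size > 0
    pack_sequences_in_batch: bool = False  # VLM in-batch packing
    calculate_per_token_loss: bool = False
    average_in_collective: bool = True     # arbitrary default for this sketch
    pad_cu_seqlens: bool = False
    pad_to_max_length: bool = False


def find_packing_problems(s: PackingSettings) -> list[str]:
    """Return one message per rule from the pitfalls list that is violated."""
    problems = []
    if s.offline_packed and s.micro_batch_size > 1:
        problems.append("offline packed SFT needs micro_batch_size == 1")
    if s.pack_sequences_in_batch and s.micro_batch_size == 1:
        problems.append("in-batch packing needs micro_batch_size > 1")
    if s.context_parallel_size > 1:
        if s.seq_length % (2 * s.context_parallel_size) != 0:
            problems.append("seq_length must be divisible by 2 * context_parallel_size")
        if not s.calculate_per_token_loss:
            problems.append("CP finetuning needs calculate_per_token_loss=True")
        if s.average_in_collective:
            problems.append("CP finetuning needs ddp.average_in_collective=False")
    if s.pad_cu_seqlens and not s.pad_to_max_length:
        problems.append("pad_cu_seqlens=True needs pad_to_max_length=True")
    return problems


good = PackingSettings(
    seq_length=4096, micro_batch_size=1, context_parallel_size=2,
    offline_packed=True, calculate_per_token_loss=True, average_in_collective=False,
)
print(find_packing_problems(good))  # []

bad = PackingSettings(seq_length=4096, micro_batch_size=1, pack_sequences_in_batch=True)
print(find_packing_problems(bad))   # ['in-batch packing needs micro_batch_size > 1']
```

The model-family opt-outs and the MTP incompatibility are not covered here, since they depend on the model rather than on these settings.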

Verification

Use the checked-in unit coverage:

uv run python -m pytest tests/unit_tests/training/utils/test_packed_seq_utils.py -v && \
uv run python -m pytest tests/unit_tests/training/test_config.py -k "packed_sequence or pack_sequences_in_batch or context_parallel_seq_length_divisibility or context_parallel_finetuning_validations" -v && \
uv run python -m pytest tests/unit_tests/training/test_vlm_step.py -k "enable_packing" -v

Success criteria:

first command reports 8 passed
second command reports 14 passed
third command reports 2 passed

related-skills.json

same repository

adding-model-support.md

from "NVIDIA-NeMo/Megatron-Bridge"

Guide for adding support for new LLM or VLM models in Megatron-Bridge. Covers bridge, provider, recipe, tests, docs, and examples.

2026-05-22, 653 stars

perf-hierarchical-context-parallel.md

from "NVIDIA-NeMo/Megatron-Bridge"

Operational guide for enabling hierarchical context parallelism in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.

2026-05-20, 653 stars

perf-parallelism-strategies.md

from "NVIDIA-NeMo/Megatron-Bridge"

Operational guide for choosing and combining parallelism strategies in Megatron Bridge, including sizing rules, hardware topology mapping, and combined parallelism configuration.

2026-05-20, 653 stars

resiliency.md

from "NVIDIA-NeMo/Megatron-Bridge"

Resiliency features in Megatron Bridge including fault tolerance, straggler detection, in-process restart, preemption, and re-run state machine.

2026-05-20, 653 stars

multi-node-slurm.md

from "NVIDIA-NeMo/Megatron-Bridge"

Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation.

2026-05-18, 653 stars

nemo-rl-e2e-testing.md

from "NVIDIA-NeMo/Megatron-Bridge"

External NeMo-RL end-to-end validation workflow for Megatron-Bridge model/provider changes, including downstream compatibility checks, external RL lifecycle behavior, Megatron policy setup, HF import/export, checkpoint/resume, non-colocated vLLM refit, delta weight transfer, optional LoRA/generation variants, and questions such as "does this model work in NeMo-RL", "run NeMo-RL e2e", or "external RL loop validation". Covers running NeMo-RL Megatron policy jobs from a Bridge checkout, choosing GRPO/SFT/checkpoint/non-colocated refit variants, setting PYTHONPATH so NeMo-RL imports the local Bridge tree, and reporting pass/fail evidence.

2026-05-18, 653 stars

package.json

"author": "NVIDIA-NeMo"

"repository": "NVIDIA-NeMo/Megatron-Bridge"
