perf-sequence-packing
// Validate and use packed sequences and long-context training in Megatron-Bridge, distinguishing offline packed SFT for LLMs from in-batch packing for VLMs, and applying the right CP constraints.
| name | perf-sequence-packing |
| description | Validate and use packed sequences and long-context training in Megatron-Bridge, distinguishing offline packed SFT for LLMs from in-batch packing for VLMs, and applying the right CP constraints. |
| when_to_use | Enabling sequence packing or long-context SFT, or investigating a commit that broke sequence packing or changed packing behavior; 'packed sequences', 'sequence packing', 'PackedSequenceSpecs', 'pack_sequences_in_batch', 'CP with packing'. |
For stable background and recommendation level, see:
Offline packed SFT for LLM finetuning:
from megatron.bridge.data.datasets.packed_sequence import PackedSequenceSpecs
cfg.train.micro_batch_size = 1
cfg.dataset.seq_length = 4096
cfg.model.seq_length = 4096
cfg.dataset.dataset_kwargs = {"pad_to_max_length": True}
cfg.dataset.packed_sequence_specs = PackedSequenceSpecs(
    packed_sequence_size=4096,
    pad_seq_to_mult=1,
)
If CP is enabled:
cfg.model.context_parallel_size = 2
cfg.model.calculate_per_token_loss = True
cfg.ddp.average_in_collective = False
cfg.dataset.packed_sequence_specs.pad_seq_to_mult = cfg.model.context_parallel_size * 2
# If sequence_parallel is also enabled, use lcm(2*CP, CP*TP),
# where CP and TP are the context-parallel and tensor-parallel sizes:
# import math
# cfg.dataset.packed_sequence_specs.pad_seq_to_mult = math.lcm(2 * CP, CP * TP)
# See src/megatron/bridge/training/vlm_step.py for reference logic.
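The padding rule above can be collected into one small function. This is a minimal sketch, not Megatron-Bridge code: `packed_pad_multiple` is a hypothetical helper, and it covers only the cases the text states (no CP, CP alone, and CP combined with sequence parallelism).

```python
import math


def packed_pad_multiple(cp_size: int, tp_size: int = 1, sequence_parallel: bool = False) -> int:
    """Hypothetical helper: pick a value for pad_seq_to_mult.

    Mirrors the rules stated in the text only:
      - no CP                      -> 1
      - CP enabled                 -> 2 * CP
      - CP and sequence_parallel   -> lcm(2 * CP, CP * TP)
    Sequence parallelism without CP is not covered by the text, so this
    sketch does not special-case it.
    """
    if cp_size < 1 or tp_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    if cp_size == 1:
        return 1
    if sequence_parallel:
        return math.lcm(2 * cp_size, cp_size * tp_size)
    return 2 * cp_size
```

Under these assumptions, the result would be assigned to `cfg.dataset.packed_sequence_specs.pad_seq_to_mult`, for example `packed_pad_multiple(2)` for the CP=2 configuration shown above.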
If CUDA graphs are enabled for this packed path:
cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True
cfg.dataset.dataset_kwargs["pad_to_max_length"] = True
Note: pad_cu_seqlens = True also requires a metadata JSON file alongside
the packed dataset (asserted in src/megatron/bridge/data/datasets/sft.py).
Custom packed datasets that omit the metadata file will hit an assertion at
dataset initialization.
In-batch packing for VLM finetuning:
cfg.dataset.pack_sequences_in_batch = True
cfg.train.micro_batch_size = 2
Long-context baseline:
cfg.model.seq_length = 16384
cfg.dataset.seq_length = 16384
cfg.model.context_parallel_size = 2
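As a quick sanity check on the long-context baseline, the sketch below computes how many tokens each CP rank would hold. It assumes CP splits the sequence dimension evenly across ranks and reuses the 2 * CP divisibility rule from the Bridge validation excerpt; `tokens_per_cp_rank` is a hypothetical name, not a Bridge API.

```python
def tokens_per_cp_rank(seq_length: int, cp_size: int) -> int:
    """Hypothetical helper: tokens held by each context-parallel rank.

    Assumes an even split of the sequence across CP ranks and enforces
    the seq_length % (2 * CP) == 0 rule when CP is enabled.
    """
    if cp_size < 1:
        raise ValueError("cp_size must be a positive integer")
    if cp_size > 1 and seq_length % (2 * cp_size) != 0:
        raise ValueError(
            f"seq_length={seq_length} is not divisible by 2 * cp_size={2 * cp_size}"
        )
    return seq_length // cp_size


# Baseline from the text: 16384 tokens with CP=2.
baseline_tokens = tokens_per_cp_rank(16384, 2)
```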
LLM packed SFT config surface:
if packed_sequence:
    dataset_kwargs = {"pad_to_max_length": True}
    packed_sequence_specs = PackedSequenceSpecs(packed_sequence_size=seq_length, pad_seq_to_mult=pad_seq_to_mult)
else:
    dataset_kwargs = {}
    packed_sequence_specs = None
Bridge validation:
if self.model.context_parallel_size > 1:
    assert self.model.seq_length % (self.model.context_parallel_size * 2) == 0, ...
    if isinstance(self.dataset, FinetuningDatasetConfig):
        assert self.model.calculate_per_token_loss, ...
        assert not self.ddp.average_in_collective, ...
...
if ... packed_sequence_size > 0 and self.train.micro_batch_size > 1:
    raise ValueError(...)
...
if getattr(self.dataset, "pack_sequences_in_batch", False) and self.train.micro_batch_size == 1:
    raise ValueError(...)
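The checks in the excerpt above can be exercised without a full Bridge config. The following is a minimal standalone sketch: `PackingCheckInput` and `check_packing_config` are hypothetical names, the fields are plain stand-ins for the real config attributes, and the nesting of the finetuning checks under the CP branch is an assumption based on the excerpt. It collects messages instead of raising so that every violation is visible at once.

```python
from dataclasses import dataclass


@dataclass
class PackingCheckInput:
    """Hypothetical flat stand-in for the config fields used in the excerpt."""
    seq_length: int
    context_parallel_size: int = 1
    micro_batch_size: int = 1
    is_finetuning: bool = False
    calculate_per_token_loss: bool = False
    average_in_collective: bool = True
    packed_sequence_size: int = 0
    pack_sequences_in_batch: bool = False


def check_packing_config(c: PackingCheckInput) -> list:
    """Return a list of violated constraints (empty means the config passes)."""
    errors = []
    if c.context_parallel_size > 1:
        if c.seq_length % (c.context_parallel_size * 2) != 0:
            errors.append("seq_length must be divisible by 2 * context_parallel_size")
        if c.is_finetuning:
            if not c.calculate_per_token_loss:
                errors.append("CP finetuning requires calculate_per_token_loss=True")
            if c.average_in_collective:
                errors.append("CP finetuning requires average_in_collective=False")
    # Offline packed SFT: the excerpt rejects micro_batch_size > 1.
    if c.packed_sequence_size > 0 and c.micro_batch_size > 1:
        errors.append("offline packed sequences require micro_batch_size == 1")
    # In-batch packing: the excerpt rejects micro_batch_size == 1.
    if c.pack_sequences_in_batch and c.micro_batch_size == 1:
        errors.append("pack_sequences_in_batch requires micro_batch_size > 1")
    return errors
```

Note the opposite micro-batch requirements: the offline packed path needs `micro_batch_size == 1`, while in-batch packing needs more than one sample per micro-batch to have anything to pack.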
VLM in-batch runtime:
if enable_packing:
    ...
    ) = pack_batch_sequences(
        ...
        pad_token_id=0,
        pad_to_multiple_of=cp_size * 2 if cp_size > 1 else 1,
    )
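To illustrate what `pad_to_multiple_of` does to a packed batch, here is a minimal pure-Python sketch. It is not the real `pack_batch_sequences`: `pack_in_batch` is a hypothetical function, and it assumes each sequence is padded individually to the multiple before being concatenated, with `cu_seqlens` recording the padded boundaries.

```python
def pack_in_batch(sequences, pad_token_id=0, pad_to_multiple_of=1):
    """Hypothetical sketch of in-batch packing.

    Concatenates token lists into one packed list. Each sequence is padded
    (assumption) to a multiple of pad_to_multiple_of, and cu_seqlens holds
    the cumulative padded lengths, starting at 0.
    """
    if pad_to_multiple_of < 1:
        raise ValueError("pad_to_multiple_of must be >= 1")
    packed = []
    cu_seqlens = [0]
    for seq in sequences:
        # Round the length up to the next multiple (ceiling division).
        padded_len = -(-len(seq) // pad_to_multiple_of) * pad_to_multiple_of
        packed.extend(seq)
        packed.extend([pad_token_id] * (padded_len - len(seq)))
        cu_seqlens.append(cu_seqlens[-1] + padded_len)
    return packed, cu_seqlens


# With cp_size = 2 the excerpt passes pad_to_multiple_of = cp_size * 2 = 4.
packed, cu_seqlens = pack_in_batch([[1, 2, 3], [4, 5, 6, 7, 8]], pad_to_multiple_of=4)
```

In this sketch every boundary in `cu_seqlens` is a multiple of 4, so each sub-sequence also satisfies the 2 * CP divisibility rule for CP=2.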
Packed THD runtime constraint:
if cu_seqlens.dim() > 1 and cu_seqlens.size(0) != 1:
    raise ValueError("Packed THD batches expect micro-batch size 1 for context-parallel slicing (THD layout)")
- With CP enabled, seq_length must satisfy 2 * context_parallel_size divisibility.
- calculate_per_token_loss=True and ddp.average_in_collective=False are required.
- pad_cu_seqlens=True also requires pad_to_max_length=True.
- Qwen3-Next, GLM-4.5, and Qwen3.5-VL contain explicit opt-outs in different paths.

Use the checked-in unit coverage:
uv run python -m pytest tests/unit_tests/training/utils/test_packed_seq_utils.py -v && \
uv run python -m pytest tests/unit_tests/training/test_config.py -k "packed_sequence or pack_sequences_in_batch or context_parallel_seq_length_divisibility or context_parallel_finetuning_validations" -v && \
uv run python -m pytest tests/unit_tests/training/test_vlm_step.py -k "enable_packing" -v
Success criteria:
8 passed
14 passed
2 passed