| name | offline-packing-env-vars |
| description | Guide to the OFFLINE_PACKING_BMR and OFFLINE_PACKED_DATA environment variables that control LLaVA-OneVision2 training-side packing: what each gate does, why both must be enabled together, the MBS=1 requirement, and the dead OFFLINE_PACKING_VQA branch |
| compatibility | opencode |
| metadata | {"domain":"training-pipeline","framework":"llava-onevision2","repo":"llava-onevision2"} |
Purpose
Use this skill when you set up or debug training-side sample packing for LLaVA-OneVision2, i.e. when you need to decide which env vars to export in a training shell script (Stage-1 / Stage-1.5 / Stage-2) and want to understand why both OFFLINE_PACKING_BMR and OFFLINE_PACKED_DATA must be 1 to actually get padding-free attention.
This skill is specifically for:
- Choosing the correct env var combination in training scripts
- Diagnosing cross-sample attention leakage in packed runs
- Understanding why cu_lengths is a dummy [[0]] in some runs and a real [B, P+1] tensor in others
- Avoiding the well-known OFFLINE_PACKING_VQA red herring (it is dead code)
Companion skill: cu-lengths-attention-flow covers the consumer side (how cu_lengths is fed into ViT/LLM attention). This skill covers the producer + gate side.
TL;DR
For packed training to work end-to-end, both env vars must be 1:
export OFFLINE_PACKING_BMR='1'
export OFFLINE_PACKED_DATA='1'
Setting only one is a silent bug. OFFLINE_PACKING_VQA is dead code; do not rely on it.
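The rule above can be enforced before launch with a small preflight check. The following is a minimal sketch, not code from the repo: the function names `packing_env_status` and `assert_packing_env` are hypothetical, and the only assumption taken from this guide is that each variable counts as "on" exactly when it parses to the integer 1.

```python
import os


def packing_env_status(env=None):
    """Classify the packing env-var combination (hypothetical helper).

    Returns "packed" when both gates are on, "unpacked" when both are off,
    and "invalid" when only one is set. OFFLINE_PACKING_VQA is ignored on
    purpose, since this guide describes it as dead code.
    """
    env = os.environ if env is None else env
    bmr = int(env.get("OFFLINE_PACKING_BMR", 0)) == 1
    packed = int(env.get("OFFLINE_PACKED_DATA", 0)) == 1
    if bmr and packed:
        return "packed"
    if not bmr and not packed:
        return "unpacked"
    return "invalid"


def assert_packing_env(env=None):
    """Raise early instead of training with a half-enabled configuration."""
    status = packing_env_status(env)
    if status == "invalid":
        raise RuntimeError(
            "OFFLINE_PACKING_BMR and OFFLINE_PACKED_DATA must both be 1 "
            "or both be 0; setting only one is a silent bug."
        )
    return status
```

Calling `assert_packing_env()` at the top of a launcher script would turn the silent misconfiguration into an immediate failure.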
The Three Env Vars
| Env var | Status | Default | Read at | Effect |
|---|---|---|---|---|
| OFFLINE_PACKING_BMR | ALIVE | 0 | aiak_training_llm/data/multimodal/task_encoder.py:194 | Inside PackedCaptioningSample handling, unroll each packed entry into a MultiMixQASample (BMR-style, with full prompt/caption messages). When 0, falls through to the legacy CaptioningSample branch which loses the multi-turn structure. |
| OFFLINE_PACKED_DATA | ALIVE | 0 | aiak_training_llm/data/multimodal/task_encoder.py:363 | Inside batch(), replace dummy cu_lengths = [[0]] with the real per-sample s.cu_lengths stacked across the batch. Without this, the consumer side cannot construct PackedSeqParams. |
| OFFLINE_PACKING_VQA | DEAD | n/a | nowhere in aiak_training_llm/ | Mentioned in README + several legacy shells under examples/llava_onevision1_5/ and examples/llava_onevision2/quick_start_video_2b/, but no source file reads it. Setting it has zero runtime effect. Treat as documentation noise. |
💡 The OFFLINE_PACKING_VQA red herring is the #1 source of confusion. Newcomers see it in shell scripts and assume it controls VQA packing. It does not. There is no third packing branch in task_encoder.py, only the BMR branch and the legacy captioning fallback.
Two-Stage Gate Architecture
Packing in this codebase is split into two orthogonal gates that must both fire. Understanding this is the whole point of the skill.
Gate 1: Data Layer (OFFLINE_PACKING_BMR)
Where: aiak_training_llm/data/multimodal/task_encoder.py, inside the PackedCaptioningSample branch of the encoder dispatch (encode_sample ~line 186).
What it does:
- For each entry inside the packed sample (for idx in range(n_orig_sample):), if OFFLINE_PACKING_BMR == 1, it builds a MultiMixQASample carrying the full chat-format messages ({role: user, content: prompt}, {role: assistant, content: caption}) and routes it through encode_multi_mix_qa().
- If OFFLINE_PACKING_BMR != 1, it falls back to a plain CaptioningSample and encode_captioning(), losing the multi-turn / multi-image structure required for SFT.
- After the per-entry loop, regardless of the BMR flag, it calls self.pack_selected_samples(l_Qwen2VLImageTaskSample) (line 277), which constructs the per-sub-sample cumulative lengths cu_lengths = [0, len_1, len_1+len_2, ...] and attaches them to the resulting ImageTaskSamplePacked (line 473).
Net effect: enables the correct per-sub-sample encoding and produces real s.cu_lengths on each sample.
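To make the cumulative-length layout concrete, here is a small illustration of how offsets of the form [0, len_1, len_1+len_2, ...] arise from per-sub-sample token counts. It is a plain-Python sketch; `build_cu_lengths` is a hypothetical name and does not reproduce the repo's pack_selected_samples, which works on tensors and encoded samples.

```python
from itertools import accumulate


def build_cu_lengths(sub_sample_lengths):
    """Return cumulative offsets [0, len_1, len_1+len_2, ...] (illustrative)."""
    return [0, *accumulate(sub_sample_lengths)]


def sub_sample_spans(cu_lengths):
    """Recover the (start, end) token span of every sub-sample."""
    return list(zip(cu_lengths, cu_lengths[1:]))


cu = build_cu_lengths([5, 3, 4])
print(cu)                    # [0, 5, 8, 12]
print(sub_sample_spans(cu))  # [(0, 5), (5, 8), (8, 12)]
```

With P sub-samples the list has P+1 entries, which is where the P+1 in the [B, P+1] shape comes from.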
⚠️ Even with BMR off, pack_selected_samples still attaches a cu_lengths tensor to the sample. But the sub-samples were encoded via the wrong path (legacy captioning), so the resulting boundaries don't match what the LLM actually sees. BMR off + PACKED_DATA on is a hidden corruption, not just a missing feature.
Gate 2: Batch Layer (OFFLINE_PACKED_DATA)
Where: aiak_training_llm/data/multimodal/task_encoder.py:359-365, inside batch() (the collate function).
What it does:
cu_lengths = torch.tensor([[0]], dtype=torch.int32)
max_lengths = torch.tensor([[0]], dtype=torch.int32)
if self.is_packing_enabled or int(os.environ.get("OFFLINE_PACKED_DATA", 0)) == 1:
    cu_lengths = torch.stack([s.cu_lengths for s in samples])
    max_lengths = torch.tensor([s.max_length for s in samples], dtype=torch.int32)
- Default: emit a dummy cu_lengths of shape [1, 1] containing only [[0]].
- When OFFLINE_PACKED_DATA == 1 (or the energon online-packing flag is set): stack the real per-sample cu_lengths produced by Gate 1 into shape [B, P+1].
Net effect: decides whether the consumer (model forward) sees real packing offsets or a dummy that says "no packing".
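The gate logic can be mimicked without torch to see which shape reaches the consumer. This is a simplified sketch using nested lists in place of tensors; `collate_cu_lengths` and its parameter names are hypothetical, and `online_packing` stands in for the is_packing_enabled flag in the excerpt above.

```python
def collate_cu_lengths(per_sample_cu_lengths, packed_data, online_packing=False):
    """List-based stand-in for the cu_lengths part of batch() (illustrative).

    per_sample_cu_lengths: one cumulative-offset list per sample in the
    micro-batch. Returns the dummy [[0]] unless one of the two flags is on.
    """
    if not (packed_data or online_packing):
        return [[0]]  # dummy of shape [1, 1], signals "not packed"
    return [list(cu) for cu in per_sample_cu_lengths]  # shape [B, P+1]


def shape_of(nested):
    """Return (rows, cols) of a rectangular nested list."""
    return (len(nested), len(nested[0]))


samples = [[0, 5, 8, 12]]  # one packed sample with three sub-samples
print(shape_of(collate_cu_lengths(samples, packed_data=False)))  # (1, 1)
print(shape_of(collate_cu_lengths(samples, packed_data=True)))   # (1, 4)
```

Note that the per-sample offsets exist in both cases; the gate only decides whether they are forwarded or replaced by the dummy.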
Why both gates must fire
The consumer side at aiak_training_llm/train/pretrain/pretrain_llava_onevision2.py:153-168:
packed_seq_params = None
...
if cu_lengths.shape == torch.Size([1, 1]):
    pass
else:
    assert cu_lengths.shape[0] == 1, "micro-batch-size must be 1 for packing"
    packed_seq_params = PackedSeqParams(
        qkv_format="thd",
        cu_seqlens_q=cu_lengths[0],
        cu_seqlens_kv=cu_lengths[0],
        ...
    )
So:
| BMR | PACKED_DATA | Result |
|---|---|---|
| 0 | 0 | No packing. Each sample treated independently. Slow but correct (if data is unpacked). |
| 1 | 0 | SILENT BUG. Data is encoded as packed sub-samples (BMR), cu_lengths is built, but batch() discards it as dummy [[0]]. Consumer sees shape == [1,1] → packed_seq_params = None → flash-attn applies a single causal mask across the entire packed sequence → cross-sub-sample attention leakage. Loss looks fine; model silently learns wrong attention. |
| 0 | 1 | HIDDEN CORRUPTION. Sub-samples encoded via legacy path, boundaries in cu_lengths don't align with token sequence. Consumer applies varlen attention with wrong offsets. |
| 1 | 1 | CORRECT. BMR encodes properly, PACKED_DATA forwards the real offsets, consumer builds PackedSeqParams, flash-attn applies per-sub-sample causal mask via cu_seqlens_q/kv. |
🔥 The "BMR=1, PACKED_DATA=0" footgun is the most dangerous combination. Training does not crash. Loss curves look reasonable. But under the single causal mask, every sub-sample in a packed sequence can attend to all tokens of the sub-samples packed before it. Use this skill's TL;DR snippet to avoid it.
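The size of the leakage can be illustrated by comparing a single causal mask over the whole packed sequence with a per-sub-sample causal mask derived from cu_lengths. This is a toy, pure-Python sketch with hypothetical function names; real training relies on flash-attn's variable-length kernels rather than explicit boolean masks.

```python
def causal_mask(total):
    """mask[q][k] is True when query token q may attend to key token k."""
    return [[k <= q for k in range(total)] for q in range(total)]


def packed_causal_mask(cu_lengths):
    """Causal mask restricted to tokens of the same sub-sample."""
    total = cu_lengths[-1]
    segment = [0] * total
    for idx, (start, end) in enumerate(zip(cu_lengths, cu_lengths[1:])):
        for t in range(start, end):
            segment[t] = idx
    return [
        [k <= q and segment[k] == segment[q] for k in range(total)]
        for q in range(total)
    ]


def count_leaked_pairs(cu_lengths):
    """Count (query, key) pairs allowed by the plain mask but not the packed one."""
    plain = causal_mask(cu_lengths[-1])
    packed = packed_causal_mask(cu_lengths)
    return sum(
        p and not m
        for plain_row, packed_row in zip(plain, packed)
        for p, m in zip(plain_row, packed_row)
    )


# Two sub-samples of 2 and 3 tokens: each of the 3 later tokens can see
# the 2 earlier ones when boundaries are ignored.
print(count_leaked_pairs([0, 2, 5]))  # 6
```

A sequence holding a single sub-sample has no leaked pairs, which is why unpacked data is unaffected by the missing offsets.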
MBS=1 Hard Requirement
pretrain_llava_onevision2.py:157:
assert cu_lengths.shape[0] == 1, "micro-batch-size must be 1 for packing"
When packing is on, cu_lengths has shape [B, P+1] where B = micro_batch_size and P = number of sub-samples in a packed sequence. The current PackedSeqParams construction only handles B=1 (it indexes cu_lengths[0]). Therefore:
- --micro-batch-size 1 is mandatory for any packed training run.
- Increase throughput via --global-batch-size (gradient accumulation), pipeline parallelism, or longer --seq-length, not via MBS.
- If you forget, the assert fires immediately on the first batch.
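The consumer's shape handling can be sketched with nested lists to show all three outcomes: dummy, valid packed batch, and the MBS>1 failure. `select_cu_seqlens` is a hypothetical name; the real code builds a PackedSeqParams object from tensors instead of returning a list.

```python
def select_cu_seqlens(cu_lengths):
    """List-based stand-in for the consumer-side shape check (illustrative).

    Returns None for the dummy [[0]] (no packing), otherwise the single row
    of offsets that would be passed as cu_seqlens_q / cu_seqlens_kv.
    """
    rows, cols = len(cu_lengths), len(cu_lengths[0])
    if (rows, cols) == (1, 1):
        return None  # dummy: packed_seq_params stays None
    assert rows == 1, "micro-batch-size must be 1 for packing"
    return cu_lengths[0]


print(select_cu_seqlens([[0]]))            # None
print(select_cu_seqlens([[0, 5, 8, 12]]))  # [0, 5, 8, 12]

try:
    select_cu_seqlens([[0, 5, 8], [0, 3, 8]])  # B = 2
except AssertionError as err:
    print("rejected:", err)
```

Because only row 0 is ever read, a second row would be silently ignored without the assert, so the check protects correctness rather than being a mere style constraint.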
End-to-End Flow
┌──────────────────────────────────────────────────────────────────┐
│ Offline preprocessing (auto_pipe.sh, separate skill)             │
│ Produces WebDataset shards with PackedCaptioningSample format    │
└────────────────────────────────┬─────────────────────────────────┘
                                 │
         Energon dataloader yields PackedCaptioningSample
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│ task_encoder.encode_sample()                                     │
│   if OFFLINE_PACKING_BMR == 1:                        ◄── GATE 1 │
│     for each sub-sample → MultiMixQASample → encode_multi_mix_qa │
│   else:                                                          │
│     for each sub-sample → CaptioningSample → encode_captioning   │
│   pack_selected_samples(l_samples)                               │
│     → ImageTaskSamplePacked with cu_lengths=[0,L1,L1+L2,...]     │
└────────────────────────────────┬─────────────────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│ task_encoder.batch()                                             │
│   if is_packing_enabled or OFFLINE_PACKED_DATA == 1: ◄── GATE 2 │
│     cu_lengths = stack([s.cu_lengths for s in samples])         │
│   else:                                                          │
│     cu_lengths = [[0]]   # dummy, signals "not packed"           │
└────────────────────────────────┬─────────────────────────────────┘
                                 │ batch dict broadcast via tensor_parallel
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│ pretrain_llava_onevision2.get_batch_on_this_tp_rank()            │
│   if cu_lengths.shape == [1,1]: packed_seq_params = None         │
│   else:                                                          │
│     assert cu_lengths.shape[0] == 1   # MBS=1 required           │
│     packed_seq_params = PackedSeqParams(                         │
│         qkv_format="thd",                                        │
│         cu_seqlens_q=cu_lengths[0],                              │
│         cu_seqlens_kv=cu_lengths[0], ...)                        │
└────────────────────────────────┬─────────────────────────────────┘
                                 │
                                 ▼
                 Model forward → flash-attn varlen
               (see cu-lengths-attention-flow skill)
Recipe: Correct Stage-N Script Snippet
export OFFLINE_PACKING_BMR='1'
export OFFLINE_PACKED_DATA='1'
MBS=1
For an A/B control run that uses the same packed dataset but disables packing semantics (to measure the leakage cost), set both to '0'. Setting only BMR=1 or only PACKED_DATA=1 is not a valid configuration; it is a bug.
Concrete Stage-1 A/B Pair (this repo)
examples/llava_onevision2/quick_start_4b/stage_1_alignment_p16m3_packed.sh: production, BMR=1, PACKED_DATA=1.
examples/llava_onevision2/quick_start_4b/stage_1_alignment_p16m3_packed_bmr_only.sh: A/B control, BMR=1, PACKED_DATA=0. Note: this is the dangerous combo described above; it is named bmr_only deliberately to study the leakage effect, not as a recommended setting.
If you copy _bmr_only.sh for a real production run, you will get cross-sub-sample attention leakage. Always confirm intent.
Diagnostics
If your packed training looks "off" (loss too low / too smooth / model overfits prefixes):
- grep -n 'OFFLINE_PACKING_BMR\|OFFLINE_PACKED_DATA' your_script.sh: both should be '1'.
- Add a one-shot print in task_encoder.batch() after line 365: print('cu_lengths.shape:', cu_lengths.shape). Expect [1, P+1] with P >= 2. If you see [1, 1], Gate 2 is closed.
- Add a print in pretrain_llava_onevision2.py after line 168: print('packed_seq_params:', packed_seq_params). Should be a real PackedSeqParams, not None.
- Confirm MBS=1 in the shell (--micro-batch-size 1). Otherwise the assert at line 157 fires and you wouldn't be reading this.
- Confirm dataset is actually packed: cat $DATA_PATH/.../webdataset/.nv-meta/.info.yaml and look for the shard structure produced by auto_pipe.sh (PackedCaptioningSample).
- Do not add OFFLINE_PACKING_VQA=1 thinking it helps. It does nothing in this codebase.
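Beyond printing the shape, the offsets themselves can be sanity-checked. The sketch below is a hypothetical helper, not part of the repo. It assumes, without confirmation from the source, that a healthy cu_lengths row starts at 0, is strictly increasing, and ends at the token count of the packed sequence; the last condition may not hold if the sequence is padded after packing, so treat a mismatch there as a hint rather than proof of a bug.

```python
def validate_cu_lengths(cu_lengths, total_tokens):
    """Return a list of human-readable problems found in one offsets row.

    cu_lengths: a plain list of ints, e.g. cu_lengths[0].tolist() in a debug hook.
    total_tokens: length of the packed token sequence (assumed unpadded).
    """
    problems = []
    if not cu_lengths or cu_lengths[0] != 0:
        problems.append("offsets must start at 0")
    if any(b <= a for a, b in zip(cu_lengths, cu_lengths[1:])):
        problems.append("offsets must be strictly increasing")
    if cu_lengths and cu_lengths[-1] != total_tokens:
        problems.append(
            f"last offset {cu_lengths[-1]} != sequence length {total_tokens}"
        )
    return problems


print(validate_cu_lengths([0, 5, 8, 12], total_tokens=12))  # []
print(validate_cu_lengths([0, 5, 5, 12], total_tokens=12))
```

An empty list means the row passed all three checks; anything else points at the producer side (Gate 1) rather than the gates in batch().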
Cross-References
- Producer pipeline (how the packed shards are built): distributed-offline-packing skill.
- Consumer attention semantics (how cu_lengths is interpreted by ViT and LLM): cu-lengths-attention-flow skill.
- Dataloader length-balancing across ranks: length-pool-sort-dataset skill.
Source File Index
| File | Lines | What |
|---|---|---|
| aiak_training_llm/data/multimodal/task_encoder.py | 186-279 | PackedCaptioningSample branch + Gate 1 (OFFLINE_PACKING_BMR) |
| aiak_training_llm/data/multimodal/task_encoder.py | 359-365 | batch() Gate 2 (OFFLINE_PACKED_DATA) |
| aiak_training_llm/data/multimodal/task_encoder.py | 401-477 | pack_selected_samples: builds real cu_lengths |
| aiak_training_llm/train/pretrain/pretrain_llava_onevision2.py | 145-168 | Consumer: cu_lengths.shape check + PackedSeqParams construction + MBS=1 assert |
| aiak_training_llm/train/pretrain/pretrain_llava_onevision2.py | 171-207 | SP padding for packed_seq_params (TP/SP-only path) |