원클릭으로
distributed-offline-packing
// Bilingual guide for running offline_packing/auto_pipe.sh across multiple nodes to produce padding-free packed WebDataset shards for SFT, with Energon Metadataset assembly
// Bilingual guide for running offline_packing/auto_pipe.sh across multiple nodes to produce padding-free packed WebDataset shards for SFT, with Energon Metadataset assembly
Bilingual guide for running and interpreting LLaVA-OneVision2 HF vs Megatron consistency checks across TP and PP settings
Bilingual guide for merging ViT + LLM into LlavaOnevision2 HF checkpoint and validating weight/inference consistency
Guide for writing clear, consistent git commit messages following this repository's conventions
Bilingual guide for the OFFLINE_PACKING_BMR and OFFLINE_PACKED_DATA environment variables that control LLaVA-OneVision2 training-side packing — what each gate does, why both must be enabled together, MBS=1 requirement, and the dead OFFLINE_PACKING_VQA branch
Bilingual guide for understanding how cu_lengths controls attention behavior across ViT and LLM stages, and how patch_positions scope differs between the two
Bilingual guide for understanding LengthPoolSortDataset cross-rank length synchronization mechanism in multi-GPU training
| name | distributed-offline-packing |
| description | Bilingual guide for running offline_packing/auto_pipe.sh across multiple nodes to produce padding-free packed WebDataset shards for SFT, with Energon Metadataset assembly |
| compatibility | opencode |
| metadata | {"domain":"data-pipeline","framework":"llava-onevision2","repo":"llava-onevision2"} |
Use this skill when packing a large SFT JSONL (hundreds of thousands to millions of samples) into Energon WebDataset shards at a fixed sequence length, using offline_packing/auto_pipe.sh parallelized across multiple nodes.
当需要把大规模 SFT JSONL(几十万到几百万样本)按固定序列长度打包成 Energon WebDataset shards,并通过 offline_packing/auto_pipe.sh 在多台机器上并行处理时,使用这个 skill。
offline_packing/auto_pipe.sh and stage scripts s1_split_json_to_samples.py … s4_bins_to_webdataset.pyimages, prompts, captions (multi-turn list-of-lists) and image paths are usable as-is所有节点共享同一个 NFS(数据+代码+输出)。每台机器使用同一个 docker 镜像,里面要有 transformers、energon、项目代码。需要 offline_packing/auto_pipe.sh 和 s1–s4 四个 stage 脚本。Tokenizer 和 image processor 是本地 HF 格式目录。源 JSONL 每行包含 images / prompts / captions(多轮是 list-of-list),图片路径可直接使用。
JSONL (N samples)
├─ s1_split_json_to_samples.py # validate + drop bad/missing-image samples
│ # output: per-sample serialized records
├─ s2_compute_token_lengths.py # tokenize prompts/captions, compute image-patch tokens
│ # output: length array per sample
├─ s3_bin_packing.py # BFD (Best-Fit-Decreasing) into bins of capacity L
│ # output: bin assignment
└─ s4_bins_to_webdataset.py # write tar shards + idx + .nv-meta/{dataset.yaml,split.yaml,sample_loader.py}
auto_pipe.sh runs all four stages sequentially on one node for one input file. To use N nodes, split the JSONL into N parts and run auto_pipe.sh independently on each — s3 BFD does NOT shard across nodes, so each node packs its own slice.
auto_pipe.sh 在一台机器上对一个输入文件顺序跑完四个 stage。要用 N 台机器就把 JSONL 切成 N 份,每台独立跑一次 auto_pipe.sh——s3 BFD 不能跨节点共享,所以每台只 pack 自己的那一份。
--shard-prefix <name>_<a|b|...> so tar files don't collide on shared NFSPackedCaptioningSample for image+text packed turns, MultiMixQASample for QA-style. The class controls the auto-generated sample_loader.py--no-npy flag: only set when JSONL has no precomputed patch_positions field. Without --no-npy s2 expects per-sample .npy files; missing files cause noisy warnings AND fall back to slow real-image tokenization<output_root>/node_<x>/webdataset/{*.tar, *.idx, .nv-meta/}Scan token lengths once on the full JSONL using s2_compute_token_lengths.py (or a quick standalone script). Pick L so that drop rate is acceptable (typically <0.1%).
# Inside container, on any single node
python offline_packing/s2_compute_token_lengths.py \
--jsonl <path/to/full.jsonl> \
--tokenizer <path/to/tokenizer> \
--image-processor Qwen2_5_VLProcessor \
--factor 48 --min-pixels 3136 --max-pixels 4000000 \
--output <path/to/token_lens.txt>
# Then quickly inspect distribution (max, p99, count > L) before committing to L
TOTAL=$(wc -l < full.jsonl)
HALF=$(( (TOTAL + 1) / 2 ))
split -l $HALF -d --additional-suffix=.jsonl full.jsonl part_
# produces part_00.jsonl, part_01.jsonl
For >2 nodes, adjust -l accordingly.
Make sure the same NFS is mounted on every node at the same path so paths in JSONL and outputs match.
Use the project's standard docker image with the repo bind-mounted. Working directory should be the repo root.
On node A:
cd <repo_root>
bash offline_packing/auto_pipe.sh \
--jsonl <data_root>/part_00.jsonl \
--tokenizer <tokenizer_path> \
--image-processor Qwen2_5_VLProcessor \
--factor 48 --min-pixels 3136 --max-pixels 4000000 \
--image-root / \
--sample-class PackedCaptioningSample \
--shard-prefix <dataset_name>_a \
--output-dir <output_root>/node_a \
--seq-len 4096 \
--no-npy \
2>&1 | tee <log_dir>/node_a.log
On node B (in parallel):
bash offline_packing/auto_pipe.sh \
--jsonl <data_root>/part_01.jsonl \
... \
--shard-prefix <dataset_name>_b \
--output-dir <output_root>/node_b \
... \
2>&1 | tee <log_dir>/node_b.log
[!IMPORTANT]
--shard-prefixmust differ between nodes so tar filenames don't collide--output-dirmust differ between nodes--image-root /if JSONL paths are absolute; otherwise set it to the image root prefix- Add
--no-npyif the JSONL has nopatch_positionsfield
# Tar count per node should match s4 log
ls <output_root>/node_a/webdataset/*.tar | wc -l
ls <output_root>/node_b/webdataset/*.tar | wc -l
# Bin count + capacity utilization printed by s3
grep -E "(bins|util|efficiency)" <log_dir>/node_a.log
# Inspect one tar to confirm sample schema
mkdir -p /tmp/tar_inspect && cd /tmp/tar_inspect
tar -xf <output_root>/node_a/webdataset/<prefix>-000000.tar
ls | head
python -c "import json; d=json.load(open(open(__import__('glob').glob('*.json')[0]).name)); print(list(d.keys()))"
Expected JSON top-level keys for PackedCaptioningSample:
images, prompts, captions, sample_count, patch_positions, timestamp_decimal.
patch_positions=[[""]] is normal under --no-npy — the auto-generated sample_loader.py handles it via sample.get(..., None).
__module__: megatron.energon
__class__: Metadataset
splits:
train:
datasets:
- weight: <num_samples_in_part_00>
path: <output_root>/node_a/webdataset
subflavors:
augmentation: false
- weight: <num_samples_in_part_01>
path: <output_root>/node_b/webdataset
subflavors:
augmentation: false
Notes / 注意:
weight is sample-count proportional (use the original input line count of each part, not the bin count)val split only if you actually need one; for SFT-only pipelines omit itsubflavors is optional; here we mark augmentation: false since data is already packed--no-npyWithout it s2 hunts for per-sample .npy files and either spams warnings OR falls back to the slow real-image tokenization path. Always check the source JSONL for a patch_positions field first.
--shard-prefix on multiple nodesTar filenames collide on shared NFS, second node overwrites the first. Use distinct suffixes (_a, _b, _n0, _n1, …).
rm -rf still running on NFSRemoving millions of small files from NFS can take 10+ minutes. Don't wait — use a fresh _v2 output dir, and mv (atomic rename on same FS) when done if you want the original name back.
du -sh / rm -rf exceeding bash 120s timeoutRun them as nohup ... & and poll the PID, or run them in tmux.
Container time may differ from host time but wall clock is the same. Don't be confused by log timestamps when comparing across docker exec sessions.
--sample-classThe class determines the auto-generated sample_loader.py and how downstream code unpacks tars. Confirm by reading the dataclass file (e.g. aiak_training_llm/data/multimodal/flavors/packed_captioning.py) and matching its fields to your JSONL shape.
weight confusion in Metadataset yamlUse sample counts (or proportional integers), not bin counts. Energon samples each dataset proportional to weight.
For ~390k samples per node at L=4096 on a multi-core machine with NFS storage:
rm is running)--no-npy): ~5 minTwo nodes in parallel ≈ same wall time as one node, so 2× throughput.
Typical packing efficiency at L=4096 with avg ~10 samples/bin: >99% capacity utilization.
Before declaring done:
sample_count in tar JSON > 0 (not all 1; means packing actually worked)dataset.yaml exists with absolute paths and correct weightsoffline_packing/auto_pipe.sh — pipeline driveroffline_packing/s1_split_json_to_samples.py — JSONL → per-sample recordsoffline_packing/s2_compute_token_lengths.py — token length computationoffline_packing/s3_bin_packing.py — BFD bin packingoffline_packing/s4_bins_to_webdataset.py — tar + idx + .nv-meta writeraiak_training_llm/data/multimodal/flavors/packed_captioning.py — PackedCaptioningSample dataclassaiak_training_llm/data/multimodal/flavors/multi_mix_qa.py — MultiMixQASample dataclassaiak_megatron/examples/multimodal/sft_dataset.yaml — Metadataset yaml template