distributed-offline-packing

Name: Distributed Offline Packing
Author: EvolvingLMMs-Lab

Bilingual guide for running offline_packing/auto_pipe.sh across multiple nodes to produce padding-free packed WebDataset shards for SFT, with Energon Metadataset assembly.

stars: 998

forks: 72

updated: April 28, 2026 04:00

related-skills.json

Same repository

llava-onevision2-consistency.md

from "EvolvingLMMs-Lab/LLaVA-OneVision-2"

Bilingual guide for running and interpreting LLaVA-OneVision2 HF vs Megatron consistency checks across TP and PP settings

2026-05-26, 998 stars

merge-ov2.md

from "EvolvingLMMs-Lab/LLaVA-OneVision-2"

Bilingual guide for merging ViT + LLM into LlavaOnevision2 HF checkpoint and validating weight/inference consistency

2026-05-26, 998 stars

commit-message.md

from "EvolvingLMMs-Lab/LLaVA-OneVision-2"

Guide for writing clear, consistent git commit messages following this repository's conventions

2026-05-06, 998 stars

offline-packing-env-vars.md

from "EvolvingLMMs-Lab/LLaVA-OneVision-2"

Bilingual guide for the OFFLINE_PACKING_BMR and OFFLINE_PACKED_DATA environment variables that control LLaVA-OneVision2 training-side packing — what each gate does, why both must be enabled together, MBS=1 requirement, and the dead OFFLINE_PACKING_VQA branch

2026-05-06, 998 stars

cu-lengths-attention-flow.md

from "EvolvingLMMs-Lab/LLaVA-OneVision-2"

Bilingual guide for understanding how cu_lengths controls attention behavior across ViT and LLM stages, and how patch_positions scope differs between the two

2026-03-26, 998 stars

length-pool-sort-dataset.md

from "EvolvingLMMs-Lab/LLaVA-OneVision-2"

Bilingual guide for understanding LengthPoolSortDataset cross-rank length synchronization mechanism in multi-GPU training

2026-03-26, 998 stars

package.json

"author": "EvolvingLMMs-Lab"

"repository": "EvolvingLMMs-Lab/LLaVA-OneVision-2"

JSONL (N samples)
├─ s1_split_json_to_samples.py   # validate + drop bad/missing-image samples
│                                # output: per-sample serialized records
├─ s2_compute_token_lengths.py   # tokenize prompts/captions, compute image-patch tokens
│                                # output: length array per sample
├─ s3_bin_packing.py             # BFD (Best-Fit-Decreasing) into bins of capacity L
│                                # output: bin assignment
└─ s4_bins_to_webdataset.py      # write tar shards + idx + .nv-meta/{dataset.yaml,split.yaml,sample_loader.py}
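The s3 stage names Best-Fit-Decreasing (BFD) as its packing strategy. The sketch below illustrates that general technique only; it is not the code in s3_bin_packing.py. The function name `best_fit_decreasing`, the list-of-indices output, and the choice to raise on samples longer than the capacity are all assumptions made for illustration.

```python
from bisect import bisect_left, insort


def best_fit_decreasing(lengths, capacity):
    """Pack sample indices into bins holding at most `capacity` tokens.

    Samples are visited longest first; each goes into the open bin with the
    least remaining room that still fits it, or into a new bin if none fits.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins = []  # bins[b] is the list of sample indices placed in bin b
    free = []  # sorted list of (remaining_capacity, bin_id)
    for i in order:
        need = lengths[i]
        if need > capacity:
            # Assumption: over-long samples are rejected rather than truncated.
            raise ValueError(f"sample {i} has {need} tokens, capacity is {capacity}")
        # First entry with remaining_capacity >= need is the tightest fit.
        pos = bisect_left(free, (need, -1))
        if pos == len(free):
            bins.append([i])
            insort(free, (capacity - need, len(bins) - 1))
        else:
            remaining, b = free.pop(pos)
            bins[b].append(i)
            insort(free, (remaining - need, b))
    return bins


if __name__ == "__main__":
    demo = [3000, 2000, 1500, 1000, 600, 96]
    for b, members in enumerate(best_fit_decreasing(demo, 4096)):
        print(b, members, sum(demo[i] for i in members))
```

Keeping the free list sorted makes each placement a binary search plus a list insert, which is enough for a demonstration; a production packer over millions of samples would likely use a different structure.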

# Inside container, on any single node
python offline_packing/s2_compute_token_lengths.py \
  --jsonl <path/to/full.jsonl> \
  --tokenizer <path/to/tokenizer> \
  --image-processor Qwen2_5_VLProcessor \
  --factor 48 --min-pixels 3136 --max-pixels 4000000 \
  --output <path/to/token_lens.txt>

# Then quickly inspect distribution (max, p99, count > L) before committing to L
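The distribution check mentioned in the pre-flight step (max, p99, count > L) can be sketched as follows. This assumes the length file holds one integer token count per line, which the source does not confirm; `load_lengths` and `summarize_lengths` are hypothetical helper names, and p99 uses the nearest-rank definition.

```python
def load_lengths(path):
    """Read token counts, assuming one integer per non-empty line."""
    with open(path) as fh:
        return [int(line) for line in fh if line.strip()]


def summarize_lengths(lengths, seq_len):
    """Return count, max, nearest-rank p99, and how many samples exceed seq_len."""
    if not lengths:
        raise ValueError("no lengths given")
    ordered = sorted(lengths)
    # Nearest-rank p99: 1-based rank ceil(0.99 * n), done in integer arithmetic.
    rank = (99 * len(ordered) + 99) // 100
    return {
        "count": len(ordered),
        "max": ordered[-1],
        "p99": ordered[rank - 1],
        "over_limit": sum(1 for n in ordered if n > seq_len),
    }


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python inspect_lens.py <path/to/token_lens.txt> 4096
    stats = summarize_lengths(load_lengths(sys.argv[1]), int(sys.argv[2]))
    print(stats)
```

If `over_limit` is a large share of `count`, that suggests choosing a larger L or filtering those samples before packing.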

cd <repo_root>
bash offline_packing/auto_pipe.sh \
  --jsonl <data_root>/part_00.jsonl \
  --tokenizer <tokenizer_path> \
  --image-processor Qwen2_5_VLProcessor \
  --factor 48 --min-pixels 3136 --max-pixels 4000000 \
  --image-root / \
  --sample-class PackedCaptioningSample \
  --shard-prefix <dataset_name>_a \
  --output-dir <output_root>/node_a \
  --seq-len 4096 \
  --no-npy \
  2>&1 | tee <log_dir>/node_a.log

bash offline_packing/auto_pipe.sh \
  --jsonl <data_root>/part_01.jsonl \
  ... \
  --shard-prefix <dataset_name>_b \
  --output-dir <output_root>/node_b \
  ... \
  2>&1 | tee <log_dir>/node_b.log

# Tar count per node should match s4 log
ls <output_root>/node_a/webdataset/*.tar | wc -l
ls <output_root>/node_b/webdataset/*.tar | wc -l

# Bin count + capacity utilization printed by s3
grep -E "(bins|util|efficiency)" <log_dir>/node_a.log

# Inspect one tar to confirm sample schema
mkdir -p /tmp/tar_inspect && cd /tmp/tar_inspect
tar -xf <output_root>/node_a/webdataset/<prefix>-000000.tar
ls | head
python -c "import glob, json; d = json.load(open(glob.glob('*.json')[0])); print(list(d.keys()))"
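The schema check can also be done without unpacking the shard onto disk. The sketch below reads the first `.json` member straight from the tar using the standard library. It assumes each sample carries a `.json` member holding a JSON object, as the shell one-liner implies; `first_json_keys` is a hypothetical helper name.

```python
import json
import tarfile


def first_json_keys(tar_path):
    """Return (member_name, sorted_keys) for the first .json member in a shard."""
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".json"):
                # extractfile returns a binary reader; json.load accepts bytes.
                payload = json.load(tar.extractfile(member))
                return member.name, sorted(payload)
    raise FileNotFoundError(f"no .json member found in {tar_path}")


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python inspect_shard.py <path/to/shard.tar>
    name, keys = first_json_keys(sys.argv[1])
    print(name, keys)
```

Iterating the archive stops at the first match, so only the head of the shard is read, which helps when shards live on NFS.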

__module__: megatron.energon
__class__: Metadataset
splits:
  train:
    datasets:
      - weight: <num_samples_in_part_00>
        path: <output_root>/node_a/webdataset
        subflavors:
          augmentation: false
      - weight: <num_samples_in_part_01>
        path: <output_root>/node_b/webdataset
        subflavors:
          augmentation: false
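With more than two nodes, hand-writing the Metadataset entries gets error-prone. The sketch below renders the same layout from a list of (path, sample count) pairs. It only emits text mirroring the fields shown above and does not call any Energon API; `render_metadataset` is a hypothetical helper, and the per-node sample counts must be supplied by the caller.

```python
def render_metadataset(parts):
    """Render a Metadataset yaml string from (webdataset_dir, num_samples) pairs.

    Each entry's weight is set to its sample count, mirroring the layout above.
    """
    if not parts:
        raise ValueError("at least one dataset entry is required")
    lines = [
        "__module__: megatron.energon",
        "__class__: Metadataset",
        "splits:",
        "  train:",
        "    datasets:",
    ]
    for path, num_samples in parts:
        if num_samples <= 0:
            raise ValueError(f"non-positive sample count for {path}")
        lines += [
            f"      - weight: {num_samples}",
            f"        path: {path}",
            "        subflavors:",
            "          augmentation: false",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder paths and counts for illustration only.
    print(render_metadataset([
        ("/data/out/node_a/webdataset", 120),
        ("/data/out/node_b/webdataset", 80),
    ]))
```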

name	distributed-offline-packing
description	Bilingual guide for running offline_packing/auto_pipe.sh across multiple nodes to produce padding-free packed WebDataset shards for SFT, with Energon Metadataset assembly
compatibility	opencode
metadata	{"domain":"data-pipeline","framework":"llava-onevision2","repo":"llava-onevision2"}

Purpose / 用途

Prerequisites / 前置条件

Architecture / 架构

Pipeline Stages / 流水线阶段

Key Design Decisions / 关键设计

Step-by-step Workflow / 操作步骤

0. Pre-flight: choose L / 预检：选 L

1. Split JSONL across nodes / 切分 JSONL

2. Mount NFS on every node / 挂载 NFS

3. Launch container on every node / 每台启动容器

4. Run auto_pipe.sh in parallel on each node / 并行启动

5. Verify outputs / 验证产物

6. Write top-level Metadataset yaml / 写顶层 Metadataset yaml

Common Pitfalls / 常见坑

Pitfall 1: forgetting --no-npy

Pitfall 2: same --shard-prefix on multiple nodes

Pitfall 3: starting fresh while old rm -rf still running on NFS

Pitfall 4: du -sh / rm -rf exceeding bash 120s timeout

Pitfall 5: container vs host timezone skew

Pitfall 6: choosing the wrong --sample-class

Pitfall 7: weight confusion in Metadataset yaml

Performance Reference / 性能参考

Quick Sanity Checklist / 快速自检清单

Related Files / 相关文件
