com um clique
length-pool-sort-dataset
// Bilingual guide for understanding LengthPoolSortDataset cross-rank length synchronization mechanism in multi-GPU training
// Bilingual guide for understanding LengthPoolSortDataset cross-rank length synchronization mechanism in multi-GPU training
Bilingual guide for running and interpreting LLaVA-OneVision2 HF vs Megatron consistency checks across TP and PP settings
Bilingual guide for merging ViT + LLM into LlavaOnevision2 HF checkpoint and validating weight/inference consistency
Guide for writing clear, consistent git commit messages following this repository's conventions
Bilingual guide for the OFFLINE_PACKING_BMR and OFFLINE_PACKED_DATA environment variables that control LLaVA-OneVision2 training-side packing — what each gate does, why both must be enabled together, MBS=1 requirement, and the dead OFFLINE_PACKING_VQA branch
Bilingual guide for running offline_packing/auto_pipe.sh across multiple nodes to produce padding-free packed WebDataset shards for SFT, with Energon Metadataset assembly
Bilingual guide for understanding how cu_lengths controls attention behavior across ViT and LLM stages, and how patch_positions scope differs between the two
| name | length-pool-sort-dataset |
| description | Bilingual guide for understanding LengthPoolSortDataset cross-rank length synchronization mechanism in multi-GPU training |
| compatibility | opencode |
| metadata | {"domain":"distributed-training","framework":"megatron-energon","repo":"llava-onevision2"} |
Use this skill when analyzing, debugging, or tuning LengthPoolSortDataset — the cross-rank length synchronization mechanism used in this repository's training pipeline.
在分析、调试或调优本仓库训练 pipeline 中的跨 rank 长度同步机制 LengthPoolSortDataset 时,使用这个 skill。
This skill is specifically for:
aiak_training_llm/data/multimodal/length_sort_dataset.pylength_sort_pool_size > 0pool_size for optimal multi-GPU efficiency这个 skill 专门用于:
aiak_training_llm/data/multimodal/length_sort_dataset.pylength_sort_pool_size > 0 时训练速度提升pool_size 以获得最佳多卡效率上游 dataset → 累积 pool_size 个 sample → 按序列长度排序 → 用确定性 seed shuffle → 逐个 yield
for batch_idx, sample in enumerate(self.dataset):
pool.append(sample)
if len(pool) >= self.pool_size:
pool.sort(key=self.key_fn) # 1. 按长度排序
shuffle_seed = 42 + batch_idx # 2. 确定性 seed
random.Random(shuffle_seed).shuffle(pool) # 3. 同 seed shuffle
for s in pool:
yield s
pool.clear()
CrudeWebdataset → ShuffleBuffer → cook_crude_sample → encode_sample
→ LengthPoolSortDataset → BatchDataset → EpochizeDataset → LogSampleDataset
Inserted after encode_sample (where total_len / tokens are available) and before BatchDataset.
插在 encode_sample 之后(此时已有 total_len / tokens)、BatchDataset 之前。
Activated by: --length-sort-pool-size N (where N > 0).
通过 --length-sort-pool-size N(N > 0)激活。
In multi-GPU data-parallel training, all ranks must synchronize at each step (gradient all-reduce). If different ranks process samples of very different lengths, fast ranks idle waiting for slow ranks.
多卡数据并行训练中,所有 rank 每步都要同步(梯度 all-reduce)。如果不同 rank 处理的 sample 长度差异很大,快的 rank 空等慢的 rank。
无 pool sort:
Rank 0 step 100: 长度 200 → 0.5s ┐
Rank 1 step 100: 长度 5000 → 3.0s ├→ 所有 rank 等 3.0s
Rank 2 step 100: 长度 300 → 0.6s ┘
Sort + same-seed shuffle ensures all ranks yield samples of approximately the same length at the same time.
排序 + 同 seed shuffle 保证所有 rank 在同一时刻输出近似相同长度的 sample。
有 pool sort:
Rank 0 step 100: 长度 ~200 → 0.5s ┐
Rank 1 step 100: 长度 ~200 → 0.5s ├→ 所有 rank 等 0.5s
Rank 2 step 100: 长度 ~200 → 0.5s ┘
Each rank's data comes from different shards of the same dataset, so length distributions are approximately identical. After sorting, the i-th position in each rank's pool corresponds to the i-th quantile of its length distribution. Similar distributions → similar quantiles → similar lengths.
各 rank 的数据来自同一数据集的不同 shard,长度分布近似相同。排序后,各 rank pool 中第 i 个位置对应各自长度分布的第 i 个分位数。分布相似 → 分位数相似 → 长度相似。
Rank 0 排序后: [100, 102, 105, 108, ..., 4998, 5000]
Rank 1 排序后: [101, 103, 106, 109, ..., 4999, 5001]
Rank 2 排序后: [ 99, 104, 107, 110, ..., 4997, 5002]
shuffle_seed = 42 + batch_idx is identical across all ranks → same permutation applied to all pools. Since the i-th position had similar lengths, the shuffled output at the same position still has similar lengths.
shuffle_seed = 42 + batch_idx 对所有 rank 相同 → 相同排列应用于所有 pool。由于第 i 个位置的长度相似,shuffle 后同一位置的输出长度仍然相似。
permutation = [3, 0, 4, 1, 2]:
Rank 0 输出: [108, 100, 5000, 102, 105]
Rank 1 输出: [109, 101, 5001, 103, 106]
Rank 2 输出: [110, 99, 5002, 104, 107]
→ 同一位置长度近似一致
| 方案 | 效果 |
|---|---|
| 只 sort 不 shuffle | 所有 rank 先跑短 sample 后跑长 sample。前期 step 极快,后期 step 极慢。训练动态不稳定 |
| sort + shuffle | 各 step 长度随机但 rank 间一致。step 耗时平稳,rank 间同步开销小 |
| 不 sort 不 shuffle | 各 rank 同一 step 长度随机且不一致,快的等慢的 |
Sort solves cross-rank synchronization. Shuffle solves temporal uniformity. Both are required.
排序解决跨 rank 同步问题,shuffle 解决时序均匀性问题。二者缺一不可。
| pool_size | 跨 rank 长度同步精度 | 内存开销 | 首个 pool 输出延迟 |
|---|---|---|---|
| 小(~100) | 较差,各 rank 分位数估计不准 | 低 | 低 |
| 中(~1000-10000) | 好,推荐范围 | 中 | 中 |
| 大(~全量) | 完美同步 | 高 | 高 |
| 极限 = dataset 大小 | 等价全局排序 | 不实际 | 不实际 |
Larger pool_size → more accurate quantile estimation across ranks → better synchronization → less idle time.
pool_size 越大 → 各 rank 分位数估计越准 → 同步越好 → 空等越少。
Rule of thumb: pool_size should be significantly larger than batch_size, ideally 10x-100x.
经验法则:pool_size 应远大于 batch_size,理想情况下 10x-100x。
When num_workers > 1, each worker runs an independent LengthPoolSortDataset instance with its own pool.
当 num_workers > 1 时,每个 worker 运行独立的 LengthPoolSortDataset 实例,各自维护独立 pool。
Correctness: No issue. Each worker sorts its own shard subset independently.
Synchronization effect: Diluted. DataLoader interleaves outputs from multiple workers, partially disrupting the within-pool length ordering.
Recommendation: num_workers=1 gives the strongest synchronization effect. If num_workers > 1 is needed for I/O throughput, increase pool_size proportionally.
正确性:无问题。每个 worker 独立排序自己的 shard 子集。
同步效果:被稀释。DataLoader 交替取多个 worker 的输出,部分打乱 pool 内的长度顺序。
建议:num_workers=1 同步效果最强。如果需要 num_workers > 1 提升 I/O 吞吐,可按比例增大 pool_size。
save_state / restore_state delegate directly to the upstream dataset. The pool's internal state (accumulated but not yet yielded samples) is NOT saved.
save_state / restore_state 直接委托给上游 dataset。pool 内部状态(已累积但未 yield 的 sample)不会被保存。
On resume, up to pool_size - 1 samples may be re-ordered differently or skipped. This is generally negligible for large datasets.
恢复时,最多 pool_size - 1 个 sample 可能会被不同地排序或跳过。对于大数据集来说通常可以忽略。
key_fn=lambda s: getattr(s, "total_len", len(getattr(s, "tokens")))
Prefers total_len attribute (set by encode_sample), falls back to len(tokens).
优先使用 total_len 属性(由 encode_sample 设置),回退到 len(tokens)。
Is --length-sort-pool-size actually set and > 0?
What is the actual length distribution of the dataset? (High variance → more benefit from pool sort)
Are all ranks receiving data from the same underlying dataset with similar length distributions?
Is num_workers > 1 diluting the synchronization effect?
After enabling, did step time variance across steps decrease?
Is memory usage acceptable for the chosen pool_size?
--length-sort-pool-size 是否真正设置且 > 0?
数据集的实际长度分布如何?(方差越大 → pool sort 收益越大)
所有 rank 是否从同一底层数据集接收长度分布相似的数据?
num_workers > 1 是否稀释了同步效果?
启用后,各 step 的 step time 方差是否降低了?
所选 pool_size 的内存占用是否可接受?
When asked to analyze a training speed issue related to length sorting, return:
当被要求分析与长度排序相关的训练速度问题时,应返回:
Whether LengthPoolSortDataset is active and with what pool_size
The dataset's length distribution characteristics (high/low variance)
The relationship between pool_size, batch_size, and num_workers
Whether cross-rank synchronization is the bottleneck
Recommended pool_size adjustment if needed
LengthPoolSortDataset 是否激活,pool_size 是多少
数据集的长度分布特征(高/低方差)
pool_size、batch_size、num_workers 之间的关系
跨 rank 同步是否是瓶颈
如需要,推荐的 pool_size 调整