megatron-checkpoint-layout
| name | megatron-checkpoint-layout |
| description | Bilingual guidance for Megatron checkpoint 1D 2D 3D mp_rank layouts across tensor pipeline and expert parallel dimensions |
| compatibility | opencode |
| metadata | {"domain":"distributed-training","framework":"megatron","repo":"llava-onevision2"} |
Use this skill when diagnosing, designing, or converting Megatron / Megatron-Core checkpoints that may be sharded by tensor parallelism (TP), pipeline parallelism (PP), and expert parallelism (EP).
TP only (1D): mp_rank_{tp}
TP + PP (2D): mp_rank_{tp}_{pp}
TP + PP + EP (3D): mp_rank_{tp}_{pp}_{ep}
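As a minimal sketch of the three naming patterns above, the helper below builds a shard directory name from rank coordinates. The function name `shard_dir_name` is hypothetical, and ranks are rendered as plain integers for readability; Megatron checkpoints commonly zero-pad each field, so match whatever padding actually appears on disk.

```python
from typing import Optional


def shard_dir_name(tp: int, pp: Optional[int] = None, ep: Optional[int] = None) -> str:
    """Build a shard directory name following the 1D / 2D / 3D patterns.

    Assumption: ranks are written as plain integers (no zero-padding).
    """
    if ep is not None:
        # EP present: the contract is 3D, so a pipeline rank is required
        # even when pp=1 (pass 0 in that case).
        if pp is None:
            raise ValueError("an EP checkpoint needs a pipeline rank (use 0 when pp=1)")
        return f"mp_rank_{tp}_{pp}_{ep}"
    if pp is not None:
        # TP + PP without EP: 2D.
        return f"mp_rank_{tp}_{pp}"
    # TP only: 1D.
    return f"mp_rank_{tp}"
```

Calling `shard_dir_name(0, 0, 3)` yields a three-coordinate name even though the tensor and pipeline ranks are both zero, which mirrors the point that EP, not PP, introduces the third coordinate.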
The key discriminator is whether expert parallelism participates in checkpoint sharding.
If EP is present, treat the checkpoint layout as 3D.
If EP is absent, treat the checkpoint layout as non-EP and use 1D or 2D.
Megatron does not treat pp > 1 by itself as meaning 3D.
PP adds a pipeline index.
EP adds an expert index.
The third coordinate exists because EP exists, not because PP exists.
So even if tp=1 and pp=1, once EP is enabled the checkpoint naming is still conceptually 3D, because ranks are addressed by (tp, pp, ep).
When reading or converting checkpoints:
First decide whether EP exists in the checkpoint contract.
If EP exists, require mp_rank_{tp}_{pp}_{ep}.
If EP does not exist, read as mp_rank_{tp} or mp_rank_{tp}_{pp}.
Do not infer 3D solely from pipeline_model_parallel_size > 1.
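The reading rules above can be sketched as two small helpers: one counts how many numeric coordinates each `mp_rank_*` directory name carries, and the other returns the pattern a reader should require. Both function names and the regular expression are illustrative assumptions, not part of Megatron or this repository.

```python
import re
from typing import Iterable, Set

# Accepts mp_rank_<a>, mp_rank_<a>_<b>, or mp_rank_<a>_<b>_<c> (digits only).
_SHARD_RE = re.compile(r"^mp_rank_(\d+)(?:_(\d+))?(?:_(\d+))?$")


def coordinate_counts(dir_names: Iterable[str]) -> Set[int]:
    """Return the set of coordinate counts (1, 2, or 3) seen among shard names."""
    counts: Set[int] = set()
    for name in dir_names:
        match = _SHARD_RE.match(name)
        if match:
            counts.add(sum(group is not None for group in match.groups()))
    return counts


def required_pattern(has_ep: bool, pp_size: int) -> str:
    """Decide on EP first; PP only separates 1D from 2D inside the non-EP branch."""
    if has_ep:
        return "mp_rank_{tp}_{pp}_{ep}"
    return "mp_rank_{tp}_{pp}" if pp_size > 1 else "mp_rank_{tp}"
```

Note that `pp_size` never selects the 3D pattern here; it is consulted only after the EP question has been answered.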
Bad assumption:
pp > 1, so the loader should choose a 3D reader.
Why it fails:
Dense, non-MoE checkpoints with TP+PP usually use mp_rank_{tp}_{pp} only.
A 3D loader then looks for an EP coordinate that is not present and fails with an error.
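To make the failure concrete, here is a small sketch in which a strict 3D reader is modeled as a regular expression that demands three coordinates. The shard names are hypothetical examples of a dense tp=2, pp=2 checkpoint, and the zero-padding shown is an assumption.

```python
import re
from typing import List

# A strict 3D reader: all three coordinates (tp, pp, ep) are mandatory.
STRICT_3D = re.compile(r"^mp_rank_(\d+)_(\d+)_(\d+)$")


def unmatched_by_3d_reader(dir_names: List[str]) -> List[str]:
    """Return the shard names that a strict 3D reader cannot parse."""
    return [name for name in dir_names if STRICT_3D.match(name) is None]


# Hypothetical dense (non-MoE) checkpoint sharded with tp=2 and pp=2.
dense_shards = ["mp_rank_00_000", "mp_rank_00_001", "mp_rank_01_000", "mp_rank_01_001"]
unreadable = unmatched_by_3d_reader(dense_shards)
```

Every name in `dense_shards` ends up in `unreadable`, because none of them carries an EP coordinate, even though pp > 1.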
For this repository, follow this rule:
If expert_parallel_size is passed, require 3D.
If expert_parallel_size is not passed, use non-EP loading.
This matches Megatron's path-building logic better than using pp > 1 as the branch condition.
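The rule can be expressed as a branch on whether `expert_parallel_size` was supplied. This is only a sketch: `select_loader` and its return labels are hypothetical names, and the repository's real loader entry points may look different.

```python
from typing import Optional


def select_loader(
    expert_parallel_size: Optional[int] = None,
    pipeline_model_parallel_size: int = 1,
) -> str:
    """Choose a loader kind based on whether expert_parallel_size was passed.

    pipeline_model_parallel_size is accepted only to show that it plays no
    part in the EP decision.
    """
    if expert_parallel_size is not None:
        # EP is part of the contract: require mp_rank_{tp}_{pp}_{ep}.
        return "ep_aware"
    # No EP argument: read mp_rank_{tp} or mp_rank_{tp}_{pp}.
    return "non_ep"
```

For example, `select_loader(pipeline_model_parallel_size=8)` still returns `"non_ep"`, while `select_loader(expert_parallel_size=1)` returns `"ep_aware"` because the argument was passed at all.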
What are the actual shard directory names under the checkpoint root?
Was expert_parallel_size provided by the caller?
Is the model dense or MoE?
Is the loader branching on EP, or incorrectly branching on PP?
If conversion failed, which exact mp_rank_* pattern was expected and which one exists on disk?
When asked to analyze a checkpoint issue, return:
the inferred layout class: 1D, 2D, or 3D
the reason for that classification
the expected directory naming pattern
whether the caller should use a non-EP loader or an EP-aware loader
any mismatch between runtime arguments and on-disk shard layout
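One possible shape for that answer is a small report object filled in from the shard directory names and the caller's arguments. The sketch below is an assumption-laden illustration: `LayoutReport`, `analyze`, and the message strings are hypothetical, and the 1D versus 2D decision is read from the names on disk.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

_SHARD_RE = re.compile(r"^mp_rank_(\d+)(?:_(\d+))?(?:_(\d+))?$")
_PATTERNS = {
    "1D": "mp_rank_{tp}",
    "2D": "mp_rank_{tp}_{pp}",
    "3D": "mp_rank_{tp}_{pp}_{ep}",
}


@dataclass
class LayoutReport:
    layout_class: str        # "1D", "2D", or "3D"
    reason: str              # why this class was chosen
    expected_pattern: str    # directory naming pattern the loader should require
    loader: str              # "non-EP" or "EP-aware"
    mismatches: List[str]    # disagreements between arguments and disk


def analyze(dir_names: List[str], expert_parallel_size: Optional[int]) -> LayoutReport:
    # Count numeric coordinates in every mp_rank_* directory name.
    counts = set()
    for name in dir_names:
        match = _SHARD_RE.match(name)
        if match:
            counts.add(sum(group is not None for group in match.groups()))

    if expert_parallel_size is not None:
        layout, loader = "3D", "EP-aware"
        reason = "expert_parallel_size was passed, so EP is part of the checkpoint contract"
    else:
        layout, loader = ("2D" if 2 in counts else "1D"), "non-EP"
        reason = "expert_parallel_size was not passed; 1D vs 2D is read from shard names"

    expected = int(layout[0])
    mismatches = [
        f"found {count}-coordinate shard directories, expected {expected}"
        for count in sorted(counts - {expected})
    ]
    if not counts:
        mismatches.append("no mp_rank_* directories found")
    return LayoutReport(layout, reason, _PATTERNS[layout], loader, mismatches)
```

Passing two-coordinate shard names together with an `expert_parallel_size` produces a 3D report whose `mismatches` list flags the disagreement between the runtime argument and the on-disk layout.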