model-normalize
// Normalize an HF model checkpoint — quantize bf16/fp16 weights to per-block FP8, and/or repack engine-named safetensors into HF-standard ~4GB shards.
| name | model_normalize |
| description | Normalize an HF model checkpoint — quantize bf16/fp16 weights to per-block FP8, and/or repack engine-named safetensors into HF-standard ~4GB shards. |
Two complementary capabilities for getting a training-engine checkpoint into a format that HF / vLLM / SGLang can consume:
- to-fp8: quantize bf16/fp16 weights to per-block FP8.
- repack: repack engine-named safetensors into the HF-standard model-{i:05d}-of-{n:05d}.safetensors layout with ~4GB shards and a fresh model.safetensors.index.json.

When to-fp8 detects non-standard source shard names, it automatically
runs the repack step on the quantized output, so the user-visible result is
always a clean HF-conformant directory.
.claude/skills/model_normalize/
├── SKILL.md # this file
├── model_normalize.py # top-level CLI (to-fp8 / repack)
├── hf_to_fp8.py # FP8 conversion library + standalone CLI
├── heuristics.py # codified default quantization rule set
├── repack_hf.py # 4GB-shard repacker library + CLI
└── convert_interns2_to_fp8.sh # example invocation for InternS2
python model_normalize.py to-fp8 \
--source /path/to/bf16-model \
--output /path/to/fp8-model \
--reference /path/to/fp8-reference
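In reference-guided mode, the set of tensors to quantize is read off the reference checkpoint's index. The following is a minimal sketch of that selection, not the code in hf_to_fp8.py: the helper names quantized_names and load_reference_index are hypothetical, and the sketch assumes the reference index follows the usual HF layout with a weight_map dictionary keyed by tensor name.

```python
import json
from pathlib import Path

SCALE_SUFFIX = "_scale_inv"


def quantized_names(index: dict) -> set[str]:
    """Return the tensors that have a companion <name>_scale_inv entry.

    `index` is the parsed model.safetensors.index.json of the reference
    (assumed layout: {"weight_map": {tensor_name: shard_file}}).
    """
    weight_map = index["weight_map"]
    return {
        name[: -len(SCALE_SUFFIX)]
        for name in weight_map
        if name.endswith(SCALE_SUFFIX)
    }


def load_reference_index(reference_dir: str) -> dict:
    """Read model.safetensors.index.json from a reference directory."""
    path = Path(reference_dir) / "model.safetensors.index.json"
    return json.loads(path.read_text())
```

Any tensor whose name is not in the returned set would be left in its original dtype.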
Behavior:
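- Reads the reference's model.safetensors.index.json.
- For each <name>_scale_inv entry in the reference index, marks <name> as an FP8-quantized tensor.

Without --reference, to-fp8 falls back to the heuristic rule set:

python model_normalize.py to-fp8 \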
--source /path/to/bf16-model \
--output /path/to/fp8-model
The rule set is defined in heuristics.py. Concretely, quantize iff the
tensor name matches one of:
| Group | Pattern (anchored) |
|---|---|
| Self-attention projections | […]layers.<i>.self_attn.{q,k,v,o}_proj.weight |
| Dense MLP | […]layers.<i>.mlp.{gate,up,down}_proj.weight |
| MoE per-expert (unfused) | […]layers.<i>.mlp.experts.<j>.{gate,up,down}_proj.weight |
| MoE fused-expert (3D tensors) | […]layers.<i>.mlp.experts.{gate_up_proj, down_proj} |
| MoE shared expert | […]layers.<i>.mlp.shared_expert.{gate,up,down}_proj.weight |
| Linear-attn wide projections | […]layers.<i>.linear_attn.{in_proj_qkv, in_proj_z, out_proj}.weight |
The […] prefix allows model. and model.language_model.; the same
patterns repeat under mtp.layers.<i>. for multi-token-prediction blocks.
Everything else is kept in the original dtype, in particular:
- Normalization layers (*norm*, *_layernorm.*)
- Router and gate weights (mlp.gate.weight, mlp.shared_expert_gate.weight)
- Embeddings, output heads, and related projections (embed_tokens, lm_head, pos_embed, patch_embed, mtp.fc)
- Vision tower (model.visual.*)
- Biases (*.bias)
- linear_attn control-flow tensors (A_log, conv1d.weight, dt_bias, in_proj_a/b.weight, norm.weight)

This rule set was validated against the InternS2 / Qwen3-MoE FP8 references: it reproduces their quantization decisions on every tensor (no false positives or false negatives).
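The pattern table above can be expressed as a single anchored regular expression. This is a sketch under stated assumptions, not the contents of heuristics.py: the names should_quantize, _PREFIX, and _SUFFIXES are hypothetical, and the handling of the mtp. prefix is one possible reading of "the same patterns repeat under mtp.layers.<i>.".

```python
import re

# Assumed prefix rule: "model." or "model.language_model.", plus "mtp."
# for multi-token-prediction blocks.
_PREFIX = r"(?:model\.(?:language_model\.)?|mtp\.)layers\.\d+\."

# One suffix per row of the pattern table.
_SUFFIXES = [
    r"self_attn\.[qkvo]_proj\.weight",
    r"mlp\.(?:gate|up|down)_proj\.weight",
    r"mlp\.experts\.\d+\.(?:gate|up|down)_proj\.weight",
    r"mlp\.experts\.(?:gate_up_proj|down_proj)",  # fused 3D tensors
    r"mlp\.shared_expert\.(?:gate|up|down)_proj\.weight",
    r"linear_attn\.(?:in_proj_qkv|in_proj_z|out_proj)\.weight",
]

_QUANTIZE_RE = re.compile(_PREFIX + "(?:" + "|".join(_SUFFIXES) + ")")


def should_quantize(name: str) -> bool:
    """True iff the full tensor name matches one of the table patterns."""
    return _QUANTIZE_RE.fullmatch(name) is not None
```

Using fullmatch keeps the patterns anchored at both ends, so router weights such as mlp.gate.weight and anything under model.visual fall through to "keep original dtype" without needing an explicit exclusion list.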
python model_normalize.py repack \
--source /path/to/messy-shards \
--output /path/to/standard-shards \
--shard-size-gb 4
Greedy bin-packs tensors into model-{i:05d}-of-{n:05d}.safetensors shards
of at most --shard-size-gb GiB, writes a fresh
model.safetensors.index.json (including metadata.total_size), and
copies all non-safetensors files (config, tokenizer, etc.) through.
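The packing step can be illustrated without touching any tensor data. The sketch below plans shards from a name-to-size mapping; it is not the code in repack_hf.py. The function name plan_shards is hypothetical, and two behaviors are assumptions: tensors are visited in the order given, and a tensor larger than the limit gets a shard of its own.

```python
def plan_shards(sizes: dict[str, int], max_bytes: int) -> dict:
    """Greedily assign tensors to shards and build an HF-style index.

    `sizes` maps tensor name -> size in bytes. A new shard is opened
    whenever adding the next tensor would exceed `max_bytes`.
    """
    shards: list[list[str]] = [[]]
    used = 0
    for name, nbytes in sizes.items():
        if shards[-1] and used + nbytes > max_bytes:
            shards.append([])
            used = 0
        shards[-1].append(name)
        used += nbytes

    total = len(shards)
    weight_map = {}
    for i, names in enumerate(shards, start=1):
        fname = f"model-{i:05d}-of-{total:05d}.safetensors"
        for name in names:
            weight_map[name] = fname

    return {
        "metadata": {"total_size": sum(sizes.values())},
        "weight_map": weight_map,
    }
```

For a limit given as --shard-size-gb 4, the corresponding byte budget in this sketch would be 4 * 1024**3. The shard count is only known after packing, which is why file names are assigned in a second pass.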
| Situation | Mode |
|---|---|
| You already have an FP8 reference for this architecture. | Reference-guided |
| New architecture / first-time conversion; standard naming. | Heuristic |
| Quantized model has engine-specific shard names; need HF-conformant output. | (auto, see below) |
| You only need to standardize shard naming; weights are already FP8 / bf16. | repack |
The to-fp8 subcommand auto-detects non-standard source shard names (anything
that isn't model.safetensors or model-NNNNN-of-NNNNN.safetensors) and
chains the repack step. Pass --no-repack to keep the source naming.
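The detection rule amounts to a file-name check. A minimal sketch, assuming NNNNN means exactly five digits (as in the {i:05d} format used for output shards); the name has_nonstandard_shards is hypothetical and is not taken from model_normalize.py.

```python
import re

# Standard names: model.safetensors or model-NNNNN-of-NNNNN.safetensors
_STANDARD_SHARD = re.compile(r"model(?:-\d{5}-of-\d{5})?\.safetensors")


def has_nonstandard_shards(filenames) -> bool:
    """True if any .safetensors file name deviates from the HF convention."""
    return any(
        f.endswith(".safetensors") and _STANDARD_SHARD.fullmatch(f) is None
        for f in filenames
    )
```

Non-safetensors files such as config.json are ignored by the check, so only shard naming decides whether repacking would be chained.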
- Weights are quantized to float8_e4m3fn with saturating cast.
- Scales are stored as <name>_scale_inv (float32) in the same shard as the parent weight. Layout matches what vLLM / SGLang load for per-block FP8.
- Fused 3D expert tensors (shape (E, D0, D1)) are quantized per-expert; the per-expert scales are stacked along a leading dim.
- config.json receives a quantization_config block with quant_method=fp8, weight_block_size=[128, 128], scale_fmt=ue8m0, and an explicit modules_to_not_convert list derived from the actual quantization decisions.

After any successful run of to-fp8 or repack, the agent must ask the
user — via AskUserQuestion — where to place the produced
*.safetensors and model.safetensors.index.json files. The conversion
output is treated as a staging area; the user decides the final home.
Offer at least these options:
- Keep the files at the --output path as-is.
- Copy them to another location chosen by the user.

Always copy (cp), never move: the --output staging directory must remain intact unless the user explicitly asks to delete it as a separate step.

If the user picks a copy option, execute the copy with cp (confirming the
destination path back to the user before any overwrite). Do not infer the
destination from earlier turns, since the user may have a fresh target in mind
each time. Cleaning up the staging directory afterwards is a separate,
user-initiated step; never delete it as part of the handoff.
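The copy policy can also be written down as a small Python helper, as an illustration of the rules rather than a replacement for the cp command the skill prescribes. The function copy_outputs and its overwrite flag are hypothetical; the flag stands in for the confirmation the agent would obtain from the user.

```python
import shutil
from pathlib import Path


def copy_outputs(staging: str, dest: str, overwrite: bool = False) -> list[Path]:
    """Copy *.safetensors and the index from staging to dest; never move.

    Raises FileExistsError if a destination file exists and `overwrite`
    is False, so nothing is replaced without explicit consent.
    """
    src, dst = Path(staging), Path(dest)
    files = sorted(src.glob("*.safetensors"))
    index = src / "model.safetensors.index.json"
    if index.exists():
        files.append(index)

    clashes = [f.name for f in files if (dst / f.name).exists()]
    if clashes and not overwrite:
        raise FileExistsError(f"would overwrite: {clashes}")

    dst.mkdir(parents=True, exist_ok=True)
    # copy2 leaves the source in place, so the staging directory stays intact.
    return [Path(shutil.copy2(f, dst / f.name)) for f in files]
```

The clash check runs before any file is written, so a refused overwrite leaves the destination untouched.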
python model_normalize.py to-fp8 \
--source /mnt/shared-storage-user/puyudelivery/user/yehaochen/models/InternS2PreviewCandidate5 \
--output /tmp/InternS2-FP8 \
--reference /mnt/shared-storage-user/puyudelivery/user/yehaochen/models/InternS2PreviewCandidate5-FP8
The source uses engine-specific names like
model-language-0001-fused-save_rank0.safetensors; the output will be
repacked into the standard model-00001-of-NNNNN.safetensors layout.