sglang-diffusion-modelopt-quant
Use when quantizing a diffusion DiT with NVIDIA ModelOpt and making the resulting FP8 or NVFP4 checkpoint loadable, verifiable, and benchmarkable in SGLang Diffusion.
| name | sglang-diffusion-modelopt-quant |
| description | Use when quantizing a diffusion DiT with NVIDIA ModelOpt and making the resulting FP8 or NVFP4 checkpoint loadable, verifiable, and benchmarkable in SGLang Diffusion. |
Use this skill when the task is to take a diffusion transformer through the full ModelOpt workflow:
This skill owns the ModelOpt-to-SGLang bridge. It is not a generic kernel-tuning skill.
- `quantize.py` as the PTQ source of truth.
- `dit_cpu_offload=false`. `dit_layerwise_offload=true` is valid on the fixed path when you want lower DiT residency.
- `SGLANG_DIFFUSION_FLASHINFER_FP4_GEMM_BACKEND=cudnn`; benchmark the default CUTLASS path separately if that is what you are evaluating.
- `python/sglang/multimodal_gen/tools/build_modelopt_fp8_transformer.py`, `python/sglang/multimodal_gen/tools/build_modelopt_nvfp4_transformer.py`, and `python/sglang/multimodal_gen/tools/compare_diffusion_trajectory_similarity.py` instead of inventing one-off scripts elsewhere.
- `docs/diffusion/quantization.md` before closing the task.

Read these sources before changing code:
- `examples/diffusers/README.md`
- `examples/diffusers/quantization/quantize.py`
- `examples/diffusers/quantization/config.py`
- `python/sglang/multimodal_gen/runtime/layers/quantization/modelopt_quant.py`
- `python/sglang/multimodal_gen/runtime/utils/quantization_utils.py`
- `python/sglang/multimodal_gen/runtime/loader/transformer_load_utils.py`

If you are working on a new model family, inspect the transformer's config and tensor naming before changing the generic converter.
This repo now contains:
- `quant_method=modelopt` plus `quant_algo=FP8/NVFP4` resolution
- `python/sglang/multimodal_gen/tools/build_modelopt_fp8_transformer.py`
- `python/sglang/multimodal_gen/tools/build_modelopt_nvfp4_transformer.py`
- `python/sglang/multimodal_gen/tools/compare_diffusion_trajectory_similarity.py`

Validated documentation and CI coverage currently center on these ModelOpt diffusion transformer override families:
Treat a new family, a new precision, or a new checkpoint layout as unsupported until it has a documented matrix row and a matching validation story.
Before writing CLI examples, re-read the active branch's docs/diffusion/quantization.md: FLUX.2 NVFP4 is an official black-forest-labs/* repo rather than a lmsys/* converted repo, and its preferred flag depends on the current documented loader flow. Use --transformer-path for a component override directory with config.json; use --transformer-weights-path when the repo or path should be probed as raw weights.
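The flag rule in the paragraph above can be sketched for local paths as follows. This is only an illustration under stated assumptions: `pick_transformer_flag` is a hypothetical helper, not part of SGLang, and it encodes nothing beyond the `config.json` rule stated here. The active branch's `docs/diffusion/quantization.md` stays the authority, in particular for hub repos such as the FLUX.2 NVFP4 export, which this sketch does not try to classify.

```python
from pathlib import Path


def pick_transformer_flag(path: str) -> str:
    """Pick a CLI flag for a local transformer override (hypothetical helper)."""
    candidate = Path(path)
    # A component override directory is expected to carry its own config.json.
    if candidate.is_dir() and (candidate / "config.json").is_file():
        return "--transformer-path"
    # Assumption: any other local path is treated as raw weights to be probed.
    return "--transformer-weights-path"
```

Running it against the converter's output directory before writing a CLI example is a cheap way to notice a missing `config.json`.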
B200 CI coverage can include loose BF16-vs-quantized quality checks. Inspect the active branch's run_suite.py before assuming they are part of the suite; mainline and feature branches may differ. Those checks are intended to catch blank, corrupted, or obviously divergent images, not exact image parity.
Mainline documentation now uses lmsys/* for the eight converted ModelOpt
checkpoint repos; the FLUX.2 NVFP4 raw export remains
black-forest-labs/FLUX.2-dev-NVFP4. Do not use older BBuf/* examples unless
you are explicitly testing a historical branch.
As of 2026-05-04, these related SGLang PRs are relevant to ModelOpt diffusion support. Treat unmerged items as future support or migration work until the docs/CI matrix is updated.
Do not expand the validated matrix beyond the documented rows solely because a related PR exists. Add a row only after the exact checkpoint, loader path, accuracy check, and benchmark scope are validated on the active branch.
- `docs/diffusion/quantization.md`.
- `unpublished` explicitly instead of leaving the field blank.

FP8 and NVFP4 are not wired into SGLang in exactly the same way.
FP8:
- `weight_scale` and `input_scale`
- `float8_e4m3fn` weights from `backbone.pt`

NVFP4:
Important caveat:
Before quantizing anything:
- `perf.json`

Do not start quantization work until the BF16 path is already healthy.
Use ModelOpt's official script. Generic template:
python quantize.py \
--model <model-name> \
--override-model-path <hf-repo-or-local-model> \
--model-dtype <Half|BFloat16> \
--format <fp8|fp4> \
--batch-size 1 \
--calib-size <calib-size> \
--n-steps <calib-steps> \
--quantize-mha \
--prompts-file <prompt-file> \
--quantized-torch-ckpt-save-path <out>/ckpt \
--hf-ckpt-dir <out>/hf
For current ModelOpt diffusion examples, use --format fp4 for NVFP4 exports.
Do not assume the checked-out ModelOpt version accepts a literal nvfp4 format string unless you verified it locally.
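A minimal sketch of assembling the template above programmatically, assuming you drive `quantize.py` from Python. `build_quantize_cmd` and its parameter names are hypothetical; only the flags already shown in the template are used, and the format check mirrors the note about `fp4` versus a literal `nvfp4`.

```python
import shlex

# The two --format values named in this section. Anything else is rejected
# here rather than assumed to work with the checked-out ModelOpt version.
KNOWN_FORMATS = ("fp8", "fp4")


def build_quantize_cmd(model, model_path, model_dtype, fmt,
                       calib_size, n_steps, prompts_file, out_dir):
    """Return the quantize.py argv list for the generic template (hypothetical helper)."""
    if fmt not in KNOWN_FORMATS:
        raise ValueError(
            f"unverified --format value {fmt!r}; expected one of {KNOWN_FORMATS}"
        )
    return [
        "python", "quantize.py",
        "--model", model,
        "--override-model-path", model_path,
        "--model-dtype", model_dtype,
        "--format", fmt,
        "--batch-size", "1",
        "--calib-size", str(calib_size),
        "--n-steps", str(n_steps),
        "--quantize-mha",
        "--prompts-file", prompts_file,
        "--quantized-torch-ckpt-save-path", f"{out_dir}/ckpt",
        "--hf-ckpt-dir", f"{out_dir}/hf",
    ]


if __name__ == "__main__":
    # Calibration numbers below are placeholders, not recommendations.
    cmd = build_quantize_cmd("<model-name>", "<hf-repo-or-local-model>", "BFloat16",
                             "fp4", 8, 4, "<prompt-file>", "<out>")
    print(shlex.join(cmd))  # inspect first, then run with subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell-quoting surprises when a prompt file or output path contains spaces.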
For multi-transformer models:
- `backbone.pt` and the matching `hf/<component>` export

FP8 requires an extra conversion step:
PYTHONPATH=python python3 -m sglang.multimodal_gen.tools.build_modelopt_fp8_transformer \
--modelopt-hf-dir <out>/hf \
--modelopt-backbone-ckpt <out>/ckpt/backbone.pt \
--base-transformer-dir <base-model-transformer-dir> \
--output-dir <out>/sglang_transformer \
--overwrite
What the converter does:
- `weight_quantizer._amax` and `input_quantizer._amax` from `backbone.pt`
- `weight_scale` and `input_scale`
- `float8_e4m3fn`
- `ignore` layers as BF16
- `_quantizer.*` tensors and fallback-layer scales that should not survive into the SGLang-native checkpoint

For FLUX.1-dev, the validated fallback set currently keeps these modules in BF16:
- `transformer_blocks.*.norm1.linear`
- `transformer_blocks.*.norm1_context.linear`
- `transformer_blocks.*.ff.net.0.proj`
- `transformer_blocks.*.ff.net.2`
- `transformer_blocks.*.ff_context.net.0.proj`
- `transformer_blocks.*.ff_context.net.2`
- `single_transformer_blocks.*.norm.linear`
- `single_transformer_blocks.*.proj_mlp`

Use `--model-type flux1` to force that profile, or rely on `--model-type auto` when the export config identifies `FluxTransformer2DModel`.
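To check by hand whether a given module would fall under the FLUX.1-dev BF16 fallback set, the patterns above can be matched with shell-style globbing. This is a sketch, assuming `*` stands for a block index; the converter in `build_modelopt_fp8_transformer.py` may use a different matcher, and `is_bf16_fallback` is a hypothetical name. The module names in the usage lines are illustrative.

```python
from fnmatch import fnmatchcase

# Patterns copied from the validated FLUX.1-dev fallback list in this section.
FLUX1_BF16_FALLBACK_PATTERNS = (
    "transformer_blocks.*.norm1.linear",
    "transformer_blocks.*.norm1_context.linear",
    "transformer_blocks.*.ff.net.0.proj",
    "transformer_blocks.*.ff.net.2",
    "transformer_blocks.*.ff_context.net.0.proj",
    "transformer_blocks.*.ff_context.net.2",
    "single_transformer_blocks.*.norm.linear",
    "single_transformer_blocks.*.proj_mlp",
)


def is_bf16_fallback(module_name, patterns=FLUX1_BF16_FALLBACK_PATTERNS):
    """Return True when the module name matches any fallback pattern."""
    # fnmatchcase matches the whole string, so "transformer_blocks.*" patterns
    # do not accidentally match "single_transformer_blocks.*" modules.
    return any(fnmatchcase(module_name, pattern) for pattern in patterns)


if __name__ == "__main__":
    for name in ("transformer_blocks.0.ff.net.2", "transformer_blocks.0.attn.to_q"):
        print(name, is_bf16_fallback(name))
```

Pass module names without the `.weight` or `.bias` suffix, since the patterns describe modules rather than tensors.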
HunyuanVideo uses HunyuanVideoTransformer3DModel, so the validated
HunyuanVideo FP8 fallback preset keeps these modules in BF16:
- `context_embedder.*`
- `x_embedder.proj`
- `time_text_embed.(timestep_embedder|guidance_embedder|text_embedder).linear_[12]`
- `norm_out.linear`
- `proj_out`
- `transformer_blocks.*.norm1.linear`
- `transformer_blocks.*.norm1_context.linear`
- `single_transformer_blocks.*.norm.linear`

Use `--model-type hunyuan-video` to force that profile, or rely on
`--model-type auto` when the export config identifies
`HunyuanVideoTransformer3DModel`.
HunyuanVideo ModelOpt exports use diffusers module names that differ from
SGLang runtime names for fused QKV and fused QKV+MLP layers. Keep the
diffusers-to-runtime mapping in build_modelopt_fp8_transformer.py in sync
with runtime/models/dits/hunyuanvideo.py before trusting converted scale
tensors.
Qwen Image and Qwen Image Edit share QwenImageTransformer2DModel, so one
ModelOpt FP8 fallback preset covers both. The validated Qwen Image fallback set
keeps these modules in BF16:
- `img_in`
- `txt_in`
- `time_text_embed.timestep_embedder.linear_1`
- `time_text_embed.timestep_embedder.linear_2`
- `norm_out.linear`
- `proj_out`
- `transformer_blocks.*.img_mlp.net.2`
- `transformer_blocks.*.img_mod`
- `transformer_blocks.*.txt_mod`

Use `--model-type qwen-image` to force that profile, or rely on
`--model-type auto` when the export config identifies
`QwenImageTransformer2DModel`.
Qwen modulation weights can appear in safetensors as .img_mod.1.weight and
.txt_mod.1.weight. Canonicalize those module names to .img_mod and
.txt_mod before fallback matching.
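The canonicalization described above can be sketched with a small regular expression. This is an illustration, not the converter's code: `canonical_module_name` is a hypothetical helper, and stripping a `.bias` suffix in addition to `.weight` is an assumption.

```python
import re

# Matches ".img_mod.1" or ".txt_mod.1" only when ".1" ends the name or is
# followed by another dotted component, so ".img_mod.10" is left alone.
_MOD_INDEX = re.compile(r"\.(img_mod|txt_mod)\.1(?=\.|$)")


def canonical_module_name(tensor_key: str) -> str:
    """Map a safetensors key to the module name used for fallback matching."""
    module = tensor_key
    for suffix in (".weight", ".bias"):  # ".bias" handling is an assumption
        if module.endswith(suffix):
            module = module[: -len(suffix)]
            break
    return _MOD_INDEX.sub(r".\1", module)


if __name__ == "__main__":
    print(canonical_module_name("transformer_blocks.0.img_mod.1.weight"))
```

The result can then be compared against fallback entries such as `transformer_blocks.*.img_mod`.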
For Qwen Image FP8, explicit BF16 fallback tensors must be written before honoring ModelOpt ignored weights. Otherwise converter stats can report a fallback while the output checkpoint still retains the source FP8 tensor, which causes severe image-quality regressions.
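One reading of this ordering rule, assuming the converter decides per module which checkpoint supplies the weight, is that the fallback check has to run before the ignore check. The function name and the returned labels below are hypothetical and do not mirror `build_modelopt_fp8_transformer.py`.

```python
def pick_weight_source(module, bf16_fallback, modelopt_ignored):
    """Decide where a module's weight should come from (illustrative only)."""
    # Checked first on purpose: a module listed in both sets must take the
    # BF16 tensor from the base transformer, not the tensor in the export.
    if module in bf16_fallback:
        return "base_bf16"
    # Only modules outside the fallback set keep the export's tensor as-is.
    if module in modelopt_ignored:
        return "modelopt_export_as_is"
    return "modelopt_fp8"


if __name__ == "__main__":
    fallback = {"transformer_blocks.0.img_mod"}
    ignored = {"transformer_blocks.0.img_mod", "proj_out"}
    print(pick_weight_source("transformer_blocks.0.img_mod", fallback, ignored))
```

Swapping the two `if` branches reproduces the failure mode described above: the module is counted as a fallback in one place while the export's tensor is what actually gets written.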
For NVFP4 model families such as FLUX.1-dev that need a mixed BF16+NVFP4 checkpoint, build the merged transformer explicitly:
PYTHONPATH=python python3 -m sglang.multimodal_gen.tools.build_modelopt_nvfp4_transformer \
--base-transformer-dir <base-model-transformer-dir> \
--modelopt-hf-dir <out>/hf/transformer \
--output-dir <out>/transformer-mixed \
--pattern-preset flux1-nvfp4
The validated FLUX.1-dev mixed builder also needs to preserve:
- `quant_type: NVFP4` in `config.json`
- `swap_weight_nibbles: false` for the validated diffusers export

Single-transformer example:
sglang generate \
--model-path <base-model> \
--transformer-path <quantized-transformer> \
--prompt "<prompt>" \
--seed <seed> \
--save-output
Multi-transformer example:
sglang generate \
--model-path <base-model> \
--transformer-path <quantized-transformer> \
--transformer-2-path <another-transformer-or-bf16-override> \
--prompt "<prompt>" \
--seed <seed> \
--save-output
Guideline:
- `--transformer-path` only when the model effectively has one transformer override to apply
- `--<component>-path`
- `--component_paths.transformer_2=...` also resolve to the same internal override map

Use two levels of validation.
Reduced deterministic validation:
Tool:
PYTHONPATH=python python3 -m sglang.multimodal_gen.tools.compare_diffusion_trajectory_similarity \
--model-path <base-model> \
--model-id <optional-native-model-id> \
--prompt "<prompt>" \
--width <w> \
--height <h> \
--num-inference-steps <steps> \
--guidance-scale <cfg> \
--seed <seed> \
--candidate-transformer-path <quantized-transformer> \
--output-json <report.json>
Use --model-id FLUX.1-dev when --model-path points to a local directory but the runtime still needs the native FLUX.1 model registration.
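For intuition about what a trajectory comparison measures, here is a sketch of one plausible metric: per-step cosine similarity between a BF16 latent trajectory and a quantized one. The metrics actually computed by `compare_diffusion_trajectory_similarity.py` and the layout of its JSON report are not described in this section, so everything below, including the function names and the flat-list representation of a latent, is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def trajectory_similarity(reference, candidate):
    """Compare two trajectories step by step (one flattened latent per step)."""
    if len(reference) != len(candidate):
        raise ValueError("trajectories must have the same number of steps")
    return [cosine_similarity(r, c) for r, c in zip(reference, candidate)]


if __name__ == "__main__":
    bf16 = [[1.0, 0.0], [0.5, 0.5]]
    quantized = [[0.9, 0.1], [0.5, 0.4]]
    print(trajectory_similarity(bf16, quantized))
```

A healthy quantized checkpoint would be expected to stay close to 1.0 at every step, with the lowest value pointing at the step where the two runs start to diverge.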
Full-output validation:
Benchmark only when these match between BF16 and quantized:
Only the quantized checkpoint path should differ.
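The parity rule above can be enforced mechanically before a benchmark starts. The sketch below compares two run configurations and rejects any difference other than the checkpoint path. The helper names and the dictionary keys (`transformer_path`, `seed`, and so on) are hypothetical; substitute whatever settings your benchmark harness actually records.

```python
_MISSING = object()  # sentinel so an absent key differs from an explicit None


def differing_keys(bf16_run, quant_run):
    """Return the set of setting names whose values differ between two runs."""
    keys = set(bf16_run) | set(quant_run)
    return {
        key for key in keys
        if bf16_run.get(key, _MISSING) != quant_run.get(key, _MISSING)
    }


def check_benchmark_parity(bf16_run, quant_run, allowed=("transformer_path",)):
    """Raise if the runs differ in anything other than the allowed keys."""
    unexpected = differing_keys(bf16_run, quant_run) - set(allowed)
    if unexpected:
        raise ValueError(
            "benchmark settings differ beyond the checkpoint path: "
            + ", ".join(sorted(unexpected))
        )


if __name__ == "__main__":
    bf16 = {"model_path": "<base-model>", "seed": 0, "transformer_path": None}
    quant = dict(bf16, transformer_path="<quantized-transformer>")
    check_benchmark_parity(bf16, quant)  # passes silently
```

Failing fast here is cheaper than discovering after a long run that the two measurements used different seeds or step counts.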
Interpretation rule:
If the generic FP8 path fails on a new model family:
Do not turn one validated model quirk into a generic rule unless another family also needs it.
Current diffusion ModelOpt FP8 support requires:
- `dit_cpu_offload=false`
- `dit_layerwise_offload` may be enabled when you want lower DiT residency

Reason:

- `dit_cpu_offload` is still treated conservatively

Runtime behavior:

- `dit_cpu_offload` when it detects `modelopt_fp8`

When documenting results:
| File | Role |
|---|---|
| `runtime/layers/quantization/__init__.py` | registers diffusion quant methods |
| `runtime/layers/quantization/modelopt_quant.py` | ModelOpt FP8 and NVFP4 runtime loading |
| `runtime/utils/quantization_utils.py` | resolves flat ModelOpt configs and reconstructs NVFP4 config from metadata |
| `runtime/loader/transformer_load_utils.py` | guards incompatible FP8 offload modes |
| `runtime/models/dits/flux_2.py` | packed-QKV handling for the packed FLUX.2 NVFP4 family |
| `tools/build_modelopt_fp8_transformer.py` | Build an SGLang-loadable FP8 transformer from a ModelOpt export |
| `tools/build_modelopt_nvfp4_transformer.py` | Build mixed BF16+NVFP4 transformer directories when a family needs preserved BF16 layers |
| `tools/compare_diffusion_trajectory_similarity.py` | reduced deterministic BF16-vs-quantized validation |
| `docs/diffusion/quantization.md` | public ModelOpt support matrix and CLI examples |
| `test/server/testcase_configs.py` | reusable ModelOpt testcase constants, thresholds, and helpers |
| `test/server/gpu_cases.py` | concrete GPU and B200 ModelOpt CI case lists |