| name | ptq |
| description | This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt. |
ModelOpt Post-Training Quantization
Produce a quantized checkpoint from a pretrained model. Read examples/llm_ptq/README.md first — it has the support matrix, CLI flags, and accuracy guidance.
Step 1 — Environment
Read skills/common/environment-setup.md and skills/common/workspace-management.md. After completing them you should know:
- Whether the ModelOpt source is available
- Whether execution is local or remote (plus the cluster config if remote)
- The execution environment: SLURM, Docker+GPU, or bare GPU
- Whether the launcher is available
- Which workspace to use
Step 2 — Is the model supported?
Check the support table in examples/llm_ptq/README.md for verified HF models.
- Listed → supported, use hf_ptq.py (step 4A/4B)
- Not listed → read references/unsupported-models.md to determine if hf_ptq.py can still work or if a custom script is needed (step 4C)
Step 2.5 — Check for model-specific dependencies
If the model uses trust_remote_code (check config.json for auto_map), inspect its custom Python files for imports not present in the container:
grep -h "^from \|^import " <model_path>/modeling_*.py | sort -u
Known dependency patterns:
| Import found | Packages to install |
|---|---|
| from mamba_ssm / from causal_conv1d | mamba-ssm causal-conv1d (Mamba/hybrid models: NemotronH, Jamba) |
If extra deps are needed:
- Launcher (4B): set EXTRA_PIP_DEPS in the task's environment section; ptq.sh installs them automatically
- Manual (4A): run unset PIP_CONSTRAINT && pip install <deps> before running hf_ptq.py
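The dependency check in this step can also be scripted. The following is a minimal sketch, not part of ModelOpt: the helper names uses_remote_code, imported_root_packages, and missing_imports are hypothetical, and it assumes the custom code sits in modeling_*.py files next to config.json. It parses imports with ast and reports root packages that importlib cannot locate in the current environment.

```python
import ast
import importlib.util
import json
from pathlib import Path


def uses_remote_code(model_dir):
    """Return True if config.json declares an auto_map (custom model code)."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return "auto_map" in config


def imported_root_packages(source):
    """Collect root package names from absolute imports.

    ast.walk visits every node, so imports guarded by try/if or placed
    inside functions are included as well. Relative imports are skipped.
    """
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            modules.add(node.module.split(".")[0])
    return modules


def missing_imports(model_dir):
    """List imported root packages that cannot be found in this environment."""
    wanted = set()
    for path in Path(model_dir).glob("modeling_*.py"):
        wanted |= imported_root_packages(path.read_text())
    return sorted(m for m in wanted if importlib.util.find_spec(m) is None)
```

Import names do not always equal pip package names (mamba_ssm is installed as mamba-ssm), so the result still needs to be mapped through the table of known patterns above.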
Step 3 — Choose quantization format
First, check for a model-specific recipe:
ls modelopt_recipes/models/ 2>/dev/null
If a model-specific recipe exists, use --recipe <path> — it may contain tuned settings.
If no model-specific recipe, choose a format based on GPU (details in examples/llm_ptq/README.md):
- Blackwell (B100/B200/GB200): nvfp4 variants
- Hopper (H100/H200) or older: fp8 or int4_awq
Use --qformat <name> (e.g., --qformat nvfp4). Format definitions: modelopt/torch/quantization/config.py. General PTQ recipes in modelopt_recipes/general/ptq/ correspond to the same formats — --qformat is the simpler way to use them.
Before running PTQ, sanity-check the selected qformat/recipe against the model structure. Inspect the recipe's include/exclude patterns and summarize which layer groups will be quantized and approximately how many modules/layers match (attention projections, MLP projections, experts, etc.). If the match count is 0, or far smaller than expected for the model, stop and fix the recipe or ask the user before launching calibration.
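As an illustration of this sanity check, the sketch below counts how many module names a set of include/exclude patterns would select, grouped by layer type. It assumes the patterns behave like glob wildcards as implemented by Python's fnmatch; ModelOpt's actual matching may differ, and the coverage helper and LAYER_GROUPS mapping are hypothetical names chosen for this example.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical layer groups, keyed by the suffix of the module name.
LAYER_GROUPS = {
    "attention": ("q_proj", "k_proj", "v_proj", "o_proj"),
    "mlp": ("gate_proj", "up_proj", "down_proj"),
}


def coverage(module_names, include, exclude=()):
    """Count modules per layer group that match include but no exclude pattern."""
    counts = Counter()
    for name in module_names:
        if not any(fnmatchcase(name, p) for p in include):
            continue
        if any(fnmatchcase(name, p) for p in exclude):
            continue
        group = next(
            (g for g, suffixes in LAYER_GROUPS.items() if name.endswith(suffixes)),
            "other",
        )
        counts[group] += 1
    return counts
```

With a loaded PyTorch model, the names could come from [n for n, _ in model.named_modules()]. An empty result, or a count far below the number of decoder layers, is the signal to stop before calibration.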
If the source checkpoint is already quantized and the requested recipe/config reduces quantization coverage, confirm that intent with the user before running. For example, if an FP8 checkpoint is used as input and the recipe excludes some layers so they would fall back to BF16 instead of staying quantized, call out the affected layer groups and ask whether that FP8-to-BF16 fallback is intended.
NVFP4 can be calibrated on Hopper but requires Blackwell for inference.
Step 4 — Run PTQ
Goal: checkpoint on disk (.safetensors + config.json).
For listed models (4A/4B): run full calibration directly (--calib_size 512).
For unlisted models (4C): run a smoke test first (--calib_size 4), wait for success, then full calibration.
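The smoke-test-then-full sequence can be sketched as a small driver. This is an illustrative wrapper, not a ModelOpt utility: ptq_command and run_ptq are hypothetical names, the flags are the ones used for hf_ptq.py in step 4A, and writing the smoke run to a separate "<export_path>-smoke" directory is an assumption made here to avoid mixing outputs.

```python
import subprocess
import sys


def ptq_command(model, qformat, export_path, calib_size):
    """Build the hf_ptq.py argument list from the flags shown in step 4A."""
    return [
        sys.executable, "examples/llm_ptq/hf_ptq.py",
        "--pyt_ckpt_path", model,
        "--qformat", qformat,
        "--calib_size", str(calib_size),
        "--export_path", export_path,
    ]


def run_ptq(model, qformat, export_path, listed, run=subprocess.run):
    """Run full calibration directly for listed models; smoke-test first otherwise."""
    if listed:
        plan = [(512, export_path)]
    else:
        # Assumption: the smoke run exports to a throwaway sibling directory.
        plan = [(4, export_path + "-smoke"), (512, export_path)]
    for calib_size, path in plan:
        # check=True raises CalledProcessError on failure, so a failed
        # smoke test stops before the expensive full calibration starts.
        run(ptq_command(model, qformat, path, calib_size), check=True)
    return [size for size, _ in plan]
```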
Which path?
In README table? ─→ YES ──→ SLURM (local or remote)? ──→ LAUNCHER (4B)
                 │          Local Docker + GPU? ────────→ LAUNCHER (4B)
                 │          Remote Docker (no SLURM)? ──→ MANUAL (4A)
                 │          Bare GPU (local or remote)? → MANUAL (4A)
                 │
                 └→ NOT LISTED ──→ UNLISTED MODEL (4C)
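The same decision tree, written as a function for readers who prefer code. The environment keys ("slurm", "local_docker", "remote_docker", "bare_gpu") are hypothetical labels introduced for this sketch.

```python
def choose_path(listed, environment):
    """Map the decision tree above to a step label (4A, 4B, or 4C)."""
    if not listed:
        return "4C"  # unlisted model, regardless of environment
    if environment in {"slurm", "local_docker"}:
        return "4B"  # launcher
    if environment in {"remote_docker", "bare_gpu"}:
        return "4A"  # manual execution
    raise ValueError(f"unknown environment: {environment!r}")
```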
4A — Direct: supported model, manual execution
pip install --no-build-isolation "nvidia-modelopt[hf]"
pip install -r examples/llm_ptq/requirements.txt
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path <model> \
--qformat <format> \
--calib_size 512 \
--export_path <output>
Run --help for all options.
For remote: use remote_run from remote_exec.sh (see skills/common/remote-execution.md).
4B — Launcher: supported model on SLURM or local Docker
Write a YAML config using common/hf_ptq/hf_ptq.sh. See references/launcher-guide.md for the full template.
cd tools/launcher
# SLURM cluster
SLURM_HOST=<host> SLURM_ACCOUNT=<acct> uv run launch.py --yaml <config.yaml> user=<ssh_user> identity=<ssh_key> --yes
# Local Docker
uv run launch.py --yaml <config.yaml> hf_local=<hf_cache> --yes
The launcher blocks and tails logs until the job completes. If the launcher fails (missing deps, config errors), fall back to path 4A (manual execution).
4C — Unlisted model
Follow references/unsupported-models.md. It walks through investigating the model, patching ModelOpt if needed, and running hf_ptq.py. Run manually (like 4A) for easier monitoring and debugging.
For SLURM, see skills/common/slurm-setup.md and references/slurm-setup-ptq.md.
Monitoring
After job submission, register the job and set up monitoring per the monitor skill.
Step 5 — Verify output
ls -lh <output_path>/
Report the path and size to the user.
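For a scripted version of this check, the sketch below confirms that the directory contains config.json and at least one .safetensors file and reports the total size. summarize_checkpoint is a hypothetical helper name, and it only looks at files directly inside the output directory.

```python
from pathlib import Path


def summarize_checkpoint(output_path):
    """Return file presence and total size for an exported checkpoint directory."""
    root = Path(output_path)
    shards = sorted(root.glob("*.safetensors"))
    total_bytes = sum(p.stat().st_size for p in root.iterdir() if p.is_file())
    return {
        "has_config": (root / "config.json").is_file(),
        "num_safetensors": len(shards),
        "total_gib": total_bytes / 2**30,
    }
```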
Post-quantization validation
This is a required gate before any deployment or evaluation submission. Do not submit an eval, start a serving job, or hand off the checkpoint as ready until the gate has passed.
Read references/checkpoint-validation.md and perform all three validation groups on the exact checkpoint path that will be deployed/evaluated:
- Check output size and estimated bits per weight against the baseline/source checkpoint.
- Check quantized-weight coverage against the requested qformat/recipe/config.
- Check metadata consistency against the baseline/source model.
Report the gate result before moving on. The report must include source size, output size, output/source size ratio, layer precision counts (for example NVFP4, FP8, INT4, BF16/unquantized excluded, unexpected unquantized, declaration mismatches), and metadata diffs. If the output/source ratio is >= 1.0 for a compression recipe, if any intended layer group is missing quantization, or if metadata changed unexpectedly, stop and fix the checkpoint or ask the user before proceeding.
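The stop conditions above can be expressed as a small check. This is a sketch under stated assumptions, not the procedure from references/checkpoint-validation.md: gate_failures is a hypothetical helper, the caller supplies only the layer groups and metadata diffs that are unexpected, and the bits-per-weight estimate assumes the source stores every weight at 16 bits (BF16) and ignores scale and metadata overhead.

```python
def gate_failures(source_bytes, output_bytes, missing_groups, metadata_diffs,
                  source_bits_per_weight=16):
    """Return (size ratio, estimated bits per weight, list of failure reasons)."""
    ratio = output_bytes / source_bytes
    # Rough estimate only: scales the assumed source precision by the size ratio.
    est_bits = ratio * source_bits_per_weight
    failures = []
    if ratio >= 1.0:
        failures.append(f"output/source ratio {ratio:.2f} >= 1.0")
    if missing_groups:
        failures.append("unquantized layer groups: " + ", ".join(missing_groups))
    if metadata_diffs:
        failures.append("unexpected metadata diffs: " + ", ".join(metadata_diffs))
    return ratio, est_bits, failures
```

An empty failure list means the gate passed; anything else should be reported to the user before deployment or evaluation.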
Next steps: If the user wants to deploy or evaluate the quantized checkpoint, use the deployment or evaluation skill. The checkpoint workspace carries over. If the model required patches during PTQ (e.g., transformers upgrade), the same fixes will likely be needed at deployment and evaluation time.
Key API Rules
- mtq.register() classes must define _setup() and call it from __init__
- Call mto.enable_huggingface_checkpointing() before quantization
- Wildcard *gate* matches too broadly; use *mlp.gate* or *router*
- VLMs: hf_ptq.py auto-extracts the language model via extract_and_prepare_language_model_from_vl(), so no manual VLM handling is needed in most cases
- FP8 checkpoints: prefer _QuantFP8Linear (lazy dequant) over FineGrainedFP8Config(dequantize=True), which wastes ~2x memory. See references/unsupported-models.md for details
- Custom quantizer names must end with _input_quantizer or _weight_quantizer
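To see why the *gate* wildcard is too broad, consider the following sketch. The module names are hypothetical examples of an MoE decoder layer, and glob-style matching via fnmatch is assumed to approximate how the wildcard is applied.

```python
from fnmatch import fnmatchcase

# Hypothetical module names from one MoE decoder layer.
names = [
    "model.layers.0.mlp.gate",                 # router
    "model.layers.0.mlp.experts.0.gate_proj",  # expert projection
    "model.layers.0.mlp.experts.0.up_proj",    # expert projection
]


def matches(pattern):
    """Return the names selected by a glob-style pattern."""
    return [n for n in names if fnmatchcase(n, pattern)]


broad = matches("*gate*")       # selects the router and the expert gate_proj
narrow = matches("*mlp.gate*")  # selects the router only
```

A rule meant to keep the router unquantized would, with the broad pattern, also switch off quantization for every expert gate_proj.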
Common Pitfalls
- Model-specific dependencies: Models with trust_remote_code may import packages not in the container (e.g., mamba-ssm for hybrid Mamba models). See Step 2.5. Use the EXTRA_PIP_DEPS env var with the launcher, or install manually before running hf_ptq.py
- Transformers version: New models may need a newer version of transformers than the one installed. Check config.json for transformers_version. In containers, beware of PIP_CONSTRAINT blocking upgrades; see references/slurm-setup-ptq.md for workarounds
- Gated datasets: Some calibration datasets require HF authentication. Ensure HF_TOKEN is set in the job environment, or use --dataset cnn_dailymail as a non-gated alternative
- NFS root_squash + Docker: See skills/common/slurm-setup.md section 5
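The transformers version pitfall can be caught early with a check like the one below. It is a sketch with hypothetical helper names (version_tuple, needs_newer_transformers). Note that transformers_version in config.json records the version used to save the checkpoint rather than a strict minimum, so a True result is a hint to investigate, not proof of incompatibility.

```python
import json
import re
from importlib import metadata
from pathlib import Path


def version_tuple(version):
    """Parse the leading major.minor[.patch] numbers of a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(group or 0) for group in match.groups())


def needs_newer_transformers(model_dir, installed=None):
    """True if config.json records a newer transformers than the installed one."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    recorded = config.get("transformers_version")
    if recorded is None:
        return False  # nothing recorded, so there is nothing to compare
    if installed is None:
        # Raises PackageNotFoundError if transformers is not installed.
        installed = metadata.version("transformers")
    return version_tuple(recorded) > version_tuple(installed)
```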
References
| Reference | When to read |
|---|---|
| skills/common/environment-setup.md | Step 1: always |
| skills/common/workspace-management.md | Step 1: always |
| references/launcher-guide.md | Step 4B only (launcher path) |
| tools/launcher/CLAUDE.md | Step 4B only, if you need more launcher detail |
| references/unsupported-models.md | Step 4C only (unlisted model) |
| references/checkpoint-validation.md | Step 5: mandatory post-PTQ gate before deployment/evaluation |
| skills/common/remote-execution.md | Step 4A/4C only, if target is remote |
| skills/common/slurm-setup.md | Step 4A/4C only, if using SLURM manually (not launcher) |
| references/slurm-setup-ptq.md | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
| examples/llm_ptq/README.md | Step 3: support matrix, CLI flags, accuracy |
| modelopt/torch/quantization/config.py | Step 3: format definitions |
| modelopt/torch/export/model_utils.py | Step 4C: TRT-LLM export type mapping |
| modelopt_recipes/ | Step 3: pre-built recipes |