Name: Ptq
Author: NVIDIA

This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt.

updated: 21 May 2026, 22:16

Repository: NVIDIA/Model-Optimizer


ModelOpt Post-Training Quantization

Produce a quantized checkpoint from a pretrained model. Read examples/llm_ptq/README.md first — it has the support matrix, CLI flags, and accuracy guidance.

Step 1 — Environment

Read skills/common/environment-setup.md and skills/common/workspace-management.md. After completing them you should know:

Whether the ModelOpt source is available
Whether execution is local or remote (plus the cluster config if remote)
The execution environment: SLURM, Docker + GPU, or bare GPU
Whether the launcher is available
Which workspace to use

Step 2 — Is the model supported?

Check the support table in examples/llm_ptq/README.md for verified HF models.

Listed → supported, use hf_ptq.py (step 4A/4B)
Not listed → read references/unsupported-models.md to determine if hf_ptq.py can still work or if a custom script is needed (step 4C)

Step 2.5 — Check for model-specific dependencies

If the model uses trust_remote_code (check config.json for auto_map), inspect its custom Python files for imports not present in the container:

grep -h "^from \|^import " <model_path>/modeling_*.py | sort -u

Known dependency patterns:

| Import found | Packages to install |
| --- | --- |
| `from mamba_ssm` / `from causal_conv1d` | `mamba-ssm causal-conv1d` (Mamba/hybrid models: NemotronH, Jamba) |

If extra deps are needed:

Launcher (4B): set EXTRA_PIP_DEPS in the task's environment section — ptq.sh installs them automatically
Manual (4A): unset PIP_CONSTRAINT && pip install <deps> before running hf_ptq.py
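The dependency check in this step can also be scripted. The following is a minimal sketch, not part of ModelOpt: the helper names `uses_remote_code` and `extra_pip_deps` are hypothetical, and the `KNOWN_DEPS` mapping only mirrors the single row of the table above. Unlike the grep, it skips relative imports (`from .x import y`).

```python
import json
import re
from pathlib import Path

# Matches top-level "import x" / "from x import y" lines (absolute imports only).
_IMPORT_RE = re.compile(r"^(?:from|import)\s+([A-Za-z_]\w*)", re.MULTILINE)

# Assumption: mirrors the known-dependency table; extend as new patterns appear.
KNOWN_DEPS = {
    "mamba_ssm": ["mamba-ssm", "causal-conv1d"],
    "causal_conv1d": ["mamba-ssm", "causal-conv1d"],
}


def uses_remote_code(model_dir):
    """Return True if config.json declares an auto_map entry."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return "auto_map" in config


def extra_pip_deps(model_dir):
    """Collect pip packages implied by imports in modeling_*.py files."""
    packages = []
    for source in sorted(Path(model_dir).glob("modeling_*.py")):
        for module in _IMPORT_RE.findall(source.read_text()):
            for package in KNOWN_DEPS.get(module, []):
                if package not in packages:
                    packages.append(package)
    return packages
```

The returned list can be joined with spaces and passed as EXTRA_PIP_DEPS (4B) or to pip install (4A). Imports that are not in `KNOWN_DEPS` are ignored here, so the grep output is still worth reading by eye.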

Step 3 — Choose quantization format

First, check for a model-specific recipe:

ls modelopt_recipes/models/ 2>/dev/null

If a model-specific recipe exists, use --recipe <path> — it may contain tuned settings.

If no model-specific recipe, choose a format based on GPU (details in examples/llm_ptq/README.md):

Blackwell (B100/B200/GB200): nvfp4 variants
Hopper (H100/H200) or older: fp8 or int4_awq

Use --qformat <name> (e.g., --qformat nvfp4). Format definitions: modelopt/torch/quantization/config.py. General PTQ recipes in modelopt_recipes/general/ptq/ correspond to the same formats — --qformat is the simpler way to use them.
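As an illustration of the GPU-based choice, here is a small sketch that maps a GPU name string to candidate --qformat values. The function name `suggest_qformats` and the substring matching are assumptions made for this example. The mapping only restates the two cases above, and examples/llm_ptq/README.md remains the authority.

```python
# Assumption: Blackwell parts are recognized by these substrings of the GPU name.
BLACKWELL_MARKERS = ("B100", "B200", "GB200")


def suggest_qformats(gpu_name):
    """Return candidate --qformat values for a GPU name string."""
    name = gpu_name.upper()
    if any(marker in name for marker in BLACKWELL_MARKERS):
        return ["nvfp4"]
    # Everything else is treated as "Hopper (H100/H200) or older".
    return ["fp8", "int4_awq"]
```

The GPU name could come from `nvidia-smi --query-gpu=name --format=csv,noheader`. A model-specific recipe, when one exists, still takes precedence over this suggestion.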

Before running PTQ, sanity-check the selected qformat/recipe against the model structure. Inspect the recipe's include/exclude patterns and summarize which layer groups will be quantized and approximately how many modules/layers match (attention projections, MLP projections, experts, etc.). If the match count is 0, or far smaller than expected for the model, stop and fix the recipe or ask the user before launching calibration.
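The match-count check can be approximated before any calibration is launched. This sketch assumes the recipe's patterns behave like shell-style globs as implemented by Python's `fnmatch`, and that the module names come from something like `[n for n, _ in model.named_modules()]`. The helper names are hypothetical.

```python
from fnmatch import fnmatch


def count_matches(module_names, include, exclude=()):
    """Count modules per include pattern, skipping names that hit an exclude pattern."""
    counts = {pattern: 0 for pattern in include}
    for name in module_names:
        if any(fnmatch(name, pattern) for pattern in exclude):
            continue
        for pattern in include:
            if fnmatch(name, pattern):
                counts[pattern] += 1
    return counts


def patterns_with_no_match(counts):
    """Return include patterns that matched nothing; a non-empty result means stop."""
    return [pattern for pattern, count in counts.items() if count == 0]
```

A count that is non-zero but far below the number of layers in the model deserves the same scrutiny as a zero.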

If the source checkpoint is already quantized and the requested recipe/config reduces quantization coverage, confirm that intent with the user before running. For example, if an FP8 checkpoint is used as input and the recipe excludes some layers so they would fall back to BF16 instead of staying quantized, call out the affected layer groups and ask whether that FP8-to-BF16 fallback is intended.

NVFP4 can be calibrated on Hopper but requires Blackwell for inference.

Step 4 — Run PTQ

Goal: checkpoint on disk (.safetensors + config.json).

For listed models (4A/4B): run full calibration directly (--calib_size 512). For unlisted models (4C): run a smoke test first (--calib_size 4), wait for success, then full calibration.
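The calibration policy above can be expressed as a short command builder. This is a sketch only: `ptq_commands` is a hypothetical helper, and it uses just the four hf_ptq.py flags that appear in this document.

```python
import shlex


def ptq_commands(model, qformat, export_path, listed):
    """Return the hf_ptq.py command lines to run, in order."""
    # Unlisted models get a small smoke test (--calib_size 4) before the full run.
    calib_sizes = [512] if listed else [4, 512]
    commands = []
    for size in calib_sizes:
        args = [
            "python", "examples/llm_ptq/hf_ptq.py",
            "--pyt_ckpt_path", model,
            "--qformat", qformat,
            "--calib_size", str(size),
            "--export_path", export_path,
        ]
        commands.append(shlex.join(args))
    return commands
```

Run the commands one at a time and stop at the first failure. The full calibration should start only after the smoke test has succeeded.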

Which path?

In README table? ─→ YES ──→ SLURM (local or remote)? ──→ LAUNCHER (4B)
                  │          Local Docker + GPU? ────────→ LAUNCHER (4B)
                  │          Remote Docker (no SLURM)? ──→ MANUAL (4A)
                  │          Bare GPU (local or remote)? → MANUAL (4A)
                  │
                  └→ NOT LISTED ──→ UNLISTED MODEL (4C)
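The same decision tree, written as a function for readers who prefer code. The environment labels (`"slurm"`, `"local_docker"`, `"remote_docker"`, `"bare_gpu"`) are hypothetical names chosen for this sketch.

```python
def choose_path(listed, environment):
    """Map the decision tree above to a step label (4A, 4B, or 4C)."""
    if not listed:
        return "4C"
    if environment in ("slurm", "local_docker"):
        return "4B"
    if environment in ("remote_docker", "bare_gpu"):
        return "4A"
    raise ValueError(f"unknown environment: {environment!r}")
```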

4A — Direct: supported model, manual execution

pip install --no-build-isolation "nvidia-modelopt[hf]"
pip install -r examples/llm_ptq/requirements.txt

python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <model> \
    --qformat <format> \
    --calib_size 512 \
    --export_path <output>

Run --help for all options.

For remote: use remote_run from remote_exec.sh (see skills/common/remote-execution.md).

4B — Launcher: supported model on SLURM or local Docker

Write a YAML config using common/hf_ptq/hf_ptq.sh. See references/launcher-guide.md for the full template.

cd tools/launcher
# SLURM (remote or local):
SLURM_HOST=<host> SLURM_ACCOUNT=<acct> uv run launch.py --yaml <config.yaml> user=<ssh_user> identity=<ssh_key> --yes
# Local Docker:
uv run launch.py --yaml <config.yaml> hf_local=<hf_cache> --yes

The launcher blocks and tails logs until the job completes. If the launcher fails (missing deps, config errors), fall back to path 4A (manual execution).

4C — Unlisted model

Follow references/unsupported-models.md. It walks through investigating the model, patching ModelOpt if needed, and running hf_ptq.py. Run manually (like 4A) for easier monitoring and debugging.

For SLURM, see skills/common/slurm-setup.md and references/slurm-setup-ptq.md.

Monitoring

After job submission, register the job and set up monitoring per the monitor skill.

Step 5 — Verify output

ls -lh <output_path>/
# Expect: config.json, tokenizer files, model-*.safetensors

Report the path and size to the user.
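A scripted version of this check might look like the following sketch. `summarize_checkpoint` is a hypothetical helper; it only looks at files directly inside the output directory and does not validate their contents.

```python
from pathlib import Path


def summarize_checkpoint(output_path):
    """Report whether the expected files exist and the total size in bytes."""
    root = Path(output_path)
    files = [p for p in root.iterdir() if p.is_file()]
    return {
        "has_config": (root / "config.json").is_file(),
        "safetensors": sorted(p.name for p in files if p.suffix == ".safetensors"),
        "total_bytes": sum(p.stat().st_size for p in files),
    }
```

An empty `safetensors` list or a missing `config.json` means the export did not finish, whatever the exit code said.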

Post-quantization validation

This is a required gate before any deployment or evaluation submission. Do not submit an eval, start a serving job, or hand off the checkpoint as ready until the gate has passed.

Read references/checkpoint-validation.md and perform all three validation groups on the exact checkpoint path that will be deployed/evaluated:

Check output size and estimated bits per weight against the baseline/source checkpoint.
Check quantized-weight coverage against the requested qformat/recipe/config.
Check metadata consistency against the baseline/source model.

Report the gate result before moving on. The report must include source size, output size, output/source size ratio, layer precision counts (for example NVFP4, FP8, INT4, BF16/unquantized excluded, unexpected unquantized, declaration mismatches), and metadata diffs. If the output/source ratio is >= 1.0 for a compression recipe, if any intended layer group is missing quantization, or if metadata changed unexpectedly, stop and fix the checkpoint or ask the user before proceeding.
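The size part of the gate is plain arithmetic. The sketch below shows only that arithmetic: `size_gate` is a hypothetical helper, the bits-per-weight figure is a rough estimate that treats the whole checkpoint as weights, and references/checkpoint-validation.md defines the actual procedure.

```python
def size_gate(source_bytes, output_bytes, num_params=None):
    """Compute the output/source size ratio and flag a compression failure."""
    ratio = output_bytes / source_bytes
    # For a compression recipe, a ratio >= 1.0 means the gate fails.
    report = {"ratio": ratio, "passed": ratio < 1.0}
    if num_params:
        # Rough estimate: all bytes are counted as weight storage.
        report["bits_per_weight"] = output_bytes * 8 / num_params
    return report
```

For example, a 16 GB BF16 source that exports to 4 GB gives a ratio of 0.25. Coverage and metadata checks still have to pass separately before the gate is considered cleared.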

Next steps: If the user wants to deploy or evaluate the quantized checkpoint, use the deployment or evaluation skill. The checkpoint workspace carries over. If the model required patches during PTQ (e.g., transformers upgrade), the same fixes will likely be needed at deployment and evaluation time.

Key API Rules

mtq.register() classes must define _setup() and call it from __init__
Call mto.enable_huggingface_checkpointing() before quantization
Wildcard *gate* matches too broadly — use *mlp.gate* or *router*
VLMs: hf_ptq.py auto-extracts the language model via extract_and_prepare_language_model_from_vl() — no manual VLM handling needed in most cases
FP8 checkpoints: prefer _QuantFP8Linear (lazy dequant) over FineGrainedFP8Config(dequantize=True) which wastes ~2x memory. See references/unsupported-models.md for details
Custom quantizer names must end with _input_quantizer or _weight_quantizer
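To see why `*gate*` is too broad, consider the sketch below. It assumes glob semantics equivalent to Python's `fnmatch`, and the module names are hypothetical examples of a MoE block whose experts use a gated MLP.

```python
from fnmatch import fnmatch

# Hypothetical module names: one router and two expert projections.
names = [
    "model.layers.0.mlp.gate",                 # MoE router
    "model.layers.0.mlp.experts.0.gate_proj",  # expert projection
    "model.layers.0.mlp.experts.0.up_proj",    # expert projection
]


def matching(pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatch(name, pattern)]
```

Here `matching("*gate*")` returns both the router and `gate_proj`, while `matching("*mlp.gate*")` returns only the router. Naming differs between architectures, so confirm the pattern against the actual module list of the model being quantized.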

Common Pitfalls

Model-specific dependencies: Models with trust_remote_code may import packages not in the container (e.g., mamba-ssm for hybrid Mamba models). See Step 2.5. Use EXTRA_PIP_DEPS env var with the launcher, or install manually before running hf_ptq.py
Transformers version: New models may need a newer version of transformers than what's installed. Check config.json for transformers_version. In containers, beware of PIP_CONSTRAINT blocking upgrades — see references/slurm-setup-ptq.md for workarounds
Gated datasets: Some calibration datasets require HF authentication. Ensure HF_TOKEN is set in the job environment, or use --dataset cnn_dailymail as a non-gated alternative
NFS root_squash + Docker: See skills/common/slurm-setup.md section 5
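Two of these pitfalls can be caught before a job is submitted. The following is a minimal preflight sketch, assuming the installed transformers version is passed in as a string (for instance from `importlib.metadata.version("transformers")`). `version_tuple` and `preflight` are hypothetical helpers, and the version parsing is deliberately simple.

```python
import json
import os
from pathlib import Path


def version_tuple(version):
    """Parse the leading numeric parts of a version string, e.g. '4.57.0.dev0' -> (4, 57, 0)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def preflight(model_dir, installed_transformers, env=os.environ):
    """Return a list of warnings; an empty list means nothing was flagged."""
    warnings = []
    config = json.loads((Path(model_dir) / "config.json").read_text())
    wanted = config.get("transformers_version")
    if wanted and version_tuple(wanted) > version_tuple(installed_transformers):
        warnings.append(
            f"model was saved with transformers {wanted}, installed is {installed_transformers}"
        )
    if not env.get("HF_TOKEN"):
        warnings.append("HF_TOKEN is not set; gated calibration datasets may fail to download")
    return warnings
```

A version warning does not prove the model will fail to load; it only marks the case worth checking first when an import or config error shows up.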

References

| Reference | When to read |
| --- | --- |
| `skills/common/environment-setup.md` | Step 1: always |
| `skills/common/workspace-management.md` | Step 1: always |
| `references/launcher-guide.md` | Step 4B only (launcher path) |
| `tools/launcher/CLAUDE.md` | Step 4B only, if you need more launcher detail |
| `references/unsupported-models.md` | Step 4C only (unlisted model) |
| `references/checkpoint-validation.md` | Step 5: mandatory post-PTQ gate before deployment/evaluation |
| `skills/common/remote-execution.md` | Step 4A/4C only, if target is remote |
| `skills/common/slurm-setup.md` | Step 4A/4C only, if using SLURM manually (not launcher) |
| `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) |
| `examples/llm_ptq/README.md` | Step 3: support matrix, CLI flags, accuracy |
| `modelopt/torch/quantization/config.py` | Step 3: format definitions |
| `modelopt/torch/export/model_utils.py` | Step 4C: TRT-LLM export type mapping |
| `modelopt_recipes/` | Step 3: pre-built recipes |
