quant-recipe-search

Use when the user asks to find, search for, or optimize the best quantization recipe for a model, including direct requests like "find the best quantization recipe and generate a PTQ checkpoint." Guides the multi-candidate loop: choose compute-vs-memory success metrics, select ModelOpt recipe baselines, design AutoQuant/manual recipe deltas, interpret sensitivity, and decide next candidates. Do NOT use for a single known PTQ recipe run (use ptq), serving (use deployment), creating/running evals (use evaluation or launching-evals), monitoring jobs (use monitor), MLflow browsing (use accessing-mlflow), or comparing completed baseline-vs-candidate scores only (use compare-results).

Updated June 5, 2026 at 23:12

Source: NVIDIA/Model-Optimizer

Quant Recipe Search

Use this skill when quantization is an iterative recipe search, not a one-off PTQ run. The skill owns strategy: define success, choose the search space, sequence candidates, and decide the next iteration. It delegates checkpoint generation, serving, evaluation, monitoring, and metric comparison to the existing execution skills.

Treat a direct request such as "find the best quantization recipe and generate a PTQ checkpoint for this model" as enough to start. Recover local state first, then ask only for missing decisions that change the search.

Skill Boundaries

- Use ptq to produce and validate checkpoints.
- Use deployment to serve checkpoints and debug serving-specific flags.
- Use evaluation to create NEL configs and submit evals.
- Use launching-evals to run, resume, debug, and analyze NEL runs.
- Use monitor for active job tracking.
- Use accessing-mlflow for MLflow artifact lookup.
- Use compare-results for validated baseline-vs-candidate deltas and score-field comparability.

Do not duplicate those workflows here. This skill should leave the user with a clear recipe portfolio, success metric, experiment sequence, and next decision.

Problem

The task is to find the best recipe for a user-defined target, not merely to produce a quantized checkpoint. A generated PTQ checkpoint is only a candidate. It becomes a recommended recipe only after evaluation and comparison against the matching baseline.

Required inputs before planning candidates:

- Optimization goal: compute/throughput, memory/latency, or a custom metric.
- Primary quantization family: for example NVFP4, W4A16 NVFP4, FP8/W8A8, INT4/AWQ, or a custom mixed set.
- Benchmark set or baseline results: the user-defined acceptance surface.

If any of these are missing, ask for them. Do not silently default to FP8/W8A8 or call a checkpoint "best" before evaluation.

Default success rule: maximize the chosen performance objective while keeping each benchmark within 1 percentage point of the matching BF16/FP16 baseline. Near-threshold or noisy regressions require reruns before making a decision.
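The default success rule can be sketched as a per-benchmark check. This is a minimal illustration, not part of the skill: the function name, the `noise_pp` band used to flag near-threshold results for a rerun, and the benchmark names and scores are all hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    RERUN = "rerun"   # near the threshold: rerun before deciding
    FAIL = "fail"


def judge_candidate(baseline, candidate, max_drop_pp=1.0, noise_pp=0.3):
    """Compare candidate scores to the matching BF16/FP16 baseline.

    Scores are percentages (0-100). `max_drop_pp` is the 1 percentage
    point default from the text. `noise_pp` is an assumed band around
    that threshold inside which a result is treated as undecided.
    A benchmark missing from `candidate` raises KeyError on purpose:
    an unevaluated benchmark cannot be judged.
    """
    verdicts = {}
    for name, base_score in baseline.items():
        drop = base_score - candidate[name]  # positive means regression
        if drop <= max_drop_pp - noise_pp:
            verdicts[name] = Verdict.PASS
        elif drop <= max_drop_pp + noise_pp:
            verdicts[name] = Verdict.RERUN
        else:
            verdicts[name] = Verdict.FAIL
    return verdicts


# Made-up scores, for illustration only.
baseline = {"bench_a": 70.0, "bench_b": 80.0, "bench_c": 60.0}
candidate = {"bench_a": 69.8, "bench_b": 79.0, "bench_c": 57.0}
verdicts = judge_candidate(baseline, candidate)
```

A candidate would only be considered for promotion when every benchmark is `PASS`; any `RERUN` entry means the decision is deferred, which matches the rerun requirement above.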

Search Space

Keep the search space explicit. A candidate recipe is a tuple across these axes:

- Numeric format: FP8/W8A8, NVFP4/W4A4, W4A16 NVFP4, INT4/AWQ, or mixed formats such as NVFP4+FP8.
- Calibration/search algorithm: max calibration, MSE calibration, GPTQ, AWQ, AutoQuant scoring, and calibration dataset or sample-count variants.
- Selection method: manual/heuristic rules, sensitivity-guided manual recipes, AutoQuant selection, or a hybrid of AutoQuant plus manual overrides.
- Module family: attention, MLP, MoE experts, routers/gates, embeddings, lm_head, adapters, vision encoders, and model-specific modules.
- Runtime fusion constraints: modules fused by the inference library must use compatible quantization. Examples: vLLM Qwen linear_attn.in_proj_qkvz and fused MoE expert projections such as gate/up (w1/w3).
- Calibration budget: dataset mix, sample count, sequence length, and batch settings.

Do not collapse the search to one dimension such as numeric format only. Read references/recipe_iteration.md when choosing concrete axes or candidates.
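One way to keep the tuple explicit is to record each candidate as a small immutable record with one field per axis. The sketch below is an assumption about how that might look; the class, field names, and example values are hypothetical and are not a ModelOpt API.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class CandidateRecipe:
    """One point in the search space; one field per axis listed above."""
    numeric_format: str          # e.g. "NVFP4/W4A4"
    calibration_algorithm: str   # e.g. "max", "mse", "gptq", "awq"
    selection_method: str        # e.g. "manual", "autoquant", "hybrid"
    module_families: tuple       # families that get quantized
    fusion_constraints: tuple    # fused groups that must stay compatible
    calibration_budget: str      # free-form summary of data and sizes


def changed_axes(a, b):
    """List the axes on which two candidates differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Illustrative values only.
first = CandidateRecipe(
    numeric_format="NVFP4/W4A4",
    calibration_algorithm="max",
    selection_method="manual",
    module_families=("mlp", "moe_experts"),
    fusion_constraints=("gate/up (w1/w3)",),
    calibration_budget="placeholder: dataset mix, samples, sequence length",
)
second = replace(first, calibration_algorithm="mse")
diff = changed_axes(first, second)
```

Comparing a new candidate against its parent with `changed_axes` makes it obvious when a proposal changes several axes at once, or when every candidate in a sweep differs only in numeric format.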

Design Workflow

1. Recover state
   - Read result tables, recipe logs, AutoQuant states, sensitivity reports, and experiment notes before proposing new work.
   - Ask monitor, launching-evals, or compare-results to recover active job state and completed metrics when needed.
2. Define the target
   - Confirm the optimization goal, primary quantization family, benchmark set, accuracy-loss threshold, calibration budget, and cost metric.
   - Include quantization metadata such as scale storage in active-cost or size estimates.
3. Pick baselines and first candidates
   - Always include BF16/FP16 and a near-lossless FP8/W8A8 baseline unless FP8 itself is the target.
   - For ModelOpt work, start from modelopt_recipes: model-specific recipes first, then general PTQ presets or recipe fragments.
   - Add an AutoQuant candidate in the requested primary family when AutoQuant is available. Expect AutoQuant to find a better trade-off than a first manual recipe, but validate that assumption with the same evals.
   - Add at least one manual or sensitivity-guided candidate so AutoQuant can be compared against controlled ablations and there is a fallback if AutoQuant misses the best frontier or hits runtime constraints.
4. Generate candidates
   - Delegate checkpoint generation and PTQ validation to ptq.
   - Change one major axis at a time: format, calibration algorithm, module selection, granularity, exclusions, or calibration data.
   - Use AutoQuant for broad candidate generation and sensitivity reports; use manual recipes for controlled module-family ablations and overrides.
5. Gate before scaling
   - Validate checkpoint coverage and metadata.
   - Reject or rewrite recipes that mix quantization algorithms inside a fused runtime group.
   - If the checkpoint is valid but serving fails due to runtime support, do not reject the recipe immediately. Delegate to deployment / debug for small patches or flags, then rerun a pipe-clean check.
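The fused-group gate in the last step can be illustrated with a small check. This is a sketch under simplifying assumptions: "compatible" is reduced to "same scheme label", and the module names, group names, and scheme labels are hypothetical rather than taken from any runtime.

```python
def find_mixed_fused_groups(module_schemes, fused_groups):
    """Return fused groups whose members do not share one scheme.

    module_schemes: module name -> quantization scheme label.
    fused_groups:   group name -> list of member module names.
    Modules absent from `module_schemes` count as unquantized ("none").
    """
    mixed = {}
    for group, members in fused_groups.items():
        schemes = {module_schemes.get(member, "none") for member in members}
        if len(schemes) > 1:
            mixed[group] = sorted(schemes)
    return mixed


# Hypothetical recipe: gate/up (w1/w3) are fused at runtime but
# were assigned different schemes, so the group is flagged.
fused_groups = {
    "experts.gate_up": ["experts.w1", "experts.w3"],
    "attn.qkv": ["attn.q_proj", "attn.k_proj", "attn.v_proj"],
}
module_schemes = {
    "experts.w1": "nvfp4",
    "experts.w3": "fp8",
    "attn.q_proj": "fp8",
    "attn.k_proj": "fp8",
    "attn.v_proj": "fp8",
}
mixed = find_mixed_fused_groups(module_schemes, fused_groups)
```

An empty result means the recipe passes this particular gate; a non-empty result names the groups to rewrite before spending eval budget. Which modules a given runtime actually fuses has to come from that runtime, not from this sketch.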

Iteration Loop

1. Run cheap screen evals for every candidate that passes the gates.
2. Compare accuracy, verbosity/token usage, and active cost against baselines.
3. Rerun noisy or near-threshold results before labeling a regression.
4. Decide the next candidate:
   - Accuracy drop: protect or ablate sensitive module families, try MSE/GPTQ, or use AutoQuant sensitivity to choose overrides.
   - Poor performance/cost: quantize the next high-cost active family, adjust active-cost objective, or try a more aggressive format.
   - AutoQuant underperforms manual recipes: inspect sensitivity reports, achieved bits, excluded modules, and runtime-fusion constraints; keep the manual recipe in the portfolio instead of forcing the AutoQuant result.
   - Runtime incompatibility: rewrite around fused groups or isolate deployment support from checkpoint quality.
   - Repeated AutoQuant recipes: inspect achieved bits and recipe hashes, then adjust constraints before launching a larger sweep.
5. Promote only when compare-results shows the candidate is comparable to the baseline and satisfies the user-defined goal.
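When comparing active cost against baselines, per-block scale storage changes the effective bit width. The arithmetic below is a rough sketch that assumes active cost can be approximated by the bytes of weights touched per token. The function names are hypothetical, and the example layout (4-bit values with one 8-bit scale per 16 weights) is an assumption to verify against the actual checkpoint. Per-tensor scales and other metadata are ignored.

```python
def effective_bits_per_weight(value_bits, scale_bits=0, block_size=1):
    """Bits stored per weight once per-block scales are amortized in."""
    return value_bits + scale_bits / block_size


def active_weight_bytes(active_params, value_bits, scale_bits=0, block_size=1):
    """Approximate bytes for `active_params` weights, including scales."""
    bits = effective_bits_per_weight(value_bits, scale_bits, block_size)
    return active_params * bits / 8


# Assumed 4-bit layout: 4-bit values, 8-bit scale shared by 16 weights.
four_bit_bits = effective_bits_per_weight(4, scale_bits=8, block_size=16)

# Illustrative parameter count, not a real model.
active_params = 1_000_000_000
sixteen_bit_bytes = active_weight_bytes(active_params, 16)
four_bit_bytes = active_weight_bytes(active_params, 4, scale_bits=8, block_size=16)
ratio = sixteen_bit_bytes / four_bit_bytes
```

Under these assumptions the 4-bit format costs 4.5 bits per weight rather than 4, so the size reduction against 16-bit weights is about 3.56x instead of 4x. That gap is why scale storage belongs in the estimate.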

Maintain a recipe portfolio table with recipe name, objective, active-cost estimate, calibration notes, checkpoint path, eval/log references, accuracy, verbosity, and decision.
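A portfolio table with those columns could be kept as plain records and rendered on demand. The sketch below is one possible shape; the class, the helper, and the pipe-delimited output format are assumptions, and the row contents are placeholders rather than results.

```python
from dataclasses import dataclass, fields


@dataclass
class PortfolioRow:
    """One recipe in the portfolio; fields mirror the columns named above."""
    recipe_name: str
    objective: str
    active_cost_estimate: str
    calibration_notes: str
    checkpoint_path: str
    eval_log_refs: str
    accuracy: str
    verbosity: str
    decision: str


def render_portfolio(rows):
    """Render rows as a pipe-delimited table with a header line."""
    names = [f.name for f in fields(PortfolioRow)]
    lines = [
        "| " + " | ".join(names) + " |",
        "| " + " | ".join("---" for _ in names) + " |",
    ]
    for row in rows:
        cells = [str(getattr(row, name)) for name in names]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Placeholder row: nothing here is a measured result.
rows = [
    PortfolioRow(
        recipe_name="candidate-1",
        objective="TBD",
        active_cost_estimate="TBD",
        calibration_notes="TBD",
        checkpoint_path="TBD",
        eval_log_refs="TBD",
        accuracy="TBD",
        verbosity="TBD",
        decision="pending eval",
    )
]
table = render_portfolio(rows)
```

Keeping `decision` as an explicit column makes it harder to treat a generated checkpoint as a recommendation before its evals and comparison exist.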

References

- For recipe design, search-space details, sensitivity, and active-cost accounting, read references/recipe_iteration.md.
- For a concrete prior case study, read references/qwen36_case_study.md only when Qwen3.5/Qwen3.6 details are relevant.
