Run any Skill in Manus with one click

day0-release

Deterministic end-to-end driver for day-0 quantized-checkpoint releases — chains PTQ → evaluation → comparison with enforced gates between stages (the evaluation stage deploys the checkpoint itself), and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Use when the user asks to "release a model at day-0", "quantize and validate model X is within N% of baseline and tell me if it's publishable", or "run the full day-0 workflow". Do NOT use for single-stage requests — quantizing only (use ptq), serving only (use deployment), evaluating only (use evaluation), or comparing two existing runs (use compare-results).

Run Skill in Manus

Stars2,897

Forks433

UpdatedJune 9, 2026 at 00:11

Source

NVIDIA

NVIDIA/Model-Optimizer

View GitHub Repository View Creator Repositories

Install command

Download

Run Skill in Manus

File Explorer

6 files

SKILL.md

readonly

More from this repository

same repository

evaluation

NVIDIA/Model-Optimizer

Evaluates accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Triggers on "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "run nel". Handles deployment, config generation, and evaluation execution. Not for quantizing models (use ptq), deploying/serving models (use deployment), or comparing completed baseline-vs-quantized results (use compare-results).

2026-06-082.9k

ptq

NVIDIA/Model-Optimizer

This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt.

2026-06-082.9k

deployment

NVIDIA/Model-Optimizer

Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. Do NOT use for quantizing models (use ptq) or evaluating accuracy (use evaluation).

2026-06-072.9k

compare-results

NVIDIA/Model-Optimizer

Establish baseline-vs-candidate evaluation plans, delegate missing evaluations, compare validated results, and decide quantization feasibility. Use when the user asks to compare baseline vs quantized runs, explain an accuracy drop/regression, verify whether a quantized checkpoint is acceptable, or compare NEL/MLflow evaluation outputs. Do NOT use for generic single-model evaluation without comparison intent (use evaluation), live NEL status/debugging (use launching-evals), or generic MLflow browsing without a comparison goal (use accessing-mlflow).

2026-06-052.9k

launching-evals

NVIDIA/Model-Optimizer

Run, monitor, analyze, and debug LLM evaluations via nemo-evaluator-launcher. Covers running evaluations, checking status and live progress, debugging failed runs, exporting artifacts and logs, and analyzing results. ALWAYS triggers on mentions of running evaluations, checking progress, debugging failed evals, analyzing or analysing runs or results, run directories or artifact paths on clusters, Slurm job issues, invocation IDs, or inspecting logs (client logs, server logs, SSH to cluster, tail logs, grep logs). Do NOT use for creating or modifying evaluation configs.

2026-06-052.9k

quant-recipe-search

NVIDIA/Model-Optimizer

Use when the user asks to find, search for, or optimize the best quantization recipe for a model, including direct requests like "find the best quantization recipe and generate a PTQ checkpoint." Guides the multi-candidate loop: choose compute-vs-memory success metrics, select ModelOpt recipe baselines, design AutoQuant/manual recipe deltas, interpret sensitivity, and decide next candidates. Do NOT use for a single known PTQ recipe run (use ptq), serving (use deployment), creating/running evals (use evaluation or launching-evals), monitoring jobs (use monitor), MLflow browsing (use accessing-mlflow), or comparing completed baseline-vs-candidate scores only (use compare-results).

2026-06-052.9k

name	day0-release
description	Deterministic end-to-end driver for day-0 quantized-checkpoint releases — chains PTQ → evaluation → comparison with enforced gates between stages (the evaluation stage deploys the checkpoint itself), and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Use when the user asks to "release a model at day-0", "quantize and validate model X is within N% of baseline and tell me if it's publishable", or "run the full day-0 workflow". Do NOT use for single-stage requests — quantizing only (use ptq), serving only (use deployment), evaluating only (use evaluation), or comparing two existing runs (use compare-results).
license	Apache-2.0

Day-0 Release

Drive a model from a pretrained checkpoint to a publish decision for a quantized checkpoint, in a fixed sequence with a gate after every stage. This skill is a conductor: it sequences the existing domain skills and enforces the gates — it does not re-implement quantization, serving, evaluation, or comparison.

Goal (the default day-0 criterion): a quantized checkpoint smaller than the source, with accuracy drop within the threshold (default <1%) on the standard benchmark set versus the matching baseline, plus a publish recommendation.

When to use

Use only for the full goal-driven release. For a single stage, route to the domain skill directly: quantize → ptq, serve → deployment, evaluate → evaluation, compare two existing runs → compare-results.

Inputs

Resolve these before starting (ask the user for anything missing):

Model — HF handle or checkpoint path.
Recipe / qformat — e.g. nvfp4, fp8, or a recipe path. One candidate for v1.
Cluster / launcher — from clusters.yaml (see skills/common/environment-setup.md).
Eval set — defaults to the AA suite (evaluation/recipes/tasks/aa/).
Threshold — max accuracy drop; default 0.01 (1%).

The chain

setup ─▶ PTQ ─▶ baseline-eval ─▶ quantized-eval ─▶ compare ─▶ closeout
          │          │                │               │
       gate_ptq   gate_run         gate_run       gate_compare

The evaluation skill deploys the model it evaluates (it stands up its own endpoint per run), so there is no separate deploy stage — a serving failure surfaces through the eval stage's gate (DEPLOYMENT_HEALTH_FAILED) and triages to the deployment skill to debug serving in isolation (see Step 4).

Run each stage by invoking the domain skill, then run its gate before proceeding. Do not advance past a failed gate. Copy this checklist and track progress:

- [ ] Step 0: Resolve inputs; confirm threshold and eval set
- [ ] Step 1: Setup gate — creds present, cluster reachable
- [ ] Step 2: PTQ (ptq skill) → gate_ptq.py
- [ ] Step 3: Baseline eval (evaluation skill, deploys source) → gate_run.py   [skip if cached, see below]
- [ ] Step 4: Quantized eval (evaluation skill, deploys candidate) → gate_run.py
- [ ] Step 5: Compare (compare-results skill) → gate_compare.py → decision
- [ ] Step 6: Closeout — report + publish recommendation

Step 1 — Setup gate

Confirm credentials (skills/common/credentials.md) and cluster reachability (skills/common/remote-execution.md). If either fails, stop with SYSTEMIC — do not start PTQ.

Step 2 — PTQ

Invoke the ptq skill to produce the quantized checkpoint. Then gate:

# The ptq skill's post-PTQ validation produces a validation-summary JSON (size
# ratio + layer-precision counts + metadata diffs; see
# ptq/references/checkpoint-validation.md). v1 gates on that summary:
python .agents/skills/day0-release/scripts/gate_ptq.py --summary <validation-summary.json>
#   add `--recipe <qformat>` to override the recipe recorded in the summary

gate_ptq.py returns JSON {pass, failure_class, detail}. On pass: false, branch on failure_class (see Triage below). Do not evaluate an unvalidated checkpoint.

Step 3 — Baseline eval

The baseline is the source (pre-quantization) model on the same task set and sampling params. Look it up first — if a matching baseline run already exists in MLflow (same model, task set, sampling params), reuse it and skip this stage. Otherwise run it via the evaluation skill (which deploys the source model itself). Gate with gate_run.py.

Step 4 — Quantized eval

Invoke the evaluation skill on the quantized checkpoint, matching the baseline's task set and sampling params. The evaluation skill stands up the serving endpoint itself (it builds the deployment.command, e.g. a vllm serve …), so a serving failure surfaces here as a failed gate_run.py with DEPLOYMENT_HEALTH_FAILED. When that happens, drop to the deployment skill to reproduce and debug serving in isolation (serve the checkpoint standalone, confirm /health + one generation, iterate on flags / TP / image / env vars) rather than burning full eval cycles on a broken endpoint — then carry the working command back into NEL's deployment.command and resume the eval. If the checkpoint genuinely can't serve, POINT_INFEASIBLE. Gate:

python .agents/skills/day0-release/scripts/gate_run.py --run <run-summary.json>

A pass: false here means the run is incomplete or invalid (judge/parse error, dropped samples) — do not compare scores from it.

Step 5 — Compare

Invoke the compare-results skill to produce per-task deltas, then gate:

python .agents/skills/day0-release/scripts/gate_compare.py \
    --baseline <baseline_scores.json> --candidate <candidate_scores.json> \
    --threshold 0.01

The threshold is a fraction of each task's score scale. Most AA tasks report 0-100, but some (e.g. tau2_bench_telecom Result) report 0-1; the gate infers each task's scale (0-1 if both scores are within [0, 1], else 0-100) and normalizes the drop accordingly, so --threshold 0.01 means "≤1 pt on a 0-100 task / ≤0.01 on a 0-1 task" uniformly. Pass --scales '{"task": max}' to override inference if a task's scores happen to fall in an ambiguous range.

Decision from gate_compare.py:

ACCEPT — every task within threshold → go to Step 6.
REGRESSION — one or more tasks exceed threshold. v1 stops here and reports which tasks regressed by how much. (Picking the next recipe and re-running is deferred — see Scope.)
ANOMALOUS — scores present but implausible (e.g. baseline lower than candidate by a large margin, or a task score outside its valid range) → surface to the user.

Step 6 — Closeout

Report the decision with: source vs output size + ratio, per-task baseline / candidate / delta / within-threshold, MLflow run IDs, and a publish recommendation (publish / do-not-publish / needs-human). Archive artifacts to the workspace.

Triage (gate failure → decision)

Map a gate's failure_class to the next action:

`failure_class`	Action
`INFRA_TRANSIENT`	Retry the stage once; if it recurs, `SYSTEMIC`.
`MODEL_UNSUPPORTED`	PATCH: fix the recipe pattern / add model support (ptq skill owns the patch loop), then retry. If unpatchable, `POINT_INFEASIBLE`.
`QUANT_COVERAGE_FAILURE`	PATCH: fix the recipe wildcard so intended layers are covered; re-run PTQ.
`DEPLOYMENT_HEALTH_FAILED`	Drop to the deployment skill: reproduce serving standalone (`/health` + one generation), debug flags / image / TP / env, then carry the working command into NEL's `deployment.command` and retry the eval. If it can't serve, `POINT_INFEASIBLE`.
`EVAL_JUDGE_FAILED`	Usually transient (auth / rate limit) — wait and retry.
`SAMPLE_ACCOUNTING_FAILED`	Investigate dropped/failed samples before trusting scores.
`USER_CONFIG_ERROR`	Stop and ask the user.
`UNKNOWN`	Stop and surface to the user (`NEEDS_HUMAN`).

SYSTEMIC (cluster down, dataset unavailable) aborts the whole run. POINT_INFEASIBLE means this (model, recipe) can't work as configured.

Output

Return a decision, not a raw artifact:

ACCEPT + report + publish recommendation
REGRESSION + which tasks failed the threshold and by how much
ANOMALOUS / INFEASIBLE / NEEDS_HUMAN + reason
Always: workspace path + MLflow run IDs for traceability

Scope (v1)

In v1: the linear chain + gates + report. On REGRESSION, v1 reports and stops. Deferred to a follow-up: the evaluator-optimizer recipe loop (compare → pick the next recipe → re-run PTQ), which needs the bigpareto integration and a shared config/result schema.