compare-results

Establish baseline-vs-candidate evaluation plans, delegate missing evaluations, compare validated results, and decide quantization feasibility. Use when the user asks to compare baseline vs quantized runs, explain an accuracy drop/regression, verify whether a quantized checkpoint is acceptable, or compare NEL/MLflow evaluation outputs. Do NOT use for generic single-model evaluation without comparison intent (use evaluation), live NEL status/debugging (use launching-evals), or generic MLflow browsing without a comparison goal (use accessing-mlflow).

Stars: 2,897

Forks: 433

Updated: June 5, 2026 at 23:12

Source

NVIDIA

NVIDIA/Model-Optimizer

name	compare-results
description	Establish baseline-vs-candidate evaluation plans, delegate missing evaluations, compare validated results, and decide quantization feasibility. Use when the user asks to compare baseline vs quantized runs, explain an accuracy drop/regression, verify whether a quantized checkpoint is acceptable, or compare NEL/MLflow evaluation outputs. Do NOT use for generic single-model evaluation without comparison intent (use evaluation), live NEL status/debugging (use launching-evals), or generic MLflow browsing without a comparison goal (use accessing-mlflow).
license	Apache-2.0

Compare Results

Use this skill to plan and complete a baseline-vs-candidate comparison. The baseline is the reference checkpoint; the candidate is the checkpoint whose accuracy change is being measured, typically a further-quantized version of the baseline.

Workflow

1. Establish the candidate checkpoint/run and the matching baseline. Infer the baseline from the PTQ source model/checkpoint in the workspace or from the config used to create the candidate. If it cannot be inferred, ask the user for the baseline checkpoint or an existing baseline invocation/run path.
2. If a required baseline or candidate evaluation is missing, delegate to the evaluation skill to create, run, and verify it. The companion evaluation config should match benchmark versions, task configs, serving args, token limits, dataset setup, credentials, cluster, and container as closely as possible; change only the model/checkpoint and checkpoint-specific serving or quantization flags.
3. Fetch the baseline and candidate task lists, configs, score artifacts, and logs. If the user provides MLflow runs or invocation IDs, use the accessing-mlflow skill to fetch configs and artifacts.
4. Confirm each run passed evaluation Step 9, "Verify completed evaluation run", before comparing scores. If a run did not, validate logs, server health, judge/code-execution status, sample accounting, and reasoning parsing before computing deltas.
5. For each task, use the canonical score field from the Score Extraction section of the matching .agents/skills/evaluation/recipes/tasks/<task>.md.
6. Compute exact deltas outside the chat context when there are multiple tasks or repeated runs.
7. Report the comparability and quantization-feasibility verdicts before interpreting the delta as model quality. If the user did not provide an acceptance threshold, report feasibility as inconclusive instead of inventing one.
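
The delta and feasibility steps above can be sketched in a small script run outside the chat context. This is a minimal illustration, not the skill's own tooling: the function names, the task names, the scores, and the acceptance rule (a maximum allowed absolute drop on any single task, with higher scores assumed to be better) are all assumptions made for the example. Scores are passed as strings so that `Decimal` can subtract them exactly.

```python
from decimal import Decimal
from typing import Optional


def compute_deltas(baseline: dict[str, str], candidate: dict[str, str]) -> dict[str, Decimal]:
    """Return candidate minus baseline per task, using exact decimal arithmetic."""
    if baseline.keys() != candidate.keys():
        # Refuse to compare when the task lists differ.
        unmatched = sorted(baseline.keys() ^ candidate.keys())
        raise ValueError(f"task lists differ: {unmatched}")
    return {task: Decimal(candidate[task]) - Decimal(baseline[task]) for task in sorted(baseline)}


def feasibility_verdict(deltas: dict[str, Decimal], max_drop: Optional[Decimal]) -> str:
    """Hypothetical rule: acceptable if no task drops by more than max_drop."""
    if max_drop is None or not deltas:
        # No user-provided threshold (or nothing to compare): do not invent one.
        return "inconclusive"
    worst = min(deltas.values())
    return "acceptable" if worst >= -max_drop else "not acceptable"


if __name__ == "__main__":
    # Made-up scores for two placeholder tasks.
    deltas = compute_deltas(
        {"task_a": "0.712", "task_b": "0.845"},
        {"task_a": "0.705", "task_b": "0.851"},
    )
    for task, delta in deltas.items():
        print(f"{task}: {delta:+}")
    print(feasibility_verdict(deltas, max_drop=None))
```

Passing `max_drop=None` mirrors the rule that a missing acceptance threshold yields an inconclusive feasibility verdict rather than a guessed one.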

Comparability Checklist

Before treating a baseline-vs-quantized delta as a model quality result, verify the validated runs are comparable:

Prompt text, system prompt, chat template, and rendered messages match.
Task name, benchmark version, dataset split, container, harness, and task fragment match.
Generation settings match, including temperature, top_p, top_k, max tokens, stop strings, chat-template kwargs, reasoning mode/budget, and task-specific overrides.
Reasoning traces are enabled, disabled, parsed, stripped, or ignored consistently between runs.
The number of evaluated and scored samples/repeats matches for each task and split.
Judge-backed or simulator-backed tasks use the same judge/user model, endpoint class, prompt, and scoring config.
The same accuracy metric and score field are used for both runs.

If any item differs, either rerun with matched settings or label the result as not an apples-to-apples quantization comparison.
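
One way to mechanize part of this checklist is to diff a flat set of settings extracted from each run. The sketch below is only illustrative: the key names in `MUST_MATCH` are hypothetical (real NEL or MLflow configs may nest or name these fields differently), and treating a setting that is missing from either run as "inconclusive" is an assumption of this example, not a rule stated by the skill.

```python
from typing import Any

# Hypothetical flat setting names covering the checklist items.
MUST_MATCH = (
    "system_prompt", "chat_template", "benchmark_version", "dataset_split",
    "temperature", "top_p", "top_k", "max_tokens", "reasoning_mode",
    "num_samples", "judge_model", "score_field",
)


def comparability_verdict(
    baseline: dict[str, Any],
    candidate: dict[str, Any],
    keys: tuple[str, ...] = MUST_MATCH,
) -> tuple[str, list[str], list[str]]:
    """Return (verdict, mismatched keys, keys missing from either run)."""
    mismatched = [
        k for k in keys
        if k in baseline and k in candidate and baseline[k] != candidate[k]
    ]
    unknown = [k for k in keys if k not in baseline or k not in candidate]
    if mismatched:
        verdict = "not comparable"
    elif unknown:
        # Assumption: a setting that cannot be checked blocks a "comparable" verdict.
        verdict = "inconclusive"
    else:
        verdict = "comparable"
    return verdict, mismatched, unknown


if __name__ == "__main__":
    base = {k: "same" for k in MUST_MATCH}
    cand = dict(base, temperature=0.6)
    print(comparability_verdict(base, cand))
```

A check like this only covers settings that appear in the configs; rendered messages and reasoning-trace handling still need to be confirmed from logs or sample outputs.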

Report Format

Include:

Baseline and candidate identifiers.
Per-task metric path, baseline score, candidate score, delta, and stderr if available.
Comparability status for prompt/template, generation settings, sample counts, reasoning handling, judge/simulator setup, and score field.
Comparability verdict: comparable, not comparable, or inconclusive.
Quantization feasibility verdict: acceptable, not acceptable, or inconclusive.
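
As a rough illustration of the report fields listed above, the following sketch assembles them into plain text. The `TaskResult` class, the `render_report` function, the pipe-separated layout, and the example metric path are all hypothetical; the skill itself does not prescribe a concrete layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    task: str
    metric_path: str  # where the canonical score field lives (hypothetical path)
    baseline: float
    candidate: float
    stderr: Optional[float] = None

    @property
    def delta(self) -> float:
        return self.candidate - self.baseline


def render_report(
    baseline_id: str,
    candidate_id: str,
    results: list[TaskResult],
    comparability_status: dict[str, str],
    comparability: str,
    feasibility: str,
) -> str:
    """Render identifiers, per-task rows, comparability status, and both verdicts."""
    lines = [
        f"Baseline: {baseline_id}",
        f"Candidate: {candidate_id}",
        "",
        "task | metric path | baseline | candidate | delta | stderr",
    ]
    for r in results:
        stderr = "n/a" if r.stderr is None else f"{r.stderr:.4f}"
        lines.append(
            f"{r.task} | {r.metric_path} | {r.baseline:.4f} | "
            f"{r.candidate:.4f} | {r.delta:+.4f} | {stderr}"
        )
    lines.append("")
    lines.extend(f"{item}: {status}" for item, status in comparability_status.items())
    lines.append(f"Comparability verdict: {comparability}")
    lines.append(f"Quantization feasibility verdict: {feasibility}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up identifiers and scores for a placeholder task.
    print(render_report(
        "baseline-run", "candidate-run",
        [TaskResult("task_a", "results.task_a.accuracy", 0.712, 0.705)],
        {"generation settings": "match", "sample counts": "match"},
        "comparable", "inconclusive",
    ))
```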
