compare-results

Name: Compare Results
Author: NVIDIA

Establish baseline-vs-candidate evaluation plans, delegate missing evaluations, compare validated results, and decide quantization feasibility. Use when the user asks to compare baseline vs quantized runs, explain an accuracy drop/regression, verify whether a quantized checkpoint is acceptable, or compare NEL/MLflow evaluation outputs. Do NOT use for generic single-model evaluation without comparison intent (use evaluation), live NEL status/debugging (use launching-evals), or generic MLflow browsing without a comparison goal (use accessing-mlflow).

stars:2,749

forks:405

updated:May 21, 2026 at 22:16

SKILL.md

name	compare-results
description	Establish baseline-vs-candidate evaluation plans, delegate missing evaluations, compare validated results, and decide quantization feasibility. Use when the user asks to compare baseline vs quantized runs, explain an accuracy drop/regression, verify whether a quantized checkpoint is acceptable, or compare NEL/MLflow evaluation outputs. Do NOT use for generic single-model evaluation without comparison intent (use evaluation), live NEL status/debugging (use launching-evals), or generic MLflow browsing without a comparison goal (use accessing-mlflow).
license	Apache-2.0

Compare Results

Use this to plan and complete a baseline-vs-candidate comparison. The baseline is the reference checkpoint, and the candidate is the checkpoint whose accuracy change is being measured, typically a further quantized version of the baseline.

Workflow

1. Establish the candidate checkpoint/run and the matching baseline. Infer the baseline from the PTQ source model/checkpoint in the workspace or config used to create the candidate. If it cannot be inferred, ask the user for the baseline checkpoint or an existing baseline invocation/run path.
2. If a required baseline or candidate evaluation is missing, delegate to the evaluation skill to create, run, and verify it. The companion evaluation config should match benchmark versions, task configs, serving args, token limits, dataset setup, credentials, cluster, and container as closely as possible; change only the model/checkpoint and checkpoint-specific serving or quantization flags.
3. Fetch the baseline and candidate task list, configs, score artifacts, and logs. If the user provides MLflow runs or invocation IDs, use the accessing-mlflow skill to fetch configs and artifacts.
4. Confirm each run passed evaluation Step 9, "Verify completed evaluation run", before comparing scores. If not, validate logs, server health, judge/code-execution status, sample accounting, and reasoning parsing before computing deltas.
5. For each task, use the canonical score field from the matching .claude/skills/evaluation/recipes/tasks/<task>.md Score Extraction section.
6. Compute exact deltas outside the chat context when there are multiple tasks or repeated runs.
7. Report comparability and quantization-feasibility verdicts before interpreting the delta as model quality. If the user did not provide an acceptance threshold, report feasibility as inconclusive instead of inventing one.
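
As a rough illustration of the "compute exact deltas outside the chat context" step, the sketch below computes per-task deltas in a small script instead of by hand. It is only a sketch under stated assumptions: the function names, the dict-of-strings input shape, and the task names and scores in the usage note are all hypothetical and are not taken from NEL, MLflow, or this skill.

```python
from decimal import Decimal


def compute_deltas(baseline, candidate):
    """Return (task, baseline, candidate, delta) rows for tasks present in both runs.

    Assumption: scores arrive as decimal strings keyed by task name. They are
    parsed as Decimal so the delta is exact rather than float-rounded.
    """
    rows = []
    for task in sorted(baseline.keys() & candidate.keys()):
        b = Decimal(baseline[task])
        c = Decimal(candidate[task])
        rows.append((task, b, c, c - b))  # delta = candidate minus baseline
    return rows


def unmatched_tasks(baseline, candidate):
    """Tasks scored in only one run; these cannot be compared."""
    return sorted(baseline.keys() ^ candidate.keys())
```

For example, with made-up scores `{"task_a": "0.712"}` for the baseline and `{"task_a": "0.705", "task_b": "0.480"}` for the candidate, `compute_deltas` yields a single `task_a` row with delta `Decimal("-0.007")`, and `unmatched_tasks` returns `["task_b"]`, which would need a matching baseline evaluation before it can be reported.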

Comparability Checklist

Before treating a baseline-vs-quantized delta as a model quality result, verify the validated runs are comparable:

- Prompt text, system prompt, chat template, and rendered messages match.
- Task name, benchmark version, dataset split, container, harness, and task fragment match.
- Generation settings match, including temperature, top_p, top_k, max tokens, stop strings, chat-template kwargs, reasoning mode/budget, and task-specific overrides.
- Reasoning traces are enabled, disabled, parsed, stripped, or ignored consistently between runs.
- The number of evaluated and scored samples/repeats matches for each task and split.
- Judge-backed or simulator-backed tasks use the same judge/user model, endpoint class, prompt, and scoring config.
- The same accuracy metric and score field are used for both runs.

If any item differs, either rerun with matched settings or label the result as not an apples-to-apples quantization comparison.
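
The checklist above can be partly mechanized once both run configs are available as flat dictionaries. The following is a minimal sketch, not part of the skill: the field names in `CHECKED_FIELDS` are hypothetical placeholders for whatever keys the real configs use, and treating a field that is missing from either config as "inconclusive" is an assumption of this example.

```python
# Hypothetical field names; replace with the keys found in the real run configs.
CHECKED_FIELDS = (
    "system_prompt", "chat_template", "benchmark_version", "dataset_split",
    "temperature", "top_p", "top_k", "max_tokens", "stop",
    "reasoning_mode", "num_samples", "judge_model", "score_field",
)

_MISSING = object()  # sentinel so that an explicit None value still compares


def check_comparability(baseline_cfg, candidate_cfg, fields=CHECKED_FIELDS):
    """Return (verdict, differing_fields, unverified_fields)."""
    differing, unverified = [], []
    for field in fields:
        b = baseline_cfg.get(field, _MISSING)
        c = candidate_cfg.get(field, _MISSING)
        if b is _MISSING or c is _MISSING:
            unverified.append(field)  # cannot be confirmed either way
        elif b != c:
            differing.append(field)
    if differing:
        verdict = "not comparable"
    elif unverified:
        verdict = "inconclusive"  # assumption: unverifiable is not a pass
    else:
        verdict = "comparable"
    return verdict, differing, unverified
```

A check like this only covers settings recorded in the configs. Items such as rendered messages or reasoning-trace handling still need to be confirmed from logs and sample outputs.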

Report Format

Include:

- Baseline and candidate identifiers.
- Per-task metric path, baseline score, candidate score, delta, and stderr if available.
- Comparability status for prompt/template, generation settings, sample counts, reasoning handling, judge/simulator setup, and score field.
- Comparability verdict: comparable, not comparable, or inconclusive.
- Quantization feasibility verdict: acceptable, not acceptable, or inconclusive.
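
One possible way to assemble such a report is sketched below. Everything here is illustrative: the function names, the row keys, and the `max_allowed_drop` parameter are hypothetical, and the rule that feasibility is only decided when the runs are comparable is an assumption of this example. The one behavior taken directly from the workflow is that a missing acceptance threshold yields "inconclusive". The per-item comparability status is omitted for brevity.

```python
def feasibility_verdict(deltas, comparability, max_allowed_drop=None):
    """Decide feasibility from per-task deltas (candidate minus baseline)."""
    # No user-supplied threshold means no verdict; never invent one.
    if max_allowed_drop is None or comparability != "comparable" or not deltas:
        return "inconclusive"
    worst = min(deltas)  # most negative delta is the largest drop
    return "acceptable" if worst >= -max_allowed_drop else "not acceptable"


def render_report(baseline_id, candidate_id, rows, comparability,
                  max_allowed_drop=None):
    """Render a plain-text report; each row is a dict with task, metric_path,
    baseline, candidate, and an optional stderr."""
    lines = [f"Baseline: {baseline_id}", f"Candidate: {candidate_id}", ""]
    deltas = []
    for row in rows:
        delta = row["candidate"] - row["baseline"]
        deltas.append(delta)
        stderr = row.get("stderr")
        stderr_text = f"{stderr:.4f}" if stderr is not None else "n/a"
        lines.append(
            f"{row['task']} [{row['metric_path']}]: "
            f"baseline={row['baseline']:.4f} candidate={row['candidate']:.4f} "
            f"delta={delta:+.4f} stderr={stderr_text}"
        )
    verdict = feasibility_verdict(deltas, comparability, max_allowed_drop)
    lines += ["", f"Comparability verdict: {comparability}",
              f"Quantization feasibility verdict: {verdict}"]
    return "\n".join(lines)
```

Calling `render_report` without `max_allowed_drop` always ends with an inconclusive feasibility verdict, which mirrors the instruction not to invent an acceptance threshold.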

related-skills.json

same repository

evaluation.md

from "NVIDIA/Model-Optimizer"

Evaluates accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Triggers on "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "run nel". Handles deployment, config generation, and evaluation execution. Not for quantizing models (use ptq), deploying/serving models (use deployment), or comparing completed baseline-vs-quantized results (use compare-results).

2026-05-22 · 2.7k stars

deployment.md

from "NVIDIA/Model-Optimizer"

Serve a quantized or unquantized LLM checkpoint as an OpenAI-compatible API endpoint using vLLM, SGLang, or TRT-LLM. Use when user says "deploy model", "serve model", "start vLLM server", "launch SGLang", "TRT-LLM deploy", "AutoDeploy", "benchmark throughput", "serve checkpoint", or needs an inference endpoint from a HuggingFace or ModelOpt-quantized checkpoint. Do NOT use for quantizing models (use ptq) or evaluating accuracy (use evaluation).

2026-05-21 · 2.7k stars

launching-evals.md

from "NVIDIA/Model-Optimizer"

Run, monitor, analyze, and debug LLM evaluations via nemo-evaluator-launcher. Covers running evaluations, checking status and live progress, debugging failed runs, exporting artifacts and logs, and analyzing results. ALWAYS triggers on mentions of running evaluations, checking progress, debugging failed evals, analyzing or analysing runs or results, run directories or artifact paths on clusters, Slurm job issues, invocation IDs, or inspecting logs (client logs, server logs, SSH to cluster, tail logs, grep logs). Do NOT use for creating or modifying evaluation configs.

2026-05-21 · 2.7k stars

monitor.md

from "NVIDIA/Model-Optimizer"

Monitor submitted jobs (PTQ, evaluation, deployment) on SLURM clusters. Use when the user asks "check job status", "is my job done", "monitor my evaluation", "what's the status of the PTQ", "check on job <slurm_job_id>", or after any skill submits a long-running job. Also triggers on "nel status", "squeue", or any request to check progress of a previously submitted job.

2026-05-21 · 2.7k stars

ptq.md

from "NVIDIA/Model-Optimizer"

This skill should be used when the user asks to "quantize a model", "run PTQ", "post-training quantization", "NVFP4 quantization", "FP8 quantization", "INT8 quantization", "INT4 AWQ", "quantize LLM", "quantize MoE", "quantize VLM", or needs to produce a quantized HuggingFace or TensorRT-LLM checkpoint from a pretrained model using ModelOpt.

2026-05-21 · 2.7k stars

release-cherry-pick.md

from "NVIDIA/Model-Optimizer"

Cherry-pick merged PRs labeled for a release branch into that branch, then open a PR and apply the cherry-pick-done label. Use when asked to "cherry-pick PRs for release/X.Y.Z", "pick PRs to release branch", or "cherry-pick labeled PRs".

2026-04-27 · 2.7k stars

package.json

"author": "NVIDIA"

"repository": "NVIDIA/Model-Optimizer"
