
evaluation

Evaluates accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Triggers on "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "run nel". Handles deployment, config generation, and evaluation execution. Not for quantizing models (use ptq), deploying/serving models (use deployment), or comparing completed baseline-vs-quantized results (use compare-results).


Updated: June 8, 2026 at 20:59

Source: NVIDIA/Model-Optimizer



- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT set)
- [ ] Step 1: Check `nel` install + existing config
- [ ] Step 2: Build base config (5-question flow OR shortcut)
- [ ] Step 3: Configure deployment (model path, params, cross-check)
- [ ] Step 4: Fill remaining ??? values
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Multi-node (if needed)
- [ ] Step 7: Interceptors (if needed)
- [ ] Step 7.5: Container auth (SLURM private images)
- [ ] Step 8: Dry-run → canary → full run
- [ ] Step 9: Verify completed run

| Field | Flag |
|---|---|
| `max_position_embeddings` | `--max-model-len <value>` |
| `auto_map` exists | `--trust-remote-code` |

| Signal | Flag |
|---|---|
| `max_position_embeddings` | `--max-model-len <value>` |
| `auto_map` | `--trust-remote-code` |
| Reasoning/CoT documented | `--reasoning-parser` (and `--reasoning-parser-plugin` if custom) |
| Tool-calling documented | `--enable-auto-tool-choice --tool-call-parser <parser>` |
| Custom flags in card | Add as specified (e.g. `--mamba_ssm_cache_dtype float32`) |
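The two tables above can be read as a small mapping from observed signals to `vllm serve` flags. The following is a minimal sketch of that mapping, not part of the skill itself. The function names are hypothetical, and it assumes the `--max-model-len` value is taken directly from `max_position_embeddings`, which the tables leave as `<value>`. The parser names in the usage lines are illustrative inputs only.

```python
from typing import Any, Dict, List, Optional


def vllm_flags_from_config(config: Dict[str, Any]) -> List[str]:
    """Derive flags from a parsed config.json (hypothetical helper)."""
    flags: List[str] = []
    max_len = config.get("max_position_embeddings")
    if max_len is not None:
        # Assumption: use the field value as-is for --max-model-len.
        flags += ["--max-model-len", str(max_len)]
    if "auto_map" in config:
        # auto_map present means the model ships custom code.
        flags.append("--trust-remote-code")
    return flags


def vllm_flags_from_card(
    reasoning_parser: Optional[str] = None,
    tool_call_parser: Optional[str] = None,
    extra_flags: Optional[List[str]] = None,
) -> List[str]:
    """Derive flags from what the model card documents (hypothetical helper)."""
    flags: List[str] = []
    if reasoning_parser:
        flags += ["--reasoning-parser", reasoning_parser]
    if tool_call_parser:
        flags += ["--enable-auto-tool-choice", "--tool-call-parser", tool_call_parser]
    # Custom flags are passed through exactly as the card specifies them.
    flags += list(extra_flags or [])
    return flags


if __name__ == "__main__":
    cfg = {"max_position_embeddings": 32768, "auto_map": {"AutoConfig": "x.Y"}}
    print(vllm_flags_from_config(cfg))
    print(vllm_flags_from_card(extra_flags=["--mamba_ssm_cache_dtype", "float32"]))
```

Keeping the two functions separate mirrors the cross-check rule: `config.json` and the model card are independent sources, and neither replaces the other.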

```yaml
deployment:
  command: >-
    vllm serve /checkpoint
    --host 0.0.0.0
    --port ${deployment.port}
    --tensor-parallel-size <N>
    --data-parallel-size <M>
    --max-model-len <value>
    <... rest of cross-checked flags ...>
```

```yaml
nemo_evaluator_config:
  config:
    params:
      parallelism: ???  # Required: size per references/parallelism.md (bounded by total request count vs GPU serving capacity); ask user in Step 4 if still unclear
      request_timeout: 3600
      max_retries: 10
      max_new_tokens: 65536  # see rule below
      temperature: 1.0  # from model card (reasoning); adjust
      top_p: 0.95  # from model card (reasoning); adjust
```
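Because `???` marks a value that must be filled before the run (Step 4), it can help to list every placeholder that is still unresolved. The sketch below walks an already-parsed config (a plain nested `dict`) and reports the dotted path of each remaining `???`. The helper name `find_missing` is hypothetical and not part of NEL. The sample dict only mirrors the template above.

```python
from typing import Any, List

MISSING = "???"  # placeholder used in the config template


def find_missing(node: Any, prefix: str = "") -> List[str]:
    """Return dotted paths of all values still equal to '???'."""
    paths: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            paths.extend(find_missing(value, path))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            paths.extend(find_missing(value, f"{prefix}[{index}]"))
    elif node == MISSING:
        paths.append(prefix)
    return paths


example = {
    "nemo_evaluator_config": {
        "config": {
            "params": {
                "parallelism": "???",
                "request_timeout": 3600,
                "max_retries": 10,
                "max_new_tokens": 65536,
                "temperature": 1.0,
                "top_p": 0.95,
            }
        }
    }
}

if __name__ == "__main__":
    for path in find_missing(example):
        print(f"unresolved: {path}")
```

An empty result is a reasonable precondition for moving on to the dry-run in Step 8.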

```bash
# If the internal launcher package is installed, list its SLURM execution
# configs together with the hostname each one targets.
python3 -c 'import nemo_evaluator_launcher_internal' 2>/dev/null && \
PKG=$(python3 -c 'import nemo_evaluator_launcher_internal as m,os;print(os.path.dirname(m.__file__))') && \
for f in "$PKG"/configs/execution/internal/slurm/*.yaml; do \
  echo "$(basename "$f" .yaml) -> $(grep -E '^hostname:' "$f" | awk '{print $2}')"; done
```

| Framework | Image | Registry |
|---|---|---|
| vLLM | `vllm/vllm-openai:v0.19.1` (bump per recipe; never `:latest`) | DockerHub |
| vLLM (NVFP4 on B300/GB300) | `vllm/vllm-openai:v0.19.1-cu130` (bump to `cu130-nightly-<arch>` for new archs) | DockerHub |
| SGLang | `lmsysorg/sglang:latest` | DockerHub |
| TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:...` | NGC |
| Eval tasks | `nvcr.io/nvidia/eval-factory/*:26.03` | NGC |
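The "never `:latest`" note for the vLLM image can be checked mechanically before a run. The sketch below is a hypothetical helper, not part of the skill. It treats an image as pinned only when it carries an explicit tag other than `latest`. Note that the table states this rule for vLLM only, and that the SGLang row itself uses `:latest`, so the check is meant for the vLLM images. Digest references (`@sha256:...`) are not handled specially.

```python
def image_tag(image: str) -> str:
    """Return the tag of an image reference, or '' when none is given."""
    # Only inspect the last path segment so a registry port
    # (e.g. localhost:5000/name) is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return ""
    return last_segment.split(":", 1)[1]


def is_pinned(image: str) -> bool:
    """True when the image has an explicit tag that is not 'latest'."""
    return image_tag(image) not in ("", "latest")


if __name__ == "__main__":
    for ref in ("vllm/vllm-openai:v0.19.1", "vllm/vllm-openai:latest"):
        print(ref, "pinned" if is_pinned(ref) else "NOT pinned")
```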

```bash
cp recipes/env.example .env
set -a && source .env && set +a

# If pre_cmd/post_cmd in config (review pre_cmd first: it runs arbitrary commands):
export NEMO_EVALUATOR_TRUST_PRE_CMD=1

# If nemo_skills.* + self-deployment, for LOCAL/Docker runs only:
export DUMMY_API_KEY=dummy
# On SLURM this shell export does NOT reach the container; instead declare
# `DUMMY_API_KEY: lit:dummy` under evaluation.env_vars (see Step 5).
```

```bash
nel status <id>
nel info <id> --logs
ssh <user>@<host> "grep -i 'traceback\|exception\|error\|failed\|oom\|killed\|timeout\|unauthorized\|rate limit\|sandbox\|container\|judge\|parse\|scoring' <log_path>/*.log"
```
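When the logs have already been copied locally, the same keyword search can be run without `ssh` and `grep`. The sketch below reuses the keyword list from the `grep -i` pattern above as a case-insensitive Python regex. The function name `scan_log` is hypothetical, and the sample lines are made up for illustration rather than taken from real NEL output.

```python
import re
from typing import Iterable, List, Tuple

# Same alternatives as the grep pattern above, matched case-insensitively.
FAILURE_PATTERN = re.compile(
    r"traceback|exception|error|failed|oom|killed|timeout|unauthorized|"
    r"rate limit|sandbox|container|judge|parse|scoring",
    re.IGNORECASE,
)


def scan_log(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line matching a failure keyword."""
    hits: List[Tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        if FAILURE_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    sample = [
        "INFO run begun\n",
        "Traceback (most recent call last):\n",
        "CUDA OOM detected\n",
        "done\n",
    ]
    for number, text in scan_log(sample):
        print(f"{number}: {text}")
```

Like the `grep` call, this matches substrings, so broad keywords such as `parse` or `container` will also surface benign lines. Treat the hits as a starting point for reading the log, not as a verdict.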


name	evaluation
description	Evaluates accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Triggers on "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "run nel". Handles deployment, config generation, and evaluation execution. Not for quantizing models (use ptq), deploying/serving models (use deployment), or comparing completed baseline-vs-quantized results (use compare-results).
license	Apache-2.0


NeMo Evaluator Launcher Assistant

Workspace integration

Workflow

Step 1 — Prerequisites

Step 2 — Build base config (when not using shortcut)

Step 3 — Configure deployment

Cross-check both sources for vLLM (mandatory, neither replaces the other)

vLLM deployment command structure — single command: field

vLLM-backend defaults — always include unless the recipe contradicts

Evaluation params template (top-level params)

max_new_tokens — mandatory model-card lookup

Quantization-aware benchmark defaults

Reasoning adapter config (use_reasoning)

Step 4 — Fill remaining ??? values

Step 5 — Confirm tasks (iterative)

Step 6 — Multi-node

Step 7 — Interceptors

Step 7.5 — Container registry auth (SLURM private images only)

Step 8 — Run evaluation (gated dry-run → canary → full)

Step 9 — Verify completed run
