| name | evaluation |
| description | Evaluates accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Triggers on "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "run nel". Handles deployment, config generation, and evaluation execution. Not for quantizing models (use ptq), deploying/serving models (use deployment), or comparing completed baseline-vs-quantized results (use compare-results). |
| license | Apache-2.0 |
NeMo Evaluator Launcher Assistant
Guide the user through creating NEL YAML configs, running evaluations, and monitoring progress.
Workspace integration
If MODELOPT_WORKSPACE_ROOT is set, read skills/common/workspace-management.md and reuse existing workspaces (this skill is usually the final stage of PTQ → Deploy → Eval; carry any deployment-time patches into deployment.command).
Workflow
- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT set)
- [ ] Step 1: Check `nel` install + existing config
- [ ] Step 2: Build base config (5-question flow OR shortcut)
- [ ] Step 3: Configure deployment (model path, params, cross-check)
- [ ] Step 4: Fill remaining ??? values
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Multi-node (if needed)
- [ ] Step 7: Interceptors (if needed)
- [ ] Step 7.5: Container auth (SLURM private images)
- [ ] Step 8: Dry-run → canary → full run
- [ ] Step 9: Verify completed run
Step 1 — Prerequisites
Run nel --version; if missing, instruct pip install nemo-evaluator-launcher. If user has an existing config, skip to Step 8 (optionally review for ??? and quantization flags first).
Task recipes (always read before editing the relevant task in the config):
- AA Index v2 suite (default for quantized-checkpoint validation, see
references/quantization-benchmarks.md): recipes/tasks/aa/{gpqa_diamond,hle,lcr,scicode,ifbench,mmmu_pro,tau2_bench_telecom}.md
- Optional:
recipes/tasks/mmlu_pro.md, recipes/tasks/aime_2025.md, recipes/tasks/livecodebench.md
AA rule: If the user mentions "AA" / "Artificial Analysis", generate only tasks under recipes/tasks/aa/. Do not add MMLU-Pro, AIME 2025, or LiveCodeBench unless explicitly asked.
Shortcut path (when task list is known up front, e.g. "run AA"):
- Read the task reference file(s).
- Use
recipes/examples/example_eval.yaml as the base.
- Copy the YAML fragment(s) into
evaluation.tasks, applying any per-task notes.
- MLflow auto-export is on by default — it needs two pieces, both in
example_eval.yaml: (a) the trigger execution.auto_export.destinations: [mlflow] (without it the run is not uploaded), and (b) the export.mlflow block that configures it. In the export.mlflow block use literal values for experiment_name / description / tags — substitute the actual served_model_name and sampling params. Do not use ${deployment.*} / ${evaluation.*} cross-references: with auto-export on, NEL resolves the export block at submit time in a scope without those nodes and fails with Interpolation key '...' not found (${oc.env:USER} is fine — it's an env var). Because these literals can't interpolate, keep the temperature / top_p / max_new_tokens tags equal to the top-level params and update both in the same edit — they're the only queryable record of sampling in MLflow (NEL doesn't log them as run params), so a stale tag silently misreports the run. Fill tracking_uri in Step 4.
- Proceed to Step 3, then Step 4, then Step 7.5/8. Skip Step 2's 5-question flow.
Step 2 — Build base config (when not using shortcut)
Ask the 5 questions via AskUserQuestion (categories must match nel skills build-config --help — run that first to confirm the current option names; CLI options override this list).
- Execution: Local / SLURM
- Deployment: None (External) / vLLM / SGLang / NIM / TRT-LLM. Prefer vLLM unless the user/card says otherwise.
- Auto-export: None / MLflow / wandb
- Model type: Base / Chat / Reasoning
- Benchmarks (multi-select): standard / code / math_reasoning / safety / multilingual
Build the base:
nel skills build-config --execution <...> --deployment <...> --model_type <...> --benchmarks <...> [--export <...>] [--output <...>]
(--output omitted = cwd auto-named; directory = dir + auto-name; *.yaml = exact path. Never overwrites.)
Step 3 — Configure deployment
Model path. Checkpoint path (/, ./, ../, ~, or exists on disk) → set deployment.checkpoint_path, leave hf_model_handle: null. Else HF handle (one /, not on disk) → set deployment.hf_model_handle, leave checkpoint_path: null.
Prefer checkpoint_path over hf_model_handle on SLURM — hf_model_handle isn't reliably mounted at /checkpoint, so the deploy dies with HFValidationError. To eval an un-staged HF model, stage it first (huggingface_hub.snapshot_download) and point checkpoint_path at it. See example_eval.yaml for why.
Auto-detect ModelOpt quantization (checkpoint paths). Check config.json for quantization_config (or legacy hf_quant_config.json):
- vLLM: no
--quantization flag by default — vLLM auto-detects from quantization_config / hf_quant_config.json. Add only when the card, vLLM version, or dry-run error requires it.
- SGLang: may need
--quantization modelopt_fp8 / modelopt_fp4 / modelopt — verify against installed version.
Some models need extra vLLM backend env vars (model-card research) — e.g. VLLM_NVFP4_GEMM_BACKEND=marlin (Nemotron Super), or VLLM_USE_FLASHINFER_MOE_FP4=1 + VLLM_FLASHINFER_MOE_BACKEND=throughput (NVFP4 MoE, e.g. NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4). Put them in deployment.env_vars (not command) with the lit: prefix (VLLM_USE_FLASHINFER_MOE_FP4: lit:1); see example_eval.yaml and Step 5's prefix rule.
Auto-detect from config.json:
| Field | Flag |
|---|
max_position_embeddings | --max-model-len <value> |
auto_map exists | --trust-remote-code |
Cross-check both sources for vLLM (mandatory, neither replaces the other)
Source 1 — recipes.vllm.ai/<org>/<model> (curated vLLM recipes; authoritative for parallelism, family-specific flags like --reasoning-parser / --tool-call-parser / --mm-encoder-tp-mode, vLLM version, spec-decoding, GPU count). Pin variants via query params (e.g. ?variant=fp8&strategy=single_node_tep).
WebFetch caveat — triage the summary:
- "No
vllm serve commands found" / "page is a usage guide": JS-rendering miss. recipes.vllm.ai pages always have ≥1 command. Ask the user to paste it or share the variant URL.
- Single recipe returned for a model with known multiple variants → retry with variant-pinned URL. Axis names differ per model (Qwen:
?variant=&strategy=; Kimi: ?advanced=; others vary — no fixed pattern).
- Variant label contradicts the command (e.g. label "TEP" but command shows DP+EP) → summarizer conflated variants; ask user.
For non-trivial deployments (≥120B, multi-node, novel arch), ask the user which variant before fetching.
Source 2 — HF model card + config.json (authoritative for):
| Signal | Flag |
|---|
max_position_embeddings | --max-model-len <value> |
auto_map | --trust-remote-code |
| Reasoning/CoT documented | --reasoning-parser (and --reasoning-parser-plugin if custom) |
| Tool-calling documented | --enable-auto-tool-choice --tool-call-parser <parser> |
| Custom flags in card | Add as specified (e.g. --mamba_ssm_cache_dtype float32) |
Cross-check rules:
- Read both sources before composing the command.
- Agree → use with confidence.
- Disagree → do not silently pick one. Surface both values to the user. Common conflicts: stale cards, parser rename between generations (Qwen2.5
hermes → Qwen3 qwen3_coder), recipe-only flags like --language-model-only, ARM64-specific card notes.
- Resolve in Step 3 — don't defer to dry-run.
vLLM deployment command structure — single command: field
Rewrite the build-config output into one command: field. Move all parallelism (--tensor-parallel-size, --data-parallel-size, --pipeline-parallel-size) into the command; do not keep separate tensor_parallel_size / data_parallel_size / extra_args YAML fields.
deployment:
command: >-
vllm serve /checkpoint
--host 0.0.0.0
--port ${deployment.port}
--tensor-parallel-size <N>
--data-parallel-size <M>
--max-model-len <value>
<... rest of cross-checked flags ...>
Conventions: always start vllm serve /checkpoint (NEL mounts here); always --served-model-name ${deployment.served_model_name} (required; see example_eval.yaml for why); always --host 0.0.0.0 --port ${deployment.port}; use folded scalar (>-) for one flag per line. Example fallback --max-model-len 131072 covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer config.json / recipe value.
For how to choose --tensor-parallel-size / --data-parallel-size / --pipeline-parallel-size (and EP) from the model size and your GPU count, read references/parallelism.md — cross-check the layout against recipes.vllm.ai, then adapt to the GPUs you actually have via the fit math there.
Image / vLLM version. Default image: vllm/vllm-openai:v0.19.1 (pinned for reproducibility). If recipes.vllm.ai states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. v0.20.0) — do not stay on 0.19.1 when the recipe explicitly requires newer. Do not use :latest (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping.
NVFP4 on Blackwell B300/GB300 (sm_103): append -cu130 to the image tag (e.g. vllm/vllm-openai:v0.19.1-cu130 — release tags are multi-arch). The default cu12 build has no sm_103 FP4 kernel, so engine init dies with CUDA error: no kernel image is available. If a pinned release predates the model's arch, use cu130-nightly-<arch> (Qwen3.5-9B's qwen3_5 needed it, vLLM 0.19.2rc1.dev134). Multimodal on sm_103 may also need --mm-encoder-attn-backend TRITON_ATTN. Full note in recipes/examples/example_eval.yaml.
vLLM-backend defaults — always include unless the recipe contradicts
Silence is not contradiction. Drop/override only when the recipe sets a different value for the same setting (e.g. recipe pins --max-num-batched-tokens 16384 → use 16384).
-
--max-num-batched-tokens 8192 — caps per-step batched tokens; prevents long-prefill stalls.
-
--enable-chunked-prefill — interleaves long prefills with decode steps (required for AA-LCR's ~120K input). Modern vLLM defaults this on for many models; set explicitly to avoid drift.
-
--enable-expert-parallel — MoE-only default. Detect MoE from handle suffix (-A10B, -A3B, etc.), num_experts / num_local_experts / n_routed_experts in config.json, or card. No-op when TP=DP=1, safe to always include for MoE. Do not add for dense models. See references/parallelism.md for what EP does and the DP-attention + EP-MoE throughput pattern.
-
--max-num-seqs N — omit at generation time (top-level parallelism is ???). Add this comment above command::
# After filling in `parallelism` values (top-level + per-task overrides),
# append `--max-num-seqs N` where N = ceil(max_parallelism / data_parallel_size).
In Step 4 compute and append. Example: top-level=16, Tau2=128, DP=8 → ceil(128/8)=16. Too small → request queuing; too large → wasted KV reservation. For how to choose the parallelism it derives from, read references/parallelism.md.
Evaluation params template (top-level params)
The top-level nemo_evaluator_config.config.params must contain exactly these six fields — no top_k / presence_penalty / repetition_penalty / min_p:
nemo_evaluator_config:
config:
params:
parallelism: ???
request_timeout: 3600
max_retries: 10
max_new_tokens: 65536
temperature: 1.0
top_p: 0.95
Per-task max_new_tokens overrides are forbidden — set one top-level ceiling everywhere.
max_new_tokens — mandatory model-card lookup
- Fetch the HF model card before writing the value. Not optional.
- Scan for any
max_tokens / max_new_tokens / "output length" recommendation. Pick the highest value the card mentions (Qwen3.6: 32768 general + 81920 math-coding → use 81920). Annotate with a citing comment.
- If the card is genuinely silent after a thorough read, fall back to: 65536 (reasoning), 16384 (non-reasoning); surface the silence to the user.
- Forbidden: writing
max_new_tokens: <generic_default> with a "card not yet checked" comment. Either fetch and apply, or fetch and confirm silence.
- A higher cap doesn't fix runaway reasoning. On hard tasks (e.g. HLE) a non-terminating model just rambles to the larger cap (~80% length-capped at 131072), and the cap only helps if deployment
--max-model-len > prompt + max_new_tokens (else generation is silently clipped — AA-LCR's ~120K input leaves little room). Treat such tasks as low-confidence.
Quantization-aware benchmark defaults
For quantized checkpoints, read references/quantization-benchmarks.md for sensitivity rankings and recommended sets; present and ask which to include. Read references/model-card-research.md for the full extraction checklist (sampling, reasoning config, ARM64, pre_cmd, output length — see the dedicated bullet there).
Reasoning models: prefer reasoning mode (highest scores). For lower variance / cost / apples-to-apples vs non-reasoning baselines, also consider a non-reasoning companion run.
Reasoning adapter config (use_reasoning)
The adapter_config block in example_eval.yaml controls request/response
logging and reasoning handling. use_reasoning: true strips the model's
reasoning/CoT trace before scoring (grade only the final answer). Set per type:
- Instruct →
use_reasoning: false and drop the chat_template_kwargs
thinking block (no trace to strip; can mangle plain responses).
- Reasoning →
use_reasoning: true, especially when the deployment sets
--reasoning-parser (vLLM emits a separate reasoning channel to strip).
- Hybrid (reasoning on or off) → turn it ON (
use_reasoning: true +
force the thinking flag in chat_template_kwargs). For the exact toggle key
(it drifts across generations) and the reasoning-effort policy, see
references/model-card-research.md → "Reasoning config".
Step 4 — Fill remaining ??? values
Predefined per-cluster execution config (check FIRST). Some installs ship internal/slurm/<cluster> execution groups (optional nemo_evaluator_launcher_internal pkg) that pre-fill hostname/partition/gres — leaving only account/output_dir/walltime. Discover at runtime (nothing cluster-specific hardcoded):
python3 -c 'import nemo_evaluator_launcher_internal' 2>/dev/null && \
PKG=$(python3 -c 'import nemo_evaluator_launcher_internal as m,os;print(os.path.dirname(m.__file__))') && \
for f in "$PKG"/configs/execution/internal/slurm/*.yaml; do \
echo "$(basename "$f" .yaml) -> $(grep -E '^hostname:' "$f" | awk '{print $2}')"; done
Hostname match → set defaults: - execution: internal/slurm/<cluster>, drop the redundant execution.hostname (keep account/output_dir/walltime), verify with --dry-run. Else keep slurm/default and fill hostname/account/output_dir manually.
On SLURM, several deploy/eval failures are invisible to --dry-run and only surface at canary (mount_home, HF cache, cpu_partition, top-level vs per-stage env_vars) — read references/slurm.md.
- Find every
??? left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
parallelism — size it yourself from the run shape (total requests = dataset_size × repeats vs GPU serving capacity), and set --max-num-seqs to match. Read references/parallelism.md for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
- Ask about other defaults they may want to change (partition, walltime, MLflow tags).
execution.gres — auto-set if you used a predefined internal/slurm/<cluster> config (above). On the slurm/default fallback it's gpu:8, so set it to the node's GPU count (and match --data-parallel-size/--tensor-parallel-size) or sbatch rejects the job with "Requested node configuration is not available" (e.g. 4-GPU GB300 → gres: gpu:4; check with sinfo -o '%P %G').
Walltime cap: 4 hours. Always execution.walltime: "04:00:00". The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.
Evals that exceed 4h of wall-clock time are handled by NEL's built-in dependency-chain resume, not by shrinking the eval. NEL submits the first SLURM job; if it hits walltime, a dependent follow-on job resumes from the response/result caches the first job wrote, then queues another follow-on. Long evals continue across walltime windows automatically. See references/run-validation.md#nel-timeout-and-resume-behavior for the full mechanism.
Implications for the agent:
- Do not lower
num_repeats, split heavy tasks (AA-LCR, SciCode) into separate configs, or otherwise carve up the eval to fit inside 4h. Let NEL chain.
- Do not treat a walltime timeout as a failed run. Check
nel status / nel info and the dependent job's logs before declaring failure. references/run-validation.md covers what a real failure looks like vs an expected resume event.
- Bumping
data_parallel_size / parallelism to finish faster is fine when the goal is wall-clock latency, not a walltime workaround — but it's optional, not required, for runs longer than 4h.
Step 5 — Confirm tasks (iterative)
-
Tell user: "Run nel ls tasks for the full task list."
-
For any task with a recipes/tasks/ reference, read it and prefer its YAML fragment + repeat counts.
-
Ask about add/remove/modify. Per-task overrides under task's nemo_evaluator_config.config.params:
tasks:
- name: <task>
nemo_evaluator_config:
config:
params:
temperature: <value>
...
-
Apply, show updated list, ask "Final, or more changes?" Loop until confirmed.
Tasks that call an external judge / user-simulator / scoring endpoint. Treat this as a general pattern, not a fixed list — HLE, AA-LCR, and Tau2 need one today, but other benchmarks may too (check each task's recipe). Their model_id / url are config, not secrets: substitute the literal values the user keeps in .env (keys per the task's recipe + recipes/env.example) into the task's <VAR> placeholders. Do not emit ${oc.env:...} for these (it silently fails unless the var was exported with set -a). Only api_key stays an env-var name (e.g. INFERENCE_API_KEY), exported and read by the harness. All nemo-skills/tau2 judges + user-sims (HLE, AA-LCR, Tau2) use one INFERENCE_API_KEY against one OpenAI-compatible host — not build.nvidia.com's JUDGE_API_KEY (that's for simple-evals, e.g. AIME). Get the *_MODEL_ID/*_URL values from the eval-config skill if your org ships one, rather than guessing a host.
Known issue — nemo-skills self-deployment: If using nemo_skills.* tasks (ns_*) with self-deployment (vLLM/SGLang/NIM), you need both of these:
evaluation:
env_vars:
DUMMY_API_KEY: lit:dummy
nemo_evaluator_config:
target:
api_endpoint:
api_key_name: DUMMY_API_KEY
api_key_name only names the env var; the nemo-skills client hard-fails if that var has no value inside the eval container (ValueError: api_key_env_var=DUMMY_API_KEY but the value is not set). On SLURM, a shell export DUMMY_API_KEY=dummy (Step 8) does NOT propagate into the container — NEL only injects vars declared in env_vars. So declare DUMMY_API_KEY: lit:dummy under evaluation.env_vars (note the lit: prefix — see below). The shell export only helps for local/Docker runs. External-deployment configs already define api_key_name.
NEL env-var value prefixes (required): every value in an env_vars map needs an explicit prefix — host:VAR (read from the submitting shell's env at submit time), lit:value (literal string), or runtime:VAR (read in the job at run time). A bare value (e.g. DUMMY_API_KEY: dummy) hard-errors: "Env var value '…' must have an explicit prefix." Use lit: for constants like DUMMY_API_KEY and VLLM_* backend selectors, host: for secrets like HF_TOKEN / INFERENCE_API_KEY.
Step 6 — Multi-node
For models > ~120B or higher throughput needs, read references/multi-node.md for HAProxy multi-instance / Ray TP/PP / combined patterns.
Step 7 — Interceptors
Direct user to https://docs.nvidia.com/nemo/evaluator/latest/libraries/nemo-evaluator/interceptors/index.html. Do not provide generic interceptor info — read the specific interceptor's page if asked, then configure via evaluation.nemo_evaluator_config.target.api_endpoint.adapter_config (target is a sibling of config, not nested under it). Use the per-field syntax from the CLI Configuration section, not a full interceptors: list (that overrides the default chain).
Errata: Logging field names are max_logged_requests / max_logged_responses (NOT max_saved_* / max_* as some docs show).
Step 7.5 — Container registry auth (SLURM private images only)
Default images:
| Framework | Image | Registry |
|---|
| vLLM | vllm/vllm-openai:v0.19.1 (bump per recipe; never :latest) | DockerHub |
| vLLM (NVFP4 on B300/GB300) | vllm/vllm-openai:v0.19.1-cu130 (bump to cu130-nightly-<arch> for new archs) | DockerHub |
| SGLang | lmsysorg/sglang:latest | DockerHub |
| TRT-LLM | nvcr.io/nvidia/tensorrt-llm/release:... | NGC |
| Eval tasks | nvcr.io/nvidia/eval-factory/*:26.03 | NGC |
NVFP4 checkpoints on B300/GB300 (sm_103) need the cu130 image — cu129/v0.19.1 lack sm_103 FP4 kernels (see the "NVFP4 on Blackwell" note in Step 3).
Public images → submit without preflight. Private/restricted → check credentials:
ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
Add credentials per skills/common/slurm-setup.md §6 if missing. If you can't add, switch to a compatible public image (e.g. nvcr.io/nvidia/vllm:<YY.MM>-py3 — check catalog.ngc.nvidia.com). Do not retry more than once after an auth failure.
Step 8 — Run evaluation (gated dry-run → canary → full)
Run directly when the user asked to launch; otherwise ask before submitting.
Env setup: Copy recipes/env.example → .env, fill, source:
cp recipes/env.example .env
set -a && source .env && set +a
export NEMO_EVALUATOR_TRUST_PRE_CMD=1
export DUMMY_API_KEY=dummy
Step 8.1 — Dry-run (config validation):
nel run --config <path> --dry-run
Fix unresolved ???, bad Hydra overrides, missing env vars, invalid mounts, image issues, sbatch errors, obvious deployment errors before proceeding.
Non-fatal noise: "Failed to get manifest"/401/404, "Could not extract frame definition file", "proceeding with minimal task definition", "Found N unlisted task(s)" — expected for ns_*/recipe tasks and private (gitlab) containers; the task still runs in-container. Set NEMO_EVALUATOR_TRUST_UNLISTED_TASKS=1. Real blockers: unresolved ???, interpolation errors, bad mounts, sbatch rejections.
Step 8.2 — Canary (limited-samples, validates everything dry-run can't):
nel run --config <path> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10
Catches judge auth/rate-limits, container failures, sandbox issues, OOM, bad request formatting, low evaluated counts. Always inspect logs:
nel status <id>
nel info <id> --logs
ssh <user>@<host> "grep -i 'traceback\|exception\|error\|failed\|oom\|killed\|timeout\|unauthorized\|rate limit\|sandbox\|container\|judge\|parse\|scoring' <log_path>/*.log"
Canary each risky task class separately (judge-scored, code-execution, model-only). Start parallelism conservatively; raise only after judge/sandbox logs are clean — they bottleneck before the model. For capacity-bound runs, tune parallelism/--max-num-seqs here against vLLM's reported max concurrency + preemption — see references/parallelism.md.
Single-task rerun: nel run --config <path> -t <task_name> (combine with -o ++...limit_samples=10 for canary).
Step 8.3 — Full run (after canary passes):
nel run --config <path>
Remove limit_samples overrides; keep canary-validated parallelism. If the canary fails, fix and rerun the canary — don't skip to full.
Monitoring: Register the job per the monitor skill for cross-session tracking. One-off live status / debugging → launching-evals skill. Past-run MLflow queries → accessing-mlflow skill. NEL timeout/resume → read references/run-validation.md before treating the run as failed.
Step 9 — Verify completed run
Before pulling/reporting scores, validate the run. Read references/run-validation.md for NEL timeout/resume behavior, completed-run validation, diagnostics, score harvesting, and the handoff to compare-results for baseline-vs-candidate deltas.
Issues: https://github.com/NVIDIA-NeMo/Evaluator/issues · https://github.com/NVIDIA-NeMo/Evaluator/discussions