| name | evaluation |
| description | Evaluates accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Triggers on "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "run nel". Handles deployment, config generation, and evaluation execution. Not for quantizing models (use ptq), deploying/serving models (use deployment), or comparing completed baseline-vs-quantized results (use compare-results). |
| license | Apache-2.0 |
NeMo Evaluator Launcher Assistant
You're an expert in NeMo Evaluator Launcher! Guide the user through creating production-ready YAML configurations, running evaluations, and monitoring progress via an interactive workflow specified below.
Workspace and Pipeline Integration
If MODELOPT_WORKSPACE_ROOT is set, read skills/common/workspace-management.md. Check for existing workspaces in the current session — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications.
This skill is often the final stage of the PTQ → Deploy → Eval pipeline. If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via deployment.command.
Workflow
Config Generation Progress:
- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set)
- [ ] Step 1: Check if nel is installed and if user has existing config
- [ ] Step 2: Build the base config file
- [ ] Step 3: Configure model path and parameters
- [ ] Step 4: Fill in remaining missing values
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 7.5: Check container registry auth for private images (SLURM only)
- [ ] Step 8: Run the evaluation
- [ ] Step 8.1: Dry-run / NEL CLI config validation
- [ ] Step 8.2: Limited-samples canary
- [ ] Step 8.3: Full evaluation
- [ ] Step 9: Verify completed evaluation run
Step 1: Check prerequisites
Test that nel is installed with nel --version. If not, instruct the user to pip install nemo-evaluator-launcher.
If the user already has a config file (e.g., "run this config", "evaluate with my-config.yaml"), skip to Step 8. Optionally review it for common issues (missing ??? values, quantization flags) before running.
Shortcut: use task references. For named benchmarks, read the matching
recipes/tasks/<name>.md before creating or editing the config. Available:
mmlu_pro, mmmu_pro, gpqa, aime2025, livecodebench, ifbench, scicode,
aa_lcr, ns_hle_aa, tau2_bench_telecom.
- Read the task reference(s) the user wants.
- Use
recipes/examples/example_eval.yaml as the base config template
- Copy the selected YAML fragment(s) into
evaluation.tasks.
- Apply any notes from the reference.
- Do Step 3, Step 4, then Step 7.5/8.
Step 2: Build the base config file
Prompt the user with "I'll ask you 5 questions to build the base config we'll adjust in the next steps". Guide the user through the 5 questions using AskUserQuestion:
- Execution:
- Deployment:
- None (External)
- vLLM
- SGLang
- NIM
- TRT-LLM
Prefer vLLM for NEL self-deployment unless the user explicitly asks for another runtime, the model card requires another runtime, or the evaluation targets an already-running endpoint (deployment: none).
- Auto-export:
- None (auto-export disabled)
- MLflow
- wandb
- Model type
- Benchmarks:
Allow for multiple choices in this question.
- Standard LLM Benchmarks (like MMLU, IFEval, GSM8K, ...)
- Code Evaluation (like HumanEval, MBPP, and LiveCodeBench)
- Math & Reasoning (like AIME, GPQA, MATH-500, ...)
- Safety & Security (like Garak and Safety Harness)
- Multilingual (like MMATH, Global MMLU, MMLU-Prox)
Only accept options from the categories listed above (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.
Note: These categories come from NEL's build-config CLI. Always run nel skills build-config --help first to get the current options — they may differ from this list (e.g., chat_reasoning instead of separate chat/reasoning, general_knowledge instead of standard). When the CLI's current options differ from this list, prefer the CLI's options.
When you have all the answers, run the script to build the base config:
nel skills build-config --execution <local|slurm> --deployment <none|vllm|sglang|nim|trtllm> --model_type <base|chat|reasoning> --benchmarks <standard|code|math_reasoning|safety|multilingual> [--export <none|mlflow|wandb>] [--output <OUTPUT>]
Where --output depends on what the user provides:
- Omit: Uses current directory with auto-generated filename
- Directory: Writes to that directory with auto-generated filename
- File path (*.yaml): Writes to that specific file
It never overwrites existing files.
Step 3: Configure model path and parameters
Ask for model path. Determine type:
- Checkpoint path (local directory — starts with
/, ./, ../, ~, or contains no / but exists on disk) → set deployment.checkpoint_path: <path> and deployment.hf_model_handle: null
- HF handle (e.g.,
org/model-name — contains exactly one / and does not exist locally) → set deployment.hf_model_handle: <handle> and deployment.checkpoint_path: null
Auto-detect ModelOpt quantization format (checkpoint paths only):
Check config.json first for a quantization_config section with quant_method: "modelopt". If absent, check the legacy/backward-compatible hf_quant_config.json:
cat <checkpoint_path>/config.json 2>/dev/null
cat <checkpoint_path>/hf_quant_config.json 2>/dev/null
If ModelOpt quantization is detected, read the quantization algorithm from quantization_config.quant_algo or quantization.quant_algo.
- vLLM: Do not add a
--quantization flag by default. Recent vLLM reads quantization_config / hf_quant_config.json and selects the ModelOpt backend automatically; adding a stale or mismatched flag can cause a config mismatch. Only add an explicit flag if the model card, vLLM version, or dry-run error requires it.
- SGLang: Use SGLang-specific docs/model-card guidance. For offline ModelOpt checkpoints, recent SGLang can parse the config in many cases; if an explicit flag is required, common values are
--quantization modelopt_fp8 for FP8 and --quantization modelopt_fp4 for NVFP4. Some exported ModelOpt flows document --quantization modelopt; verify against the installed SGLang version.
If neither file contains a ModelOpt quantization config, treat the checkpoint as unquantized — no quantization flag needed.
Note: Some models require additional env vars for deployment (e.g., VLLM_NVFP4_GEMM_BACKEND=marlin for Nemotron Super). These may not be in the quantization config files — they are discovered during model card research below.
Auto-detect deployment settings from checkpoint:
Read config.json from the checkpoint (or HF model card) and build deployment.extra_args dynamically:
cat <checkpoint_path>/config.json 2>/dev/null
Field in config.json | What to set | Example |
|---|
max_position_embeddings | --max-model-len <value> | 131072 → --max-model-len 131072 |
auto_map exists | --trust-remote-code | Only add if model has custom code |
Then use WebSearch to check the model card (HuggingFace page) for deployment-specific settings:
| Model card signal | What to set |
|---|
| Reasoning model (thinking/CoT) | --reasoning-parser and --reasoning-parser-plugin if a custom parser is provided |
| Tool-calling support | --enable-auto-tool-choice --tool-call-parser <parser> |
| Custom vLLM flags documented | Add as specified (e.g., --mamba_ssm_cache_dtype float32) |
Combine all detected flags into a single deployment.extra_args override. The recipe's default --max-model-len 32768 is a fallback — always prefer the value from config.json.
Quantization-aware benchmark defaults:
When a quantized checkpoint is detected, read references/quantization-benchmarks.md for benchmark sensitivity rankings and recommended sets. Present recommendations to the user and ask which to include.
Read references/model-card-research.md for the full extraction checklist (sampling params, reasoning config, ARM64 compatibility, pre_cmd, etc.). Use WebSearch to research the model card, present findings, and ask the user to confirm.
For reasoning-capable models, prefer reasoning mode for evaluation because it usually produces the highest task scores; configure the model-card-specific on/off control and any reasoning budget or effort setting. If the user wants lower variance/noise, lower latency/cost, or an apples-to-apples comparison against non-reasoning baselines, also consider a non-reasoning companion run.
Step 4: Fill in remaining missing values
- Find all remaining
??? missing values in the config.
- Ask the user only for values that couldn't be auto-discovered from the model card (e.g., SLURM hostname, account, output directory, MLflow/wandb tracking URI). Don't propose any defaults here. Let the user give you the values in plain text.
- Ask the user if they want to change any other defaults e.g. execution partition or walltime (if running on SLURM) or add MLflow/wandb tags (if auto-export enabled).
Step 5: Confirm tasks (iterative)
Show tasks in the current config. Loop until the user confirms the task list is final:
-
Tell the user: "Run nel ls tasks to see all available tasks".
-
If the task list includes a benchmark with a reference in recipes/tasks/,
read it before editing the config and prefer its YAML fragment unless the user
asks for different settings. Keep the reference repeat counts.
-
Ask if they want to add/remove tasks or add/remove/modify task-specific parameter overrides.
To add per-task nemo_evaluator_config as specified by the user, e.g.:
tasks:
- name: <task>
nemo_evaluator_config:
config:
params:
temperature: <value>
max_new_tokens: <value>
...
-
Apply changes.
-
Show updated list and ask: "Is the task list final, or do you want to make more changes?"
Known Issues
-
NeMo-Skills workaround (self-deployment only): If using nemo_skills.* tasks with self-deployment (vLLM/SGLang/NIM), add at top level:
target:
api_endpoint:
api_key_name: DUMMY_API_KEY
For the None (External) deployment the api_key_name should be already defined. The DUMMY_API_KEY export is handled in Step 8.
Step 6: Advanced - Multi-node
If the user needs multi-node evaluation (model >120B, or more throughput), read references/multi-node.md for the configuration patterns (HAProxy multi-instance, Ray TP/PP, or combined).
Step 7: Advanced - Interceptors
- Tell the user they should see: https://docs.nvidia.com/nemo/evaluator/latest/libraries/nemo-evaluator/interceptors/index.html .
- DON'T provide any general information about what interceptors typically do in API frameworks without reading the docs. If the user asks about interceptors, only then read the webpage to provide precise information.
- If the user asks you to configure some interceptor, then read the webpage of this interceptor and configure it according to the
--overrides syntax but put the values in the YAML config under evaluation.nemo_evaluator_config.config.target.api_endpoint.adapter_config (NOT under target.api_endpoint.adapter_config) instead of using CLI overrides.
By defining interceptors list you'd override the full chain of interceptors which can have unintended consequences like disabling default interceptors. That's why use the fields specified in the CLI Configuration section after the --overrides keyword to configure interceptors in the YAML config.
Documentation Errata
- The docs may show incorrect parameter names for logging. Use
max_logged_requests and max_logged_responses (NOT max_saved_* or max_*).
Step 7.5: Check container registry authentication for private images (SLURM only)
NEL's default deployment images by framework:
| Framework | Default image | Registry |
|---|
| vLLM | vllm/vllm-openai:latest | DockerHub |
| SGLang | lmsysorg/sglang:latest | DockerHub |
| TRT-LLM | nvcr.io/nvidia/tensorrt-llm/release:... | NGC |
| Evaluation tasks | nvcr.io/nvidia/eval-factory/*:26.03 | NGC |
Before submitting, identify the exact deployment and evaluation-task images that will be pulled. If the images are public, skip the registry-authentication preflight; pyxis/enroot can pull public images without stored credentials. Do not require credentials just because the registry is DockerHub or NGC.
Only verify cluster credentials when an image is private or access-restricted (private DockerHub repo, private NGC repo, internal registry, or user-provided image that is not known to be public). See skills/common/slurm-setup.md section 6 for the credential setup procedure.
ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
Decision flow (check before submitting):
-
If the selected images are public → submit without an auth preflight
-
If any selected image is private or access-restricted → check for credentials for that image's registry (see command above)
-
If credentials exist → use the selected image and submit
-
If credentials are missing but can be added → add them (see slurm-setup.md section 6), then submit
-
If credentials cannot be added → switch to a public image when a compatible one exists, for example:
deployment:
image: nvcr.io/nvidia/vllm:<YY.MM>-py3
-
Do not retry more than once after an auth failure without fixing credentials or switching images
Step 8: Run the evaluation
Use a gated dry-run -> canary -> full-run sequence. Run the commands directly
when the user has asked you to launch evals; otherwise, ask before submitting jobs.
Do not submit the full run until the dry-run and limited-samples canary both pass.
Important: Export required environment variables based on your config. If any tokens or keys are missing, point the user to recipes/env.example — it lists all possible keys with notes on which tasks need them. Ask the user to copy it, fill in their keys, and source it:
cp recipes/env.example .env
set -a && source .env && set +a
export NEMO_EVALUATOR_TRUST_PRE_CMD=1
export DUMMY_API_KEY=dummy
Step 8.1: Dry-run / NEL CLI config validation (validates config without running):
nel run --config <config_path> --dry-run
Check the NEL output before launching anything. Fix unresolved ??? values,
bad Hydra overrides, missing env var references, invalid mounts, image/container
problems, sbatch issues, and obvious deployment argument errors before moving on.
Step 8.2: Limited-samples canary (operational validation before production):
nel run --config <config_path> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10
Use the canary to tune parallelism and catch runtime failures that the dry-run
cannot catch: judge API auth/rate-limit errors, evaluation container failures,
code-execution sandbox/container errors, vLLM health/OOM issues, bad request
formatting, log path problems, and unexpectedly low evaluated-sample counts.
Inspect logs before accepting the canary, not just result files:
nel status <canary_invocation_id>
nel info <canary_invocation_id> --logs
ssh <user>@<host> "grep -i 'traceback\|exception\|error\|failed\|oom\|killed\|timeout\|unauthorized\|rate limit\|sandbox\|container\|judge\|parse\|scoring' <log_path>/*.log"
If the benchmark set mixes different dependency profiles, canary each risky
class or task: LLM-judge tasks, code-execution tasks, and ordinary model-only
tasks can fail for different reasons. For evals that depend on inference judges
or code execution containers, start with conservative parallelism and raise it
only after the canary logs show those dependencies are healthy. Do not over-raise
parallelism just to saturate the model server; judge services and code containers
often become the bottleneck or failure point first.
Single-task rerun (useful for canary debugging or re-testing after config changes):
nel run --config <config_path> -t <task_name>
Combine with -o for limited samples: nel run --config <config_path> -t <task_name> -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10
Step 8.3: Full evaluation (production run after the canary passes):
nel run --config <config_path>
Before the full run, remove the limit_samples override and keep only the
parallelism/settings that the canary validated. If the canary fails, fix the
config, credentials, image/container, judge setup, code-execution environment, or
parallelism, then rerun the canary before launching the full evaluation.
Monitoring Progress
After job submission, register the job per the monitor skill for durable
cross-session tracking. For one-off queries (live status, debugging a failed
run, analyzing results) use the launching-evals skill; for querying past
runs in MLflow use accessing-mlflow. If a NEL job times out or resumes,
read references/run-validation.md before treating the run as failed.
Step 9: Verify completed evaluation run
Before pulling/reporting scores, validate the completed run itself. Read
references/run-validation.md for NEL timeout/resume behavior, completed-run
validation, diagnostics, score-harvesting guidance, and the handoff to
compare-results for baseline-vs-candidate deltas.
Direct users with issues to:
Now, copy this checklist and track your progress:
Config Generation Progress:
- [ ] Step 0: Check workspace (if MODELOPT_WORKSPACE_ROOT is set)
- [ ] Step 1: Check if nel is installed and if user has existing config
- [ ] Step 2: Build the base config file
- [ ] Step 3: Configure model path and parameters
- [ ] Step 4: Fill in remaining missing values
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 7.5: Check container registry auth for private images (SLURM only)
- [ ] Step 8: Run the evaluation
- [ ] Step 8.1: Dry-run / NEL CLI config validation
- [ ] Step 8.2: Limited-samples canary
- [ ] Step 8.3: Full evaluation
- [ ] Step 9: Verify completed evaluation run