| name | nemo-evaluator-plugin |
| description | Use when working on the Evaluator plugin CLI, jobs, SDK-backed specs, metric types, or plugin-owned Evaluator skills. |
| metadata | {"owner":"nemo-platform","maturity":"active"} |
| license | Apache-2.0 |
Evaluator Plugin
Use this skill for evaluation tasks against a running NeMo Platform server. The plugin-backed CLI interface is nemo evaluator; the legacy generated nemo evaluation API command group is not the target surface for new guidance.
CLI Interface
Prerequisites
- All commands in this file assume that the shell's working directory is the root of the Nvidia-NeMo/nemo-platform repo.
- Activate the Python virtual environment before invoking the nemo CLI: source .venv/bin/activate
Check plugin status from the CLI:
nemo evaluator info
Metric Types
Explore Available Metrics
To view available metric names, run:
nemo evaluator metric-types
To view a specific metric schema, pass a metric name from the metric_types list above:
nemo evaluator metric-types <metric-name>
Inspect all the registered metric schema contracts:
nemo evaluator evaluate explain
Note: nemo evaluator evaluate explain is the source of truth for the current plugin input schema. It returns a large JSON schema, so strongly prefer nemo evaluator metric-types when you only need metric names and their schemas.
Evaluation Spec
An evaluation spec is the payload passed to the CLI as input to run an evaluation.
At a high level, a spec describes:
- metrics: bundled Evaluator SDK metric configurations
- dataset: inline rows to evaluate, or a platform FilesetRef that contains the dataset
- params: optional Evaluator SDK execution parameters
- target: optional model or agent target for online evaluation
See the LLM-judge spec example at assets/specs/llm_as_judge.json.
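As a quick sanity check before running a spec, the top-level sections listed above can be inspected with a few lines of Python. This is a minimal sketch, not part of the plugin: summarize_spec and summarize_spec_file are hypothetical helper names, and treating metrics and dataset as required is an assumption based only on params and target being described as optional. The authoritative schema is still whatever nemo evaluator evaluate explain returns.

```python
import json
from pathlib import Path

# Top-level sections named in this document. The shape inside each section
# is not described here, so this helper deliberately does not inspect it.
# Assumption: sections not marked "optional" in the text are required.
REQUIRED_SECTIONS = ("metrics", "dataset")
OPTIONAL_SECTIONS = ("params", "target")


def summarize_spec(spec: dict) -> dict:
    """Report which documented top-level sections a spec contains."""
    known = REQUIRED_SECTIONS + OPTIONAL_SECTIONS
    return {
        "missing": [k for k in REQUIRED_SECTIONS if k not in spec],
        "present": [k for k in known if k in spec],
        "unrecognized": sorted(k for k in spec if k not in known),
    }


def summarize_spec_file(path: str) -> dict:
    """Load a JSON spec file and summarize its top-level sections."""
    return summarize_spec(json.loads(Path(path).read_text(encoding="utf-8")))
```

For example, summarize_spec_file("skills/nemo-evaluator-plugin/assets/specs/llm_as_judge.json") would show which of the four sections that example uses.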
Metric Bundle Payloads
The checked-in spec examples use bundled SDK metrics. The fields under metrics[*].payload are generated by bundle_metric(metric, CloudpickleMetricBundlePackager()).
To see the pattern for configuring a pre-defined SDK metric, for example ExactMatchMetric, and converting it into bundled metric JSON, inspect build_metric_bundle_example() in generate_example_specs.py and run:
uv run --frozen python skills/nemo-evaluator-plugin/scripts/generate_example_specs.py
Run Evaluations
Run Using File Spec Reference
The nemo evaluator evaluate run command saves results to a local temporary directory and prints a link to it on stdout.
Prefer the --spec-file named argument over inline shell JSON because metric bundles include serialized payloads.
Examples of various specs are provided in the assets/specs directory.
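When a spec is built programmatically, one way to follow the --spec-file recommendation is to serialize it to a file first and pass the path, which avoids shell quoting of the serialized payloads. The sketch below is illustrative only: write_spec_file and run_command are hypothetical helper names, and the only CLI surface it relies on is the nemo evaluator evaluate run --spec-file form shown in this document.

```python
import json
import shlex
import tempfile


def write_spec_file(spec: dict) -> str:
    """Serialize a spec dict to a temporary JSON file and return its path."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", delete=False, encoding="utf-8"
    ) as fh:
        json.dump(spec, fh)
        return fh.name


def run_command(spec_path: str) -> list[str]:
    """Build the argv for a local evaluation run against a spec file."""
    return ["nemo", "evaluator", "evaluate", "run", "--spec-file", spec_path]


if __name__ == "__main__":
    # Placeholder spec: real specs carry bundled metric payloads.
    path = write_spec_file({"metrics": [], "dataset": []})
    # Print a copy-pasteable command line; execute it with subprocess.run
    # (passing the list, not a shell string) if desired.
    print(shlex.join(run_command(path)))
```

Passing the argument list directly to subprocess.run, rather than a shell string, keeps the spec path intact even when it contains spaces.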
Evaluate using exact-match metric
See the spec example at assets/specs/exact_match_metric.json.
nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_metric.json
Evaluate using a benchmark metric set
nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_benchmark.json
Evaluate using LLM-Judge metric
Uses an LLM to score responses. See the spec example at assets/specs/llm_as_judge.json.
nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/llm_as_judge.json
Run Evaluation As A Durable Job
Use the nemo evaluator evaluate submit command to create a durable evaluation job. The command returns a job handler object rather than the evaluation result.
nemo evaluator evaluate submit \
--spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_metric.json
The submit response includes the generated job name in its name field, for example nemo-evaluator-zlhn1ecd. Wait for the job to complete, then list and download the job results.
nemo jobs get-status <job-name>
nemo jobs get <job-name>
nemo jobs results list <job-name>
nemo jobs results download aggregate-scores --job <job-name> --output-file aggregate-scores.json
nemo jobs results download row-scores --job <job-name> --output-file row-scores.jsonl
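Once the two files are downloaded, they can be read back with the standard library. This is a minimal sketch under two assumptions: that aggregate-scores.json is a single JSON document and that row-scores.jsonl holds one JSON object per line, as the file extensions suggest. The field names inside either file are not described here, so the helpers (hypothetical names load_json and load_jsonl) only parse and return the raw data.

```python
import json
from pathlib import Path


def load_json(path: str):
    """Parse a file containing a single JSON document."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def load_jsonl(path: str) -> list:
    """Parse a JSON Lines file, skipping blank lines."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


if __name__ == "__main__":
    # File names match the --output-file values used in the commands above.
    aggregate = load_json("aggregate-scores.json")
    rows = load_jsonl("row-scores.jsonl")
    print(f"loaded aggregate scores and {len(rows)} row score records")
```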
Python SDK Interface
The Evaluator Python SDK client is exposed as the evaluator attribute on a NeMoPlatform instance:
from nemo_platform import NeMoPlatform
platform_client = NeMoPlatform(base_url="http://localhost:8080")
status = platform_client.evaluator.plugin_status()
See examples of using the plugin SDK interface in plugin_sdk_examples.py.
Security
Never print secrets to stdout, because stdout can be collected as logs.
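One way to reduce the risk when debugging is to redact anything that looks like a credential before printing a spec or config. The sketch below is a generic pattern, not plugin functionality: redact is a hypothetical helper, and the key-name hints it matches are assumptions that should be adjusted to the fields actually in use.

```python
# Substrings that mark a key as sensitive. Hypothetical list: extend as needed.
SENSITIVE_HINTS = ("key", "token", "secret", "password")


def redact(value):
    """Return a copy of value with sensitive-looking dict entries masked."""
    if isinstance(value, dict):
        return {
            k: "***"
            if any(hint in str(k).lower() for hint in SENSITIVE_HINTS)
            else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


if __name__ == "__main__":
    config = {"base_url": "http://localhost:8080", "api_key": "not-a-real-key"}
    print(redact(config))  # the api_key value is masked before printing
```

Matching on key names is coarse: it will not catch secrets embedded in free-text values, so it complements rather than replaces keeping secrets out of printed payloads.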
Additional Resources
For LLM-judge setup notes, see LLM Judge Notes.
For evaluator API key auth, see Evaluator API Auth.
For local and cluster troubleshooting, see Evaluation Troubleshooting.