nemo-evaluator-plugin

Use when working on the Evaluator plugin CLI, jobs, SDK-backed specs, metric types, or plugin-owned Evaluator skills.

Install command

npx skills add https://github.com/NVIDIA-NeMo/nemo-platform --skill nemo-evaluator-plugin

Copy and paste this command into Claude Code to install the skill

Source

NVIDIA-NeMo/nemo-platform

Stars: 40

Forks: 2

Updated: June 3, 2026 at 22:31

SKILL.md

name	nemo-evaluator-plugin
description	Use when working on the Evaluator plugin CLI, jobs, SDK-backed specs, metric types, or plugin-owned Evaluator skills.
metadata	{"owner":"nemo-platform","maturity":"active"}
license	Apache-2.0

Evaluator Plugin

Use this skill for evaluation tasks against a running NeMo Platform server. The plugin-backed CLI interface is nemo evaluator; the legacy generated nemo evaluation API command group is not the target surface for new guidance.

CLI Interface

Prerequisites

All commands in this file assume that the shell's working directory is the root of the NVIDIA-NeMo/nemo-platform repo.
Activate the Python virtual environment before invoking the nemo CLI: source .venv/bin/activate

Check plugin status from the CLI:

nemo evaluator info

Metric Types

Explore Available Metrics

To view available metric names, run:

nemo evaluator metric-types

To view a specific metric schema, pass a metric name from the metric_types list above:

nemo evaluator metric-types <metric-name>

Inspect all the registered metric schema contracts:

nemo evaluator evaluate explain

Note: use nemo evaluator evaluate explain as the source of truth for the current plugin input schema. It returns a large JSON schema response, so strongly prefer nemo evaluator metric-types when you only need metric names and their corresponding schemas.

Evaluation Spec

An evaluation spec is the payload passed to the CLI as input to run an evaluation.

At a high level, a spec describes:

metrics: bundled Evaluator SDK metric configurations
dataset: inline rows to evaluate, or a platform FilesetRef that contains the dataset
params: optional Evaluator SDK execution parameters
target: optional model or agent target for online evaluation

See the LLM-judge spec example at assets/specs/llm_as_judge.json.
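
The top-level shape listed above can be sanity-checked before a spec is handed to the CLI. The following is a minimal sketch, not the plugin's own validation. It assumes that metrics and dataset are required because the list marks only params and target as optional, and the helper name check_spec_keys is hypothetical. The output of nemo evaluator evaluate explain remains the authoritative schema.

```python
# Assumption: metrics and dataset are required; params and target are optional.
REQUIRED_KEYS = {"metrics", "dataset"}
OPTIONAL_KEYS = {"params", "target"}


def check_spec_keys(spec: dict) -> list[str]:
    """Return human-readable notes about the spec's top-level keys.

    Hypothetical helper: it inspects key names only and knows nothing
    about the contents of each field.
    """
    present = set(spec)
    notes = []
    for key in sorted(REQUIRED_KEYS - present):
        notes.append(f"missing key: {key}")
    # Keys outside the high-level list may still be valid; this only flags them.
    for key in sorted(present - REQUIRED_KEYS - OPTIONAL_KEYS):
        notes.append(f"key not in the high-level list: {key}")
    return notes
```

An empty list means the four documented keys are used as expected. It says nothing about whether each field's contents match the schema.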

Metric Bundle Payloads

The checked-in spec examples use bundled SDK metrics. The fields under metrics[*].payload are generated by bundle_metric(metric, CloudpickleMetricBundlePackager()).

To see the pattern for configuring a pre-defined SDK metric, for example ExactMatchMetric, and converting it into bundled metric JSON, inspect build_metric_bundle_example() in generate_example_specs.py and run:

uv run --frozen python skills/nemo-evaluator-plugin/scripts/generate_example_specs.py

Run Evaluations

Run Using File Spec Reference

When you use the nemo evaluator evaluate run command, results are saved to a local temporary directory, and a link to it is printed to stdout. Prefer the --spec-file named argument over inline shell JSON because metric bundles include serialized payloads. Examples of various specs are provided in the assets/specs directory.
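
When a spec is built programmatically, one way to follow the --spec-file advice is to write the spec to a JSON file and pass the path as a separate argument, so the serialized payload never goes through shell quoting. This is a sketch under that assumption. The helper names write_spec_file and build_run_command are hypothetical, and only the command line itself comes from this document.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def write_spec_file(spec: dict, directory: str | None = None) -> Path:
    """Serialize a spec dict to a JSON file and return the file's path."""
    handle = tempfile.NamedTemporaryFile(
        mode="w", suffix=".json", dir=directory, delete=False, encoding="utf-8"
    )
    with handle:
        json.dump(spec, handle)
    return Path(handle.name)


def build_run_command(spec_path: Path) -> list[str]:
    """Build the argv for `nemo evaluator evaluate run --spec-file <path>`."""
    return ["nemo", "evaluator", "evaluate", "run", "--spec-file", str(spec_path)]
```

The returned list can be passed to subprocess.run(command, check=True) without a shell. The file is not deleted automatically, so remove it once the run finishes.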

Evaluate using `exact-match` metric

See the spec example at assets/specs/exact_match_metric.json.

nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_metric.json

Evaluate using a benchmark metric set

nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_benchmark.json

Evaluate using `LLM-Judge` metric

Uses an LLM to score responses. See the spec example at assets/specs/llm_as_judge.json.

nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/llm_as_judge.json

Run Evaluation As A Durable Job

Use the nemo evaluator evaluate submit command to create a durable evaluation job. The response is a job handler object rather than the evaluation result.

nemo evaluator evaluate submit \
  --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_metric.json

The submit response includes the generated job's name field, for example nemo-evaluator-zlhn1ecd. Wait for the job to complete, then list and download the job results.

nemo jobs get-status <job-name>
nemo jobs get <job-name>
nemo jobs results list <job-name>
nemo jobs results download aggregate-scores --job <job-name> --output-file aggregate-scores.json
nemo jobs results download row-scores --job <job-name> --output-file row-scores.jsonl
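
The "wait for the job to complete" step can be scripted as a polling loop around nemo jobs get-status. This is a minimal sketch: this document does not describe the output format of get-status, so the completion check is a caller-supplied predicate, and the names wait_for_job, get_status_output, and is_finished are hypothetical.

```python
import subprocess
import time
from typing import Callable


def get_status_output(job_name: str) -> str:
    """Run `nemo jobs get-status <job-name>` and return its stdout."""
    result = subprocess.run(
        ["nemo", "jobs", "get-status", job_name],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def wait_for_job(
    job_name: str,
    is_finished: Callable[[str], bool],
    fetch: Callable[[str], str] = get_status_output,
    interval_seconds: float = 10.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until is_finished(output) is true, then return that output.

    Assumption: the caller knows how to recognize a finished job in the
    status output; this sketch does not parse it.
    """
    for attempt in range(max_attempts):
        output = fetch(job_name)
        if is_finished(output):
            return output
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"job {job_name} not finished after {max_attempts} polls")
```

Keeping fetch and sleep injectable lets the loop be exercised without a running platform. Once it returns, the results list and download commands above can be run for the same job name.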

Python SDK Interface

The Evaluator Python SDK client is exposed as the evaluator attribute on a NeMoPlatform instance:

from nemo_platform import NeMoPlatform

platform_client = NeMoPlatform(base_url="http://localhost:8080")
status = platform_client.evaluator.plugin_status()

See examples of using the plugin SDK interface in plugin_sdk_examples.py.

Security

Do not print secrets to stdout, because stdout may be collected as logs.
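
One way to follow this rule when debugging specs or responses is to mask sensitive fields before printing them. The sketch below is illustrative only. The list of key fragments is an assumption and not a list defined by the plugin, and the helper name redact is hypothetical.

```python
# Assumption: keys containing these fragments hold secrets; adjust as needed.
SENSITIVE_FRAGMENTS = ("api_key", "token", "secret", "password")
REDACTED = "***REDACTED***"


def _is_sensitive(key) -> bool:
    """Return True when a dict key looks like it names a secret."""
    lowered = str(key).lower()
    return any(fragment in lowered for fragment in SENSITIVE_FRAGMENTS)


def redact(value):
    """Return a copy of a JSON-like value with sensitive dict values masked."""
    if isinstance(value, dict):
        return {
            key: REDACTED if _is_sensitive(key) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

Calling print(redact(payload)) instead of print(payload) keeps matching values out of captured logs. Secrets stored under keys that do not match the fragments are not caught, so this does not replace a review of what gets printed.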

Additional Resources

For LLM-judge setup notes, see LLM Judge Notes.

For evaluator API key auth, see Evaluator API Auth.

For local and cluster troubleshooting, see Evaluation Troubleshooting.
