| name | evaluate-environments |
| description | Run and analyze evaluations for verifiers environments using prime eval. Use when asked to smoke-test environments, run benchmark sweeps, resume interrupted evaluations, compare models, inspect sample-level outputs, or produce evaluation summaries suitable for deciding next steps. |
Evaluate Environments
Goal
Run reliable environment evaluations and produce actionable summaries, not raw logs.
Canonical Eval Path
- Use prime eval run as the default way to run evaluations.
- Do not add --skip-upload or other opt-out flags unless the user explicitly requests that deviation.
- Standard prime eval run invocations save results automatically, keeping them available in the user's private Evaluations tab and locally in prime eval view.
- For Prime Inference models with available pricing, eval output and saved metadata include estimated total-run USD cost automatically; no extra flags or API-key handling are needed.
Core Loop
- Run a smoke evaluation first (do not require pre-install):
prime eval run my-env -m openai/gpt-4.1-mini -n 5
- Use owner/env slug directly when evaluating Hub environments:
prime eval run owner/my-env -m openai/gpt-4.1-mini -n 5
- Scale only after smoke pass:
prime eval run owner/my-env -m openai/gpt-4.1-mini -n 200 -r 3 -s
- Treat ownerless env ids as local-first. If not found locally, rely on Prime resolution for your remote env where applicable.
- When the user asks for a "real" or "base" eval, do not substitute a tiny smoke run. Use the requested model/env and make the run size explicit before interpreting results.
- If the user says the defaults are fine or asks for no flags, use the shortest canonical command and rely on global config:
prime eval run my-env
prime eval run my-env -m openai/gpt-4.1-mini
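The smoke-then-scale loop above can be scripted so the exact command line is recorded for the report. The following is a minimal sketch, not part of the prime CLI: `build_eval_command` is a hypothetical helper, and its parameter names reflect an assumed reading of the flags shown above (`-n` as the example count, `-r` as rollouts per example, `-s` as saving results). It only builds the argument list; running it is left to the caller.

```python
from typing import List, Optional


def build_eval_command(
    env: str,
    model: Optional[str] = None,
    num_examples: Optional[int] = None,  # assumed meaning of -n
    rollouts: Optional[int] = None,      # assumed meaning of -r
    save: bool = False,                  # assumed meaning of -s
) -> List[str]:
    """Build an argv list for `prime eval run`, omitting unset flags."""
    cmd = ["prime", "eval", "run", env]
    if model is not None:
        cmd += ["-m", model]
    if num_examples is not None:
        cmd += ["-n", str(num_examples)]
    if rollouts is not None:
        cmd += ["-r", str(rollouts)]
    if save:
        cmd.append("-s")
    return cmd


# Smoke run first, then the scaled run against the same env and model.
smoke = build_eval_command("owner/my-env", "openai/gpt-4.1-mini", num_examples=5)
full = build_eval_command(
    "owner/my-env", "openai/gpt-4.1-mini", num_examples=200, rollouts=3, save=True
)
print(" ".join(smoke))
print(" ".join(full))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`, and the joined string pasted into the report as the recorded command line.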
Endpoint Shortcuts And Model Family Choice
- Encourage users to define endpoint aliases in configs/endpoints.toml so model, base URL, and key wiring stay reusable.
- Use aliases via -m <endpoint_id> instead of repeating -b and -k.
- Ask users explicitly whether they want an instruct or reasoning model before non-trivial evaluations.
- Instruct go-tos for quick behavior checks: gpt-4.1 series and qwen3 instruct series.
- Reasoning go-tos for deeper test coverage: gpt-5 series, qwen3 thinking series, and glm series.
- Example endpoint registry:
[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"
[[endpoint]]
endpoint_id = "qwen3-32b-i"
model = "qwen/qwen3-32b-instruct"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"
- Endpoint entries support optional headers (or extra_headers) for custom HTTP headers sent with inference requests:
[[endpoint]]
endpoint_id = "my-proxy"
model = "gpt-4.1-mini"
url = "https://api.example/v1"
key = "OPENAI_API_KEY"
headers = { "X-Custom-Header" = "value" }
- Endpoint entries support api_client_type when the provider is not OpenAI Chat Completions compatible. Use openai_responses for Responses-compatible endpoints and anthropic_messages for Anthropic Messages endpoints:
[[endpoint]]
endpoint_id = "gpt-responses"
model = "gpt-5.4-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"
api_client_type = "openai_responses"
Publish Gate Before Large Runs
- After smoke tests pass and results look stable, proactively suggest pushing the environment to Hub before large eval sweeps or RL work.
- Ask the user explicitly: should visibility be PUBLIC or PRIVATE?
- Push with chosen visibility:
prime env push my-env --visibility PUBLIC
or
prime env push my-env --visibility PRIVATE
- For hosted environment workflows, prefer running large jobs against the Hub slug:
prime eval run owner/my-env -m openai/gpt-4.1-mini -n 200 -r 3 -s
Prefer Config-Driven Evals Beyond Smoke Tests
- For anything beyond quick checks, nudge the user to create an eval TOML config.
- Use config files to run multiple evals in one command and keep runs reproducible:
prime eval run configs/eval/my-benchmark.toml
- Make config files the default for benchmark sweeps, multi-model comparisons, and recurring reports.
- Use name on individual [[eval]] entries when the same environment appears multiple times. id selects the environment to load; name labels the run in displays, summaries, metadata, and saved result paths.
Common Evaluation Patterns
- For single-environment v1 smoke runs, override typed taskset and harness config with dotted flags:
prime eval run my-env --taskset.difficulty hard --harness.max-turns 20
- For reproducible or multi-eval v1 config, put the same settings in TOML child sections:
[[eval]]
id = "my-env"
[eval.taskset]
difficulty = "hard"
[eval.harness]
max_turns = 20
- Override legacy/v0 constructor kwargs only when the environment still exposes them; for v1, use taskset/harness config instead:
prime eval run my-env -x '{"max_turns":20}'
- Bound per-rollout wall-clock time with the dedicated --timeout flag, which takes precedence over -x and TOML [eval.extra_env_kwargs]:
prime eval run my-env --timeout 600
- Save extra state columns:
prime eval run my-env -s -C "judge_response,parsed_answer"
- Resume interrupted runs:
prime eval run my-env -n 1000 -s --resume
- Save results to a custom output directory:
prime eval run my-env -s -o /path/to/output
- Run multi-environment TOML suites:
prime eval run configs/eval/my-benchmark.toml
- Run the same environment more than once with different args by giving each entry a name:
[[eval]]
id = "reverse-text"
name = "reverse-text-short"
[eval.args]
max_length = 32
[[eval]]
id = "reverse-text"
name = "reverse-text-long"
[eval.args]
max_length = 256
- Put generation parameters in TOML sampling sections:
[sampling]
max_tokens = 1024
temperature = 0.7
reasoning_effort = "medium"
enable_thinking = true
[[eval]]
env_id = "my-env"
Use [eval.sampling] for per-eval overrides. [sampling] is shorthand for sampling_args; reasoning_effort and enable_thinking stay top-level and are mirrored into extra_body.chat_template_kwargs.
- Pass extra HTTP headers via CLI (repeatable):
prime eval run my-env -m my-proxy --header "X-Custom-Header: value"
- Set headers in [[eval]] TOML configs as a table or list (merge order: registry row < headers table < header list / --header):
[[eval]]
env_id = "my-env"
headers = { "X-Custom-Header" = "value" }
header = ["X-Another: val"]
- Run ablation sweeps using [[ablation]] blocks in TOML configs:
[[ablation]]
env_id = "my-env"
[ablation.sweep]
temperature = [0.0, 0.5, 1.0]
[ablation.sweep.taskset]
difficulty = ["easy", "hard"]
This generates the cartesian product (6 configs in this example). Sweep v1 environment-owned settings under taskset or harness, not as root args. Use --abbreviated-summary (-A) for compact ablation results.
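To see where the count of 6 comes from, the cartesian product of the sweep above can be enumerated by hand. This is only an illustrative sketch of the counting, not how the CLI expands `[[ablation]]` blocks internally; `expand_sweep` and the dotted key `taskset.difficulty` are hypothetical names used for the example.

```python
from itertools import product
from typing import Any, Dict, List

# Sweep axes copied from the TOML example; the dotted key is illustrative only.
sweep: Dict[str, List[Any]] = {
    "temperature": [0.0, 0.5, 1.0],
    "taskset.difficulty": ["easy", "hard"],
}


def expand_sweep(axes: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one dict per combination of values across all sweep axes."""
    keys = list(axes)
    return [dict(zip(keys, combo)) for combo in product(*(axes[k] for k in keys))]


configs = expand_sweep(sweep)
print(len(configs))  # 3 temperatures x 2 difficulties = 6
for cfg in configs:
    print(cfg)
```

Because the count multiplies across axes, adding one more three-valued axis would triple the number of runs, which is worth checking before launching a sweep.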
Inspect Saved Results
- Browse locally saved runs:
prime eval view
- Check metadata.json for aggregate token usage and, when available, total-run cost.input_usd, cost.output_usd, and cost.total_usd.
- Inspect platform-visible runs when needed:
prime eval list
prime eval get <eval-id>
prime eval samples <eval-id>
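When collecting costs from several saved runs, a small reader for metadata.json keeps the "when available" case explicit. The sketch below assumes the dotted names above correspond to a nested `cost` object with `input_usd`, `output_usd`, and `total_usd` keys; that layout is an assumption and should be checked against a real file. `load_metadata` and `extract_cost` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

COST_KEYS = ("input_usd", "output_usd", "total_usd")


def load_metadata(path: str) -> Dict[str, Any]:
    """Read a saved run's metadata.json into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def extract_cost(metadata: Dict[str, Any]) -> Optional[Dict[str, float]]:
    """Return the cost fields as floats, or None when pricing is unavailable.

    Assumes a nested "cost" object; this layout is not confirmed by the CLI docs.
    """
    cost = metadata.get("cost")
    if not isinstance(cost, dict):
        return None
    if any(cost.get(key) is None for key in COST_KEYS):
        return None
    return {key: float(cost[key]) for key in COST_KEYS}


example = {"cost": {"input_usd": 0.12, "output_usd": 0.30, "total_usd": 0.42}}
print(extract_cost(example))
print(extract_cost({}))  # None: no cost recorded for this run
```

Returning `None` rather than zero avoids reporting a run without pricing data as free.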
Metrics Interpretation
- Treat binary and continuous rewards differently.
- Use pass@k-style interpretation only when rewards are effectively binary.
- For continuous rewards, focus on distribution shifts and per-task means.
- Always inspect samples before concluding regressions.
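The binary versus continuous split above can be made concrete with two small helpers. This is a sketch, not the scoring code used by the eval tooling: `pass_at_k` implements the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n rollouts of which c succeeded, and `per_task_means` is a hypothetical helper for continuous rewards. Deciding what counts as a success (for example, reward equal to 1.0) is left to the caller.

```python
from math import comb
from statistics import mean
from typing import Dict, List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n rollouts, c of them successful."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k draw contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def per_task_means(rewards: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean reward per task, for continuous rewards where pass@k does not apply."""
    return {task: mean(values) for task, values in rewards.items()}


# Binary case: 3 rollouts on a task, 1 success.
print(pass_at_k(n=3, c=1, k=1))
print(pass_at_k(n=3, c=1, k=2))

# Continuous case: compare per-task means between variants instead.
print(per_task_means({"task-a": [0.25, 0.75], "task-b": [1.0, 0.5, 0.0]}))
```

With only a handful of rollouts per task, both numbers are noisy, so they should be read alongside the samples themselves.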
Reliability Rules
- Keep environment/model/config fixed while comparing variants.
- Record exact command lines and key flags in the report.
- Call out missing credentials, endpoint mismatches, and dependency errors directly.
- Do not overinterpret tiny sample runs.
- Distinguish a completed rollout with poor reward from an environment/runtime failure.
- For timeout debugging, check the environment's own timeout behavior and the outer sandbox/eval timeout before changing reward logic.
- For repo example changes, use tests/test_envs.py -k <env> when package installability is part of the risk, not just prime eval run from the current checkout.
Output Format
Return:
- Run configuration table.
- Aggregate metrics and key deltas.
- Sample-level failure themes.
- Clear recommendation: proceed, iterate environment, or retune model/sampling.