| name | evaluate-environments |
| description | Run and analyze evaluations for verifiers environments using prime eval. Use when asked to smoke-test environments, run benchmark sweeps, resume interrupted evaluations, compare models, inspect sample-level outputs, or produce evaluation summaries suitable for deciding next steps. |
# Evaluate Environments

## Goal
Run reliable environment evaluations and produce actionable summaries, not raw logs.
## Canonical Eval Path
- Use `prime eval run` as the default way to run evaluations.
- Do not add `--skip-upload` or other opt-out flags unless the user explicitly requests that deviation.
- Standard `prime eval run` runs save results automatically, keeping them available in the user's private Evaluations tab and locally in `prime eval tui`.
## Core Loop
- Run a smoke evaluation first (do not require pre-install):
  ```bash
  prime eval run my-env -m openai/gpt-4.1-mini -n 5
  ```
- Use the owner/env slug directly when evaluating Hub environments:
  ```bash
  prime eval run owner/my-env -m openai/gpt-4.1-mini -n 5
  ```
- Scale up only after the smoke run passes:
  ```bash
  prime eval run owner/my-env -m openai/gpt-4.1-mini -n 200 -r 3 -s
  ```
- Treat ownerless env ids as local-first: if the environment is not found locally, rely on Prime's resolution of the remote environment where applicable.
## Endpoint Shortcuts And Model Family Choice
- Encourage users to define endpoint aliases in `configs/endpoints.toml` so model, base URL, and key wiring stay reusable.
- Use aliases via `-m <endpoint_id>` instead of repeating `-b` and `-k`.
- Ask users explicitly whether they want an instruct or reasoning model before non-trivial evaluations.
- Instruct go-tos for quick behavior checks: gpt-4.1 series and qwen3 instruct series.
- Reasoning go-tos for deeper test coverage: gpt-5 series, qwen3 thinking series, and glm series.
- Example endpoint registry:
  ```toml
  [[endpoint]]
  endpoint_id = "gpt-4.1-mini"
  model = "gpt-4.1-mini"
  url = "https://api.openai.com/v1"
  key = "OPENAI_API_KEY"

  [[endpoint]]
  endpoint_id = "qwen3-32b-i"
  model = "qwen/qwen3-32b-instruct"
  url = "https://api.pinference.ai/api/v1"
  key = "PRIME_API_KEY"
  ```
- Endpoint entries support optional `headers` (or `extra_headers`) for custom HTTP headers sent with inference requests:
  ```toml
  [[endpoint]]
  endpoint_id = "my-proxy"
  model = "gpt-4.1-mini"
  url = "https://api.example/v1"
  key = "OPENAI_API_KEY"
  headers = { "X-Custom-Header" = "value" }
  ```
- Endpoint entries support `api_client_type` when the provider is not OpenAI Chat Completions compatible. Use `openai_responses` for Responses-compatible endpoints and `anthropic_messages` for Anthropic Messages endpoints:
  ```toml
  [[endpoint]]
  endpoint_id = "gpt-responses"
  model = "gpt-5.4-mini"
  url = "https://api.openai.com/v1"
  key = "OPENAI_API_KEY"
  api_client_type = "openai_responses"
  ```
## Publish Gate Before Large Runs
- After smoke tests pass and results look stable, proactively suggest pushing the environment to Hub before large eval sweeps or RL work.
- Ask the user explicitly: should visibility be `PUBLIC` or `PRIVATE`?
- Push with the chosen visibility:
  ```bash
  prime env push my-env --visibility PUBLIC
  ```
  or
  ```bash
  prime env push my-env --visibility PRIVATE
  ```
- For hosted environment workflows, prefer running large jobs against the Hub slug:
  ```bash
  prime eval run owner/my-env -m openai/gpt-4.1-mini -n 200 -r 3 -s
  ```
## Prefer Config-Driven Evals Beyond Smoke Tests
- For anything beyond quick checks, nudge the user to create an eval TOML config.
- Use config files to run multiple evals in one command and keep runs reproducible (a sketch of such a config follows this list):
  ```bash
  prime eval run configs/eval/my-benchmark.toml
  ```
- Make config files the default for benchmark sweeps, multi-model comparisons, and recurring reports.
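A minimal sketch of what such a config might look like. Only `env_id`, `headers`/`header`, and `[eval.extra_env_kwargs]` appear elsewhere in this skill, so the `model` key and the second environment slug below are illustrative assumptions to verify against the prime eval documentation before relying on them:

```toml
# configs/eval/my-benchmark.toml -- illustrative sketch; verify key names against the prime eval docs

[[eval]]
env_id = "owner/my-env"        # documented key
model = "gpt-4.1-mini"         # assumed key: model or endpoint alias for this entry

[eval.extra_env_kwargs]        # documented table for constructor kwargs
max_turns = 20

[[eval]]
env_id = "owner/another-env"   # placeholder slug for a second environment in the same suite
model = "qwen3-32b-i"          # assumed key
```

Rerunning `prime eval run configs/eval/my-benchmark.toml` then reproduces the whole suite, which is what makes config-driven runs easy to compare over time.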
## Common Evaluation Patterns
- Pass args to `load_environment()`:
  ```bash
  prime eval run my-env -a '{"difficulty":"hard"}'
  ```
- Override constructor kwargs:
  ```bash
  prime eval run my-env -x '{"max_turns":20}'
  ```
- Bound per-rollout wall-clock time (use the dedicated `--timeout` flag; it wins over `-x` and TOML `[eval.extra_env_kwargs]`; a config-level sketch follows this list):
  ```bash
  prime eval run my-env --timeout 600
  ```
- Save extra state columns:
  ```bash
  prime eval run my-env -s -C "judge_response,parsed_answer"
  ```
- Resume interrupted runs:
  ```bash
  prime eval run my-env -n 1000 -s --resume
  ```
- Save results to a custom output directory:
  ```bash
  prime eval run my-env -s -o /path/to/output
  ```
- Run multi-environment TOML suites:
  ```bash
  prime eval run configs/eval/my-benchmark.toml
  ```
- Pass extra HTTP headers via the CLI (repeatable):
  ```bash
  prime eval run my-env -m my-proxy --header "X-Custom-Header: value"
  ```
- Set headers in `[[eval]]` TOML configs as a table or a list (merge order: registry row < `headers` table < `header` list / `--header`):
  ```toml
  [[eval]]
  env_id = "my-env"
  headers = { "X-Custom-Header" = "value" }
  header = ["X-Another: val"]
  ```
- Run ablation sweeps using `[[ablation]]` blocks in TOML configs:
  ```toml
  [[ablation]]
  env_id = "my-env"

  [ablation.sweep]
  temperature = [0.0, 0.5, 1.0]

  [ablation.sweep.args]
  difficulty = ["easy", "hard"]
  ```
  This generates the cartesian product of the swept values (3 temperatures × 2 difficulties = 6 configs in this example). Use `--abbreviated-summary` (`-A`) for compact ablation results.
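For the `--timeout` precedence noted earlier in this list, a hedged illustration: the `[eval.extra_env_kwargs]` table and the `--timeout` flag are documented above, but the `timeout` key inside the table is an assumption about the environment's constructor kwargs, so substitute whatever kwarg the environment actually exposes:

```toml
[[eval]]
env_id = "my-env"

[eval.extra_env_kwargs]
timeout = 300   # assumed constructor kwarg name; illustrative only
```

Passing `--timeout 600` on the command line would then bound rollouts at 600 seconds, since the dedicated flag wins over both `-x` and this TOML value.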
## Inspect Saved Results
- Browse locally saved runs:
  ```bash
  prime eval tui
  ```
- Inspect platform-visible runs when needed:
  ```bash
  prime eval list
  prime eval get <eval-id>
  prime eval samples <eval-id>
  ```
## Metrics Interpretation
- Treat binary and continuous rewards differently.
- Use pass@k-style interpretation only when rewards are effectively binary (for example, with 4 rollouts per example, pass@1 is the mean success rate across all rollouts, while pass@4 is the fraction of examples solved at least once).
- For continuous rewards, focus on distribution shifts and per-task means.
- Always inspect sample-level outputs before concluding that a regression exists.
## Reliability Rules
- Keep environment/model/config fixed while comparing variants.
- Record exact command lines and key flags in the report.
- Call out missing credentials, endpoint mismatches, and dependency errors directly.
- Do not overinterpret tiny sample runs.
## Output Format
Return:
- Run configuration table.
- Aggregate metrics and key deltas.
- Sample-level failure themes.
- Clear recommendation: proceed, iterate environment, or retune model/sampling.