| name | evaluate-environments |
| description | Run and analyze evaluations for verifiers environments using prime eval. Use when asked to smoke-test environments, run benchmark sweeps, resume interrupted evaluations, compare models, inspect sample-level outputs, or produce evaluation summaries suitable for deciding next steps. |
# Evaluate Environments

## Goal
Run reliable environment evaluations and produce actionable summaries, not raw logs.
## Canonical Eval Path
- Use `prime eval run` as the default way to run evaluations.
- Do not add `--skip-upload` or other opt-out flags unless the user explicitly requests that deviation.
- Standard `prime eval run` runs save results automatically, keeping them available in the user's private Evaluations tab and locally in `prime eval tui`.
## Core Loop
- Run a smoke evaluation first (do not require pre-install):
  ```bash
  prime eval run my-env -m openai/gpt-4.1-mini -n 5
  ```
- Use the owner/env slug directly when evaluating Hub environments:
  ```bash
  prime eval run owner/my-env -m openai/gpt-4.1-mini -n 5
  ```
- Scale up only after the smoke run passes:
  ```bash
  prime eval run owner/my-env -m openai/gpt-4.1-mini -n 200 -r 3 -s
  ```
- Treat ownerless env ids as local-first: if the environment is not found locally, rely on Prime's resolution of the remote environment where applicable.
## Endpoint Shortcuts And Model Family Choice
- Encourage users to define endpoint aliases in `configs/endpoints.toml` so model, base URL, and key wiring stay reusable.
- Use aliases via `-m <endpoint_id>` instead of repeating `-b` and `-k`.
- Ask users explicitly whether they want an instruct or reasoning model before non-trivial evaluations.
- Instruct go-tos for quick behavior checks: gpt-4.1 series and qwen3 instruct series.
- Reasoning go-tos for deeper test coverage: gpt-5 series, qwen3 thinking series, and glm series.
- Example endpoint registry:
  ```toml
  [[endpoint]]
  endpoint_id = "gpt-4.1-mini"
  model = "gpt-4.1-mini"
  url = "https://api.openai.com/v1"
  key = "OPENAI_API_KEY"

  [[endpoint]]
  endpoint_id = "qwen3-32b-i"
  model = "qwen/qwen3-32b-instruct"
  url = "https://api.pinference.ai/api/v1"
  key = "PRIME_API_KEY"
  ```
- Endpoint entries support optional `headers` (or `extra_headers`) for custom HTTP headers sent with inference requests:
  ```toml
  [[endpoint]]
  endpoint_id = "my-proxy"
  model = "gpt-4.1-mini"
  url = "https://api.example/v1"
  key = "OPENAI_API_KEY"
  headers = { "X-Custom-Header" = "value" }
  ```
- Endpoint entries support `api_client_type` when the provider is not OpenAI Chat Completions compatible. Use `openai_responses` for Responses-compatible endpoints and `anthropic_messages` for Anthropic Messages endpoints:
  ```toml
  [[endpoint]]
  endpoint_id = "gpt-responses"
  model = "gpt-5.4-mini"
  url = "https://api.openai.com/v1"
  key = "OPENAI_API_KEY"
  api_client_type = "openai_responses"
  ```
## Publish Gate Before Large Runs
- After smoke tests pass and results look stable, proactively suggest pushing the environment to Hub before large eval sweeps or RL work.
- Ask the user explicitly: should visibility be `PUBLIC` or `PRIVATE`?
- Push with the chosen visibility:
  ```bash
  prime env push my-env --visibility PUBLIC
  ```
  or
  ```bash
  prime env push my-env --visibility PRIVATE
  ```
- For hosted environment workflows, prefer running large jobs against the Hub slug:
  ```bash
  prime eval run owner/my-env -m openai/gpt-4.1-mini -n 200 -r 3 -s
  ```
## Prefer Config-Driven Evals Beyond Smoke Tests
- For anything beyond quick checks, nudge the user to create an eval TOML config.
- Use config files to run multiple evals in one command and keep runs reproducible (a sketch of such a config follows this list):
  ```bash
  prime eval run configs/eval/my-benchmark.toml
  ```
- Make config files the default for benchmark sweeps, multi-model comparisons, and recurring reports.
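A minimal sketch of what such a config might look like. Only `env_id`, `headers`/`header`, and `[eval.extra_env_kwargs]` appear elsewhere in this skill, so the `model` key and the second environment slug below are illustrative assumptions to verify against the prime eval documentation before relying on them:

```toml
# configs/eval/my-benchmark.toml -- illustrative sketch; verify key names against the prime eval docs

[[eval]]
env_id = "owner/my-env"        # documented key
model = "gpt-4.1-mini"         # assumed key: model or endpoint alias for this entry

[eval.extra_env_kwargs]        # documented table for constructor kwargs
max_turns = 20

[[eval]]
env_id = "owner/another-env"   # placeholder slug for a second environment in the same suite
model = "qwen3-32b-i"          # assumed key
```

Rerunning `prime eval run configs/eval/my-benchmark.toml` then reproduces the whole suite, which is what makes config-driven runs easy to compare over time.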
## Common Evaluation Patterns
- Pass args to `load_environment()`:
  ```bash
  prime eval run my-env -a '{"difficulty":"hard"}'
  ```
- Override constructor kwargs:
  ```bash
  prime eval run my-env -x '{"max_turns":20}'
  ```
- Bound per-rollout wall-clock time (use the dedicated `--timeout` flag; it wins over `-x` and TOML `[eval.extra_env_kwargs]`; a config-level sketch follows this list):
  ```bash
  prime eval run my-env --timeout 600
  ```
- Save extra state columns:
  ```bash
  prime eval run my-env -s -C "judge_response,parsed_answer"
  ```
- Resume interrupted runs:
  ```bash
  prime eval run my-env -n 1000 -s --resume
  ```
- Save results to a custom output directory:
  ```bash
  prime eval run my-env -s -o /path/to/output
  ```
- Run multi-environment TOML suites:
  ```bash
  prime eval run configs/eval/my-benchmark.toml
  ```
- Pass extra HTTP headers via the CLI (repeatable):
  ```bash
  prime eval run my-env -m my-proxy --header "X-Custom-Header: value"
  ```
- Set headers in `[[eval]]` TOML configs as a table or a list (merge order: registry row < `headers` table < `header` list / `--header`):
  ```toml
  [[eval]]
  env_id = "my-env"
  headers = { "X-Custom-Header" = "value" }
  header = ["X-Another: val"]
  ```
- Run ablation sweeps using `[[ablation]]` blocks in TOML configs:
  ```toml
  [[ablation]]
  env_id = "my-env"

  [ablation.sweep]
  temperature = [0.0, 0.5, 1.0]

  [ablation.sweep.args]
  difficulty = ["easy", "hard"]
  ```
  This generates the cartesian product of the swept values (3 temperatures × 2 difficulties = 6 configs in this example). Use `--abbreviated-summary` (`-A`) for compact ablation results.
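For the `--timeout` precedence noted earlier in this list, a hedged illustration: the `[eval.extra_env_kwargs]` table and the `--timeout` flag are documented above, but the `timeout` key inside the table is an assumption about the environment's constructor kwargs, so substitute whatever kwarg the environment actually exposes:

```toml
[[eval]]
env_id = "my-env"

[eval.extra_env_kwargs]
timeout = 300   # assumed constructor kwarg name; illustrative only
```

Passing `--timeout 600` on the command line would then bound rollouts at 600 seconds, since the dedicated flag wins over both `-x` and this TOML value.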
## Inspect Saved Results
- Browse locally saved runs:
  ```bash
  prime eval tui
  ```
- Inspect platform-visible runs when needed:
  ```bash
  prime eval list
  prime eval get <eval-id>
  prime eval samples <eval-id>
  ```
## Metrics Interpretation
- Treat binary and continuous rewards differently.
- Use pass@k-style interpretation only when rewards are effectively binary (for example, with 4 rollouts per example, pass@1 is the mean success rate across all rollouts, while pass@4 is the fraction of examples solved at least once).
- For continuous rewards, focus on distribution shifts and per-task means.
- Always inspect sample-level outputs before concluding that a regression exists.
## Reliability Rules
- Keep environment/model/config fixed while comparing variants.
- Record exact command lines and key flags in the report.
- Call out missing credentials, endpoint mismatches, and dependency errors directly.
- Do not overinterpret tiny sample runs.
## Output Format
Return:
- Run configuration table.
- Aggregate metrics and key deltas.
- Sample-level failure themes.
- Clear recommendation: proceed, iterate environment, or retune model/sampling.