Name: Evaluate Environments
Author: PrimeIntellect-ai

// Run and analyze evaluations for verifiers environments using prime eval. Use when asked to smoke-test environments, run benchmark sweeps, resume interrupted evaluations, compare models, inspect sample-level outputs, or produce evaluation summaries suitable for deciding next steps.


stars:4,143

forks:553

updated: 2026-05-29 23:42


related-skills.json

Same repository

create-environments.md

from "PrimeIntellect-ai/verifiers"

Create or migrate verifiers environments for the Prime Lab ecosystem. Use when asked to build a new environment from scratch, port an eval or benchmark from papers or other libraries, start from an environment on the Hub, or convert existing tasks into a package that exposes load_environment and installs cleanly with prime env install.

2026-05-30 · 4.1k

review-environments.md

from "PrimeIntellect-ai/verifiers"

Review verifiers environments for correctness, robustness, and ecosystem compatibility. Use when asked for environment code review, quality audit, migration validation, or release readiness checks for local environments or environments pulled from the Hub.

2026-05-30 · 4.1k

browse-environments.md

from "PrimeIntellect-ai/verifiers"

Discover and inspect verifiers environments through the Prime ecosystem. Use when asked to find environments on the Hub, compare options, inspect metadata, check action status, pull local copies for inspection, or choose environment starting points before evaluation, training, or migration work.

2026-05-30 · 4.1k

train-with-environments.md

from "PrimeIntellect-ai/verifiers"

Train models with verifiers environments using hosted RL or prime-rl. Use when asked to configure RL runs, tune key hyperparameters, diagnose instability, set up difficulty filtering, or create practical train and eval loops for new environments.

2026-05-30 · 4.1k

optimize-with-environments.md

from "PrimeIntellect-ai/verifiers"

Optimize environment system prompts with GEPA through prime gepa run. Use when asked to improve prompt performance without gradient training, compare baseline versus optimized prompts, run GEPA from CLI or TOML configs, or interpret GEPA outputs before deployment.

2026-05-14 · 4.1k

brainstorm.md

from "PrimeIntellect-ai/verifiers"

Run interactive brainstorming across verifiers environments, evaluations, GEPA, and RL training. Use when the user wants ideation, literature scanning, concept teaching, roadmap planning, or research program design grounded in local CLI sources, verifiers, and RL trainer code.

2026-05-06 · 4.1k

package.json

"author": "PrimeIntellect-ai"

"repository": "PrimeIntellect-ai/verifiers"


name	evaluate-environments
description	Run and analyze evaluations for verifiers environments using prime eval. Use when asked to smoke-test environments, run benchmark sweeps, resume interrupted evaluations, compare models, inspect sample-level outputs, or produce evaluation summaries suitable for deciding next steps.

Evaluate Environments

Goal

Run reliable environment evaluations and produce actionable summaries, not raw logs.

Canonical Eval Path

Use prime eval run as the default way to run evaluations.
Do not add --skip-upload or other opt-out flags unless the user explicitly requests that deviation.
Standard prime eval run invocations save results automatically, keeping them available in the user's private Evaluations tab and locally through prime eval view.
For Prime Inference models with available pricing, eval output and saved metadata include estimated total-run USD cost automatically; no extra flags or API-key handling are needed.

Core Loop

Run a smoke evaluation first (do not require pre-install):

prime eval run my-env -m openai/gpt-4.1-mini -n 5

Use owner/env slug directly when evaluating Hub environments:

prime eval run owner/my-env -m openai/gpt-4.1-mini -n 5

Scale only after smoke pass:

prime eval run owner/my-env -m openai/gpt-4.1-mini -n 200 -r 3 -s

Treat ownerless env ids as local-first. If not found locally, rely on Prime resolution for your remote env where applicable.
When the user asks for a "real" or "base" eval, do not substitute a tiny smoke run. Use the requested model/env and make the run size explicit before interpreting results.
If the user says the defaults are fine or asks for no flags, use the shortest canonical command and rely on global config:

prime eval run my-env
prime eval run my-env -m openai/gpt-4.1-mini

Endpoint Shortcuts And Model Family Choice

Encourage users to define endpoint aliases in configs/endpoints.toml so model, base URL, and key wiring stay reusable.
Use aliases via -m <endpoint_id> instead of repeating -b and -k.
Ask users explicitly whether they want an instruct or reasoning model before non-trivial evaluations.
Instruct go-tos for quick behavior checks: gpt-4.1 series and qwen3 instruct series.
Reasoning go-tos for deeper test coverage: gpt-5 series, qwen3 thinking series, and glm series.
Example endpoint registry:

[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"

[[endpoint]]
endpoint_id = "qwen3-32b-i"
model = "qwen/qwen3-32b-instruct"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"

Endpoint entries support optional headers (or extra_headers) for custom HTTP headers sent with inference requests:

[[endpoint]]
endpoint_id = "my-proxy"
model = "gpt-4.1-mini"
url = "https://api.example/v1"
key = "OPENAI_API_KEY"
headers = { "X-Custom-Header" = "value" }

Endpoint entries support api_client_type when the provider is not OpenAI Chat Completions compatible. Use openai_responses for Responses-compatible endpoints and anthropic_messages for Anthropic Messages endpoints:

[[endpoint]]
endpoint_id = "gpt-responses"
model = "gpt-5.4-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"
api_client_type = "openai_responses"

Publish Gate Before Large Runs

After smoke tests pass and results look stable, proactively suggest pushing the environment to Hub before large eval sweeps or RL work.
Ask the user explicitly: should visibility be PUBLIC or PRIVATE?
Push with chosen visibility:

prime env push my-env --visibility PUBLIC

prime env push my-env --visibility PRIVATE

For hosted environment workflows, prefer running large jobs against the Hub slug:

prime eval run owner/my-env -m openai/gpt-4.1-mini -n 200 -r 3 -s

Prefer Config-Driven Evals Beyond Smoke Tests

For anything beyond quick checks, nudge the user to create an eval TOML config.
Use config files to run multiple evals in one command and keep runs reproducible:

prime eval run configs/eval/my-benchmark.toml

Make config files the default for benchmark sweeps, multi-model comparisons, and recurring reports.
Use name on individual [[eval]] entries when the same environment appears multiple times. id selects the environment to load; name labels the run in displays, summaries, metadata, and saved result paths.

Common Evaluation Patterns

For single-environment v1 smoke runs, override typed taskset and harness config with dotted flags:

prime eval run my-env --taskset.difficulty hard --harness.max-turns 20

For reproducible or multi-eval v1 config, put the same settings in TOML child sections:

[[eval]]
id = "my-env"

[eval.taskset]
difficulty = "hard"

[eval.harness]
max_turns = 20

Override legacy/v0 constructor kwargs only when the environment still exposes them; for v1, use taskset/harness config instead:

prime eval run my-env -x '{"max_turns":20}'

Bound per-rollout wall-clock time with the dedicated --timeout flag, which takes precedence over both -x and TOML [eval.extra_env_kwargs]:

prime eval run my-env --timeout 600

Save extra state columns:

prime eval run my-env -s -C "judge_response,parsed_answer"

Resume interrupted runs:

prime eval run my-env -n 1000 -s --resume

Save results to a custom output directory:

prime eval run my-env -s -o /path/to/output

Run multi-environment TOML suites:

prime eval run configs/eval/my-benchmark.toml

Run the same environment more than once with different args by giving each entry a name:

[[eval]]
id = "reverse-text"
name = "reverse-text-short"

[eval.args]
max_length = 32

[[eval]]
id = "reverse-text"
name = "reverse-text-long"

[eval.args]
max_length = 256

Put generation parameters in TOML sampling sections:

[sampling]
max_tokens = 1024
temperature = 0.7
reasoning_effort = "medium"
enable_thinking = true

[[eval]]
env_id = "my-env"

Use [eval.sampling] for per-eval overrides. [sampling] is shorthand for sampling_args; reasoning_effort and enable_thinking stay top-level and are mirrored into extra_body.chat_template_kwargs.

Pass extra HTTP headers via CLI (repeatable):

prime eval run my-env -m my-proxy --header "X-Custom-Header: value"

Set headers in [[eval]] TOML configs as a table or list (merge order: registry row < headers table < header list / --header):

[[eval]]
env_id = "my-env"
headers = { "X-Custom-Header" = "value" }
header = ["X-Another: val"]
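The merge order above (registry row, then headers table, then header list or --header) can be pictured as successive dictionary updates where later sources win. This is only an illustrative sketch: merge_headers is a hypothetical function, not the CLI's implementation, and it compares header names exactly rather than case-insensitively.

```python
def merge_headers(
    registry_row: dict[str, str],
    headers_table: dict[str, str],
    header_list: list[str],
) -> dict[str, str]:
    """Merge header sources so that later sources override earlier ones."""
    merged = dict(registry_row)      # lowest precedence: endpoint registry row
    merged.update(headers_table)     # overrides the registry row
    for item in header_list:         # highest precedence: "Name: value" strings
        name, _, value = item.partition(":")
        merged[name.strip()] = value.strip()
    return merged


result = merge_headers(
    {"X-Custom-Header": "from-registry", "X-Team": "evals"},
    {"X-Custom-Header": "from-table"},
    ["X-Another: val"],
)
print(result)
```

In this example the registry value for X-Custom-Header is replaced by the table value, while X-Team survives because nothing later overrides it.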

Run ablation sweeps using [[ablation]] blocks in TOML configs:

[[ablation]]
env_id = "my-env"

[ablation.sweep]
temperature = [0.0, 0.5, 1.0]

[ablation.sweep.taskset]
difficulty = ["easy", "hard"]

This generates the Cartesian product of all swept values (3 temperatures × 2 difficulties = 6 configs in this example). Sweep v1 environment-owned settings under taskset or harness, not as root args. Use --abbreviated-summary (-A) for compact ablation results.
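To sanity-check how many runs a sweep will launch before starting it, the product can be enumerated by hand. A minimal sketch, assuming the sweep is flattened into a plain dict; the dotted key "taskset.difficulty" and the expand_sweep helper are illustrative only, and the order in which the CLI actually enumerates configs may differ.

```python
from itertools import product

# Swept values from the example above, flattened into one mapping.
sweep = {
    "temperature": [0.0, 0.5, 1.0],
    "taskset.difficulty": ["easy", "hard"],
}


def expand_sweep(sweep: dict[str, list]) -> list[dict]:
    """Return one config dict per combination of swept values."""
    keys = list(sweep)
    return [
        dict(zip(keys, combo))
        for combo in product(*(sweep[key] for key in keys))
    ]


configs = expand_sweep(sweep)
print(len(configs))
for config in configs:
    print(config)
```

Adding a third swept key with four values would multiply the count to 24, which is worth knowing before committing budget to the sweep.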

Inspect Saved Results

Browse locally saved runs:

prime eval view

Check metadata.json for aggregate token usage and, when available, total-run cost.input_usd, cost.output_usd, and cost.total_usd.
Inspect platform-visible runs when needed:

prime eval list
prime eval get <eval-id>
prime eval samples <eval-id>
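For reports, the cost fields can be pulled out of a saved run's metadata.json with a few lines of Python. This sketch assumes the dotted names above correspond to a nested "cost" object, which may not match the real file layout exactly; load_metadata and summarize_cost are hypothetical helpers and the example values are made up for illustration.

```python
import json
from pathlib import Path


def load_metadata(run_dir: str) -> dict:
    """Read metadata.json from a saved run directory."""
    return json.loads((Path(run_dir) / "metadata.json").read_text())


def summarize_cost(metadata: dict) -> str:
    """Format the cost fields, or say so when the run has no pricing data."""
    cost = metadata.get("cost")  # assumed nesting: {"cost": {"input_usd": ...}}
    if not cost:
        return "cost: not reported"
    return "cost: input ${:.4f} + output ${:.4f} = total ${:.4f}".format(
        cost["input_usd"], cost["output_usd"], cost["total_usd"]
    )


# Illustrative values only, not taken from a real run.
example = {"cost": {"input_usd": 0.0123, "output_usd": 0.0456, "total_usd": 0.0579}}
print(summarize_cost(example))
print(summarize_cost({}))
```

The "not reported" branch matters because cost is only present when pricing is available for the model.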

Metrics Interpretation

Treat binary and continuous rewards differently.
Use pass@k-style interpretation only when rewards are effectively binary.
For continuous rewards, focus on distribution shifts and per-task means.
Always inspect samples before concluding regressions.
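The two reward regimes call for different arithmetic. The sketch below uses the commonly cited unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for binary rewards and plain per-task means for continuous ones. It is an illustration of the interpretation rules above, not a description of what prime eval computes internally, and both function names are hypothetical.

```python
from collections import defaultdict
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n rollouts, c of which passed."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def per_task_means(samples: list[tuple[str, float]]) -> dict[str, float]:
    """Average continuous rewards per task id instead of thresholding them."""
    by_task: dict[str, list[float]] = defaultdict(list)
    for task_id, reward in samples:
        by_task[task_id].append(reward)
    return {task_id: mean(rewards) for task_id, rewards in by_task.items()}


print(pass_at_k(n=3, c=1, k=1))
print(per_task_means([("a", 0.25), ("a", 0.75), ("b", 1.0)]))
```

With -r 3 and one passing rollout, pass@1 is one third; applying the same estimator to continuous rewards would require picking a threshold, which is why per-task means are the safer summary there.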

Reliability Rules

Keep environment/model/config fixed while comparing variants.
Record exact command lines and key flags in the report.
Call out missing credentials, endpoint mismatches, and dependency errors directly.
Do not overinterpret tiny sample runs.
Distinguish a completed rollout with poor reward from an environment/runtime failure.
For timeout debugging, check the environment's own timeout behavior and the outer sandbox/eval timeout before changing reward logic.
For repo example changes, use tests/test_envs.py -k <env> when package installability is part of the risk, not just prime eval run from the current checkout.

Output Format

Return:

Run configuration table.
Aggregate metrics and key deltas.
Sample-level failure themes.
Clear recommendation: proceed, iterate environment, or retune model/sampling.
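For the "key deltas" item, a small helper keeps the comparison explicit. This is a sketch under assumptions: the metric names reward_mean and pass_rate and all values are invented for illustration, and metric_deltas is a hypothetical function rather than anything the CLI provides.

```python
def metric_deltas(
    baseline: dict[str, float], candidate: dict[str, float]
) -> dict[str, float]:
    """Return candidate minus baseline for every metric both runs report."""
    shared = sorted(set(baseline) & set(candidate))
    return {name: round(candidate[name] - baseline[name], 4) for name in shared}


# Illustrative aggregate metrics, not real results.
baseline = {"reward_mean": 0.55, "pass_rate": 0.40}
candidate = {"reward_mean": 0.62, "pass_rate": 0.45, "avg_turns": 7.0}

deltas = metric_deltas(baseline, candidate)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

Metrics reported by only one run (avg_turns here) are skipped rather than treated as zero, so a missing column cannot masquerade as a regression.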
