| name | create-environments |
| description | Create or migrate verifiers environments for the Prime Lab ecosystem. Use when asked to build a new environment from scratch, port an eval or benchmark from papers or other libraries, start from an environment on the Hub, or convert existing tasks into a package that exposes load_environment and installs cleanly with prime env install. |
Create Environments
Goal
Build production-quality verifiers environments that work immediately in the Prime ecosystem: install, load, evaluate, and train without hidden setup.
Start With Ecosystem Paths
- Prefer ecosystem-native setup before custom scaffolding.
- Use this default loop:
prime env init my-env --v1
prime env install my-env
prime eval run my-env -m openai/gpt-4.1-mini -n 5
Use prime env init my-env --v1 --with-harness when the environment owns an explicit reusable harness.
- Treat prime eval run as the canonical eval path. It saves results automatically, so do not add --skip-upload unless the user explicitly requests that deviation.
- Prefer an existing environment as a starting point when possible:
prime env list --search "keyword"
prime env info owner/name
prime env install owner/name
- For repository examples, use repo install when available:
prime env install math-python --from-repo
- Encourage users to keep endpoint aliases in configs/endpoints.toml so smoke tests can switch models quickly.
- Ask users whether they want instruct or reasoning models for validation.
- Instruct-first smoke choices: gpt-4.1 series, qwen3 instruct series.
- Reasoning validation choices: gpt-5 series, qwen3 thinking series, glm series.
Build Modes
1. Build From Scratch
- Define task contract first: prompt shape, allowed tools, stop conditions, rubric outputs, metrics.
- Select the smallest correct base class:
SingleTurnEnv for one-response tasks.
MultiTurnEnv for custom interaction loops.
ToolEnv or MCPEnv for stateless tools.
StatefulToolEnv for per-rollout resources.
CliAgentEnv for running agent binaries in sandboxes with API interception. Override get_sandbox_resources(state) for per-instance resources and build_env_vars(state) for custom env vars.
- V1: vf.Env with explicit vf.Taskset/vf.Harness objects, the current taskset/harness environment pattern that separates the task collection from the rollout runner. Use this for new taskset/harness work that needs config-driven metrics, rewards, toolsets, user functions, endpoint interception, or sandboxed Python/command programs. Framework programs should build clients from state.get_endpoint_config(api="chat").
- For v1, start from the generated template. Edit TasksetConfig for task settings, Taskset.load_tasks() for task records, Taskset.load_toolsets() for task-owned tools, User subclasses for user behavior, and @vf.* methods for lifecycle, metrics, rewards, and advantages. Add a harness class only for reusable execution behavior.
- Keep load_environment(config: vf.EnvConfig) as the canonical Taskset/Harness shim:
def load_environment(config: vf.EnvConfig) -> vf.Env:
    """Loader pattern for all Taskset/Harness environments."""
    return vf.Env(
        taskset=vf.load_taskset(config=config.taskset),
        harness=vf.load_harness(config=config.harness),
    )
- For v0 environments, keep the existing vf.Environment patterns and preserve v0 compatibility.
- Add pyproject.toml defaults in [tool.verifiers.eval] only when stable.
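The task contract named at the start of this list can be written down before any environment code exists. The sketch below is one way to do that in plain Python. TaskContract and missing_fields are hypothetical planning helpers, not part of the verifiers API, and the choice of which fields count as required is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskContract:
    """Hypothetical planning record; not a verifiers class."""

    prompt_shape: str                      # e.g. "single user message"
    allowed_tools: tuple[str, ...] = ()    # empty tuple means a tool-free task
    stop_conditions: tuple[str, ...] = ()  # e.g. ("max_turns=1",)
    rubric_outputs: tuple[str, ...] = ()   # names of reward functions
    metrics: tuple[str, ...] = ()          # logged values that carry no reward weight

    def missing_fields(self) -> list[str]:
        """Return the contract fields that are still undecided."""
        missing = []
        if not self.prompt_shape.strip():
            missing.append("prompt_shape")
        if not self.stop_conditions:
            missing.append("stop_conditions")
        if not self.rubric_outputs:
            missing.append("rubric_outputs")
        return missing


# Illustrative contract for a one-response task.
contract = TaskContract(
    prompt_shape="single user message",
    stop_conditions=("max_turns=1",),
    rubric_outputs=("correct_answer",),
)
print(contract.missing_fields())
```

An empty result suggests the contract is complete enough to pick a base class; any listed field is a decision to settle with the user first.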
V1 Authoring Rules
- Keep v1 environment entrypoints tiny: import verifiers as vf, define TasksetConfig / optional HarnessConfig subclasses for user-facing knobs, define Taskset / optional Harness classes, then expose typed child loaders and the canonical load_environment(config: vf.EnvConfig) shim that delegates through vf.load_taskset and vf.load_harness.
- Keep shared dependencies behind the taskset or harness that owns them. Use bindings as the canonical injection path; prefer serializable loader paths for bound objects in config, and use no-arg loader callables only for Python-only construction. Do not pass already-instantiated resource objects through environment loaders. Do not introduce v1 Parser/Rubric wrappers; parsing is ordinary Python.
- Use vf.get_messages(state.get("completion") or [], role="assistant") when reading state completions. The helper returns typed message objects and should not receive None.
- Use program.channels for v1 program protocol/channel selection. Do not use stale program.tools terminology.
- Use generated child loaders as typed component entrypoints. Add implementation behavior to the taskset or harness class through config fields, load_* methods, User subclasses, Toolset, and @vf.* lifecycle methods.
- Put settings as leaf fields on the taskset or harness config that owns them.
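The `or []` guard in the completion-reading rule above matters because a dict default does not cover a key that is present with a None value. The plain-Python sketch below shows the difference. assistant_texts is a hypothetical stand-in that works on dict messages; it is not vf.get_messages, which returns typed message objects.

```python
def assistant_texts(messages: list[dict]) -> list[str]:
    """Hypothetical stand-in for a role filter; it iterates, so it cannot accept None."""
    return [
        str(m.get("content") or "")
        for m in messages
        if m.get("role") == "assistant"
    ]


# Assumed for illustration: the key exists but holds None.
state = {"completion": None}

# dict.get's default applies only when the key is absent, so this is still None.
assert state.get("completion", []) is None

# `or []` also covers the present-but-None case.
messages = state.get("completion") or []
print(assistant_texts(messages))  # → []
```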
V1 Taskset/Harness Shape
- Put task data, task-owned tools, user behavior, metrics, rewards, and task-specific configuration on the Taskset.
- Use the base vf.Harness unless the harness owns a reusable execution adapter such as a CLI, framework program, sandboxed program, or nested harness flow.
- Avoid one-off harness classes whose only purpose is to hold task behavior. That behavior belongs behind the taskset.
- Keep small example environments direct. Do not add private helper layers, duplicate loader paths, or optional knobs unless they clarify a real reusable boundary.
- Use the current config shape consistently:
[[eval]]
env_id = "owner/my-env"
[eval.taskset]
num_examples = 100
[eval.harness]
max_turns = 8
For package-only composition, omit env_id and select loader packages through child config ids:
[[eval]]
[eval.taskset]
id = "tasksets.harbor"
tasks_dir = "tasks"
[eval.harness]
id = "harnesses.opencode"
max_turns = 8
- In code, use the current class-based config shape:
import verifiers as vf

class MyTasksetConfig(vf.TasksetConfig):
    system_prompt: vf.SystemPrompt = "Answer exactly."

class MyTaskset(vf.Taskset[MyTasksetConfig]):
    def load_tasks(self, split: vf.TaskSplit = "train") -> vf.Tasks:
        """Return serializable task records as a list, generator, or Dataset."""
        if split == "eval":
            return []
        return [
            {
                "prompt": [{"role": "user", "content": "Reverse abc."}],
                "answer": "cba",
                "max_turns": 1,
            }
        ]

    @vf.reward(weight=1.0)
    async def correct_answer(self, task: vf.Task, state: vf.State) -> float:
        messages = vf.get_messages(state.get("completion") or [], role="assistant")
        if not messages:
            return 0.0
        response = str(messages[-1].content or "").strip()
        return float(response == task["answer"])

def load_taskset(config: MyTasksetConfig) -> MyTaskset:
    return MyTaskset(config=config)

def load_environment(config: vf.EnvConfig) -> vf.Env:
    """Loader pattern for all Taskset/Harness environments."""
    return vf.Env(
        taskset=vf.load_taskset(config=config.taskset),
        harness=vf.load_harness(config=config.harness),
    )
- Use prime env init my-env --v1 as the reference shape when an implementation starts to drift.
2. Port From Another Library, Project, or Paper
- Create a strict source-to-target mapping before coding:
- dataset rows and splits
- prompt rendering and role ordering
- tool I/O schema and stop logic
- scoring math and aggregation
- pass/fail thresholds and special cases
- Preserve one-to-one logical equivalence for what the model sees and what gets scored.
- Never invent unresolved formatting decisions. Ask the user to decide explicitly.
- Benchmark runtime and remove avoidable bottlenecks before handoff.
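One way to check the one-to-one equivalence described above is to run the original and the ported logic over the same fixed cases and list every index where they disagree. The sketch below does this for a scoring function. find_mismatches, source_score, and ported_score are hypothetical names, and the case-insensitive port is a deliberately wrong example.

```python
from typing import Any, Callable, Sequence


def find_mismatches(
    cases: Sequence[tuple],
    source_fn: Callable[..., Any],
    target_fn: Callable[..., Any],
) -> list[int]:
    """Return indices of cases where the port disagrees with the source."""
    return [
        i for i, case in enumerate(cases)
        if source_fn(*case) != target_fn(*case)
    ]


def source_score(row: dict, response: str) -> float:
    """Hypothetical original benchmark scoring: exact match after stripping."""
    return float(response.strip() == row["answer"])


def ported_score(row: dict, response: str) -> float:
    """Hypothetical port that silently became case-insensitive."""
    return float(response.strip().lower() == row["answer"].lower())


row = {"answer": "cba"}
cases = [(row, "cba"), (row, "CBA"), (row, " cba ")]

# The second case exposes the drift: the source scores 0.0, the port 1.0.
print(find_mismatches(cases, source_score, ported_score))  # → [1]
```

The same helper can compare prompt renderers by passing functions that return the exact text or message list the model would see.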
3. Start From Hub Environment
- Install or pull the closest baseline:
prime env install owner/name
prime env pull owner/name -t ./tmp-env
- Keep proven interfaces stable unless a migration is deliberate and explicit.
- Re-run smoke evals after each major change.
Non-Negotiable Quality Rules
- Use deterministic, well-defined reward checks or LLM judges.
- Avoid best-effort deterministic heuristics such as keyword style checks except as an explicit last resort with user sign-off.
- Make environments self-contained after install. Do not require users to run background servers before load_environment().
- Manage external resources inside the environment lifecycle.
- Validate required secrets in load_environment() via vf.ensure_keys(...).
- Surface feature limits directly. Do not ship hacky workarounds without explicit user approval.
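The difference between a well-defined reward check and a best-effort keyword heuristic can be seen in a small comparison. The sketch below assumes a hypothetical response format in which the final answer is wrapped in <answer> tags; both function names are illustrative and neither comes from verifiers.

```python
import re

# Assumed output format for this sketch: exactly one <answer>...</answer> span.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def exact_answer_reward(response: str, expected: str) -> float:
    """Score 1.0 only when exactly one tagged answer equals the expected string."""
    matches = ANSWER_RE.findall(response)
    if len(matches) != 1:
        return 0.0  # a missing or ambiguous answer is a defined failure, not a guess
    return float(matches[0].strip() == expected)


def keyword_reward(response: str, expected: str) -> float:
    """Best-effort heuristic shown only for contrast: substring presence."""
    return float(expected in response)


wrong = "The answer is definitely not cba."
print(exact_answer_reward("<answer>cba</answer>", "cba"))  # → 1.0
print(exact_answer_reward(wrong, "cba"))                   # → 0.0
print(keyword_reward(wrong, "cba"))                        # → 1.0 (false positive)
```

The keyword version rewards a response that explicitly rejects the right answer, which is the kind of failure the rule above is meant to prevent.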
Verification Gate
Run these before claiming completion:
prime env install my-env
prime eval run my-env -m openai/gpt-4.1-mini -n 5
prime eval run my-env -m openai/gpt-4.1-mini -n 50 -r 1 -s
If the environment is multi-turn or tool-heavy, also run with more rollouts per example:
prime eval run my-env -m openai/gpt-4.1-mini -n 30 -r 3 -s
For repo example environments, also use the package-install path when packaging or dependencies changed:
uv run pytest tests/test_envs.py -k my_env -vv
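The prime commands in this gate can be assembled programmatically so that the same sequence is used on every run. The sketch below only builds and prints the argument lists; verification_commands is a hypothetical helper, while the flags and example counts are copied from the commands above.

```python
import shlex


def verification_commands(
    env_id: str,
    model: str = "openai/gpt-4.1-mini",
    multi_turn: bool = False,
) -> list[list[str]]:
    """Build the gate commands as argv lists, in the order they should run."""
    commands = [
        ["prime", "env", "install", env_id],
        ["prime", "eval", "run", env_id, "-m", model, "-n", "5"],
        ["prime", "eval", "run", env_id, "-m", model, "-n", "50", "-r", "1", "-s"],
    ]
    if multi_turn:
        # Extra pass for multi-turn or tool-heavy environments.
        commands.append(
            ["prime", "eval", "run", env_id, "-m", model, "-n", "30", "-r", "3", "-s"]
        )
    return commands


for argv in verification_commands("my-env", multi_turn=True):
    print(shlex.join(argv))
```

To execute the gate rather than print it, each argv list could be passed to subprocess.run(argv, check=True), which stops at the first failing command.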
Publish Gate Before Large Evals Or Training
- After smoke tests pass and behavior is stable, recommend pushing to Hub before large evals or RL training.
- Ask the user explicitly whether visibility should be PUBLIC or PRIVATE.
- Use:
prime env push my-env --visibility PUBLIC
or
prime env push my-env --visibility PRIVATE
- For hosted or large-scale workflows, prefer running with the Hub slug after push:
prime eval run owner/my-env -m openai/gpt-4.1-mini -n 200 -r 3 -s
Synthetic Data
- Ask users for preferences on which LLMs to use for synthetic data generation and curation before implementation.
- Prefer generating synthetic data from raw source documents whenever possible instead of relying only on hand-authored prompts.
- Use LLM orchestration (planner/generator/validator loops) to improve sample quality and diversity.
- Use back-translation: start from complete materials and decompose them into incomplete tasks, criteria, or partial artifacts that the model must reconstruct.
- Use fan-out subtopic sampling from LLMs to expand coverage and avoid overfitting to a narrow slice of the domain.
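The back-translation idea above can be illustrated without any LLM calls: take a complete document, hide one part, and keep the hidden part as the reference answer. The sketch below is a minimal version under stated assumptions. back_translate and the [MISSING] marker are hypothetical, paragraphs are assumed to be separated by blank lines, and the record shape (prompt plus answer) follows the task example earlier in this document.

```python
def back_translate(document: str, hidden_index: int) -> dict:
    """Hide one paragraph of a complete document and keep it as the answer."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    if not 0 <= hidden_index < len(paragraphs):
        raise IndexError("hidden_index is out of range")

    answer = paragraphs[hidden_index]
    visible = paragraphs.copy()
    visible[hidden_index] = "[MISSING]"  # placeholder the model must fill in

    prompt = "Fill in the [MISSING] paragraph.\n\n" + "\n\n".join(visible)
    return {
        "prompt": [{"role": "user", "content": prompt}],
        "answer": answer,
    }


doc = "Install the package.\n\nRun the smoke eval.\n\nPush to the Hub."
task = back_translate(doc, hidden_index=1)
print(task["answer"])  # → Run the smoke eval.
```

Free-form reconstructions rarely match the reference string exactly, so tasks built this way would likely need an LLM judge or a structured answer format for scoring.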
Deliverable Format
Report:
- Environment ID and path.
- Exact install and eval commands used.
- Port-equivalence notes if migrated.
- Any unresolved user decisions that block strict fidelity.