| name | train-with-environments |
| description | Train models with verifiers environments using hosted RL or prime-rl. Use when asked to configure RL runs, tune key hyperparameters, diagnose instability, set up difficulty filtering and oversampling, or create practical train and eval loops for new environments. |
Train With Environments
Goal
Run stable RL training loops with environment-aware hyperparameter choices and clear diagnostics.
Preferred Training Paths
- By default, assume users intend to use Hosted Training unless they explicitly ask for self-managed training.
- Hosted Training service path from lab setup: `prime lab setup`
- Self-managed `prime-rl` workflow:
  prime lab setup --prime-rl
  uv run prime-rl configs/prime-rl/wiki-search.toml
- Treat `prime-rl` as a power-user path and assume users are comfortable working with GPU infrastructure and troubleshooting.
- Runtime expectations:
  - Hosted Training is intended to be launched from a CPU machine.
  - Local `prime-rl` training requires local GPU access.
Endpoint Shortcuts And Model Family Choice
- Encourage users to maintain endpoint aliases in `configs/endpoints.toml` for eval and train loops; see the sketch below.
- Ask whether they want instruct or reasoning models for validation runs before training.
- Instruct go-tos for behavior checks: `gpt-4.1` series, qwen3 instruct series.
- Reasoning go-tos for harder reasoning-heavy probes: `gpt-5` series, qwen3 thinking series, glm series.
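A minimal sketch of an alias entry in configs/endpoints.toml. The field names (model, url, key) are illustrative assumptions rather than a confirmed schema; match them to whatever your CLI version actually expects:
["gpt-4.1-mini"]                     # endpoint alias name
model = "openai/gpt-4.1-mini"        # model id the alias resolves to (assumed field name)
url = "https://api.openai.com/v1"    # inference base URL (assumed field name)
key = "OPENAI_API_KEY"               # env var holding the API key (assumed field name)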
First-Run Protocol
- Validate environment behavior before training with the canonical eval path. Keep the default save behavior and do not add `--skip-upload` unless the user explicitly requests that deviation:
  prime env install my-env
  prime eval run my-env -m openai/gpt-4.1-mini -n 20 -r 3 -s
- For v1 Taskset + Harness environments, verify the package still exposes `load_environment(...) -> vf.Environment`; trainers interact with the same environment boundary even when the implementation is BYO Harness internally.
- Confirm reward diversity exists at baseline.
- Start with conservative run length and inspect samples early.
Publish Gate Before RL
- Before long training runs, proactively recommend pushing the environment to the Hub once smoke evals are stable.
- Ask the user explicitly whether visibility should be `PUBLIC` or `PRIVATE`.
- Push with the chosen visibility: `prime env push my-env --visibility PUBLIC` or `prime env push my-env --visibility PRIVATE`.
- For hosted RL and shared workflows, prefer Hub IDs after push (for example `owner/my-env` in the config's `[[env]].id`).
RL TOML Environment Sections
- Use the same environment config shape for Hosted Training and `prime-rl`.
- Put normal `load_environment(...)` named args in `[env.args]`.
- Put v1 taskset config in `[env.taskset]` and v1 harness config in `[env.harness]`.
- Keep model, endpoint, sampling, rollout count, and trainer controls outside the environment sections unless configuring a nested or auxiliary harness model.
[[env]]
id = "owner/my-env"
[env.args]
split = "train"
[env.taskset]
num_examples = 1000
[env.harness]
max_turns = 8
Hyperparameter Rules Of Thumb
- Use `rollouts_per_example` and `batch_size` together.
- Treat `batch_size` as the total number of rollout samples per step, not the number of groups.
- Keep `batch_size` divisible by `rollouts_per_example`.
- Quick tests or simpler environments: `rollouts_per_example = 8`, `batch_size = 128` (or lower).
- More complex or longer-horizon environments: `rollouts_per_example = 16`, `batch_size = 512` (a common strong starting point); see the sketch below.
- Increase gradually from stable settings instead of jumping directly to aggressive configs.
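A starting-point sketch for the two knobs above. The exact table these keys live in depends on the trainer config, so treat placement as illustrative:
# complex or longer-horizon environment starting point
rollouts_per_example = 16
batch_size = 512                 # 512 / 16 = 32 prompt groups per training step
# quick-test starting point, for comparison:
# rollouts_per_example = 8
# batch_size = 128               # 128 / 8 = 16 prompt groups per step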
Difficulty Filtering And Oversampling
- For mostly binary rewards, enable difficulty filtering and consider oversampling: `buffer.online_difficulty_filtering = true` with `oversampling_factor > 1` (for example 2.0); see the sketch below.
- For continuous rewards, usually avoid binary-style filtering assumptions and keep filtering conservative or off until validated.
- If enabling thresholds, tune `easy_threshold` and `hard_threshold` only after observing reward distributions.
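An illustrative buffer block for a mostly binary reward, using the key names above. Whether the oversampling and threshold keys live under [buffer] is an assumption to verify against your trainer config:
[buffer]
online_difficulty_filtering = true
oversampling_factor = 2.0        # sample extra groups so the batch stays full after filtering
# easy_threshold = 0.9           # enable only after inspecting the reward distribution
# hard_threshold = 0.1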
Stability Constraints From Prime-RL
- Ensure `max_concurrent >= rollouts_per_example * workers_per_env`; see the sketch below.
- Keep the async level explicit (`max_async_level`) and monitor off-policy drift.
- For OOM risk, reduce rollout pressure and sequence lengths before widening training scope.
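A concurrency sketch that satisfies the constraint above; values and key placement are illustrative:
rollouts_per_example = 16
workers_per_env = 4
max_concurrent = 64              # must stay >= rollouts_per_example * workers_per_env (16 * 4)
max_async_level = 2              # keep explicit; raise only while monitoring off-policy drift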
Failure Diagnosis
- Flat reward near zero:
  - Task too hard, rubric mismatch, or prompt/tool contract mismatch.
- Unstable reward swings:
  - Lower the learning rate, increase rollout group size, and reduce async aggressiveness; see the delta sketch below.
- Slow learning despite stability:
  - Revisit task difficulty and reward shaping before increasing risk knobs.
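For the unstable-reward case, a sketch of the kind of conservative delta to apply. The learning-rate key name is a placeholder and exact key placement depends on your trainer config:
rollouts_per_example = 16        # up from 8: larger rollout groups reduce advantage noise
max_async_level = 1              # tighter async level keeps updates closer to on-policy
# learning_rate = 1e-6           # placeholder key name: roughly halve the previous value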
Non-Negotiable Environment Quality During Training
- Use deterministic robust checks or LLM judges for rewards.
- Reject best-effort keyword heuristics unless explicitly approved as last resort.
- Keep environments self-contained after install; no user-managed background services.
- Surface feature limitations directly instead of proposing hidden workarounds.
Deliverable
Return:
- Config deltas applied.
- Why each delta was chosen.
- Observed metrics and failure signatures.
- Next tuning step with stop conditions.