---
name: optimize-with-environments
description: Optimize environment system prompts with GEPA through prime gepa run. Use when asked to improve prompt performance without gradient training, compare baseline versus optimized prompts, run GEPA from CLI or TOML configs, or interpret GEPA outputs before deployment.
---
# Optimize With Environments
## Goal

Use GEPA to optimize system prompts in a controlled, reproducible loop.
## Scope

The current GEPA path covers system prompt optimization only. If the user asks for unsupported optimization targets, stop and clarify before proceeding.
## Endpoint And Model Selection Nudge

- Encourage users to define reusable aliases in `configs/endpoints.toml`.
- Ask whether optimization should be validated on instruct or reasoning models.
- Instruct go-tos: gpt-4.1 series, qwen3 instruct series.
- Reasoning go-tos: gpt-5 series, qwen3 thinking series, glm series.
- For benchmark reporting, keep the model family fixed between baseline and optimized comparisons unless the user requests a cross-family study.
- Endpoint entries support optional `headers` (or `extra_headers`) for custom HTTP headers. GEPA inherits these from the registry for both the main model and the reflection model:

```toml
[[endpoint]]
endpoint_id = "my-proxy"
model = "gpt-4.1-mini"
url = "https://api.example/v1"
key = "OPENAI_API_KEY"
headers = { "X-Custom-Header" = "value" }
```
## Core Workflow

- Verify the baseline first with `prime eval run`. Keep the default save behavior; do not add `--skip-upload` unless the user explicitly requests that deviation:

```sh
prime eval run my-env -m openai/gpt-4.1-mini -n 50 -r 3 -s
```

- For v1 Taskset + Harness environments, confirm prompt-like fields are exposed in the saved state or task info before GEPA reflection; BYO Harness implementations may render richer trajectories than classic `MultiTurnEnv` examples.
- Run GEPA:

```sh
prime gepa run my-env -m openai/gpt-4.1-mini -M openai/gpt-4.1-mini -B 500 -n 100 -N 50
```

- Or run from a config (a hypothetical config sketch follows this list):

```sh
prime gepa run configs/gepa/qwen-3-5.toml
```

- Re-evaluate with the optimized prompt and compare against the baseline.
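
The config file schema is not documented in this skill, so the following is a hypothetical sketch only: every key name is an assumption that mirrors the CLI flags above, not a confirmed schema. Verify key names against your installed `prime` version before use.

```toml
# Hypothetical GEPA config; key names assume a 1:1 mapping to the CLI
# flags shown above and may differ in the real schema.
env = "my-env"                            # environment to optimize
model = "openai/gpt-4.1-mini"             # main model (-m)
reflection_model = "openai/gpt-4.1-mini"  # reflection model (-M)
max_calls = 500                           # optimization budget (-B)
num_train = 100                           # train split size (-n)
num_val = 50                              # validation split size (-N)
```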
## High-Value Settings

- `-B/--max-calls`: total optimization budget.
- `-n/--num-train` and `-N/--num-val`: train/validation split sizes.
- `--minibatch-size`: reflection granularity.
- `--perfect-score`: skip already-solved minibatches when the max score is known.
- `--state-columns`: include environment-specific context in reflection data.
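
These settings compose into a single invocation. A sketch, where the minibatch size, perfect score, and state column values are illustrative placeholders, and the argument format for `--perfect-score` and `--state-columns` is an assumption:

```sh
# Illustrative values only; budget and split sizes reuse the numbers above.
prime gepa run my-env \
  -m openai/gpt-4.1-mini -M openai/gpt-4.1-mini \
  -B 500 -n 100 -N 50 \
  --minibatch-size 4 \
  --perfect-score 1.0 \
  --state-columns answer
```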
## Output Artifacts

Expect and inspect:

- `best_prompt.txt`
- `pareto_frontier.jsonl`
- `metadata.json`
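
Before re-evaluating, the frontier file can be sanity-checked from the shell. A minimal sketch, assuming each JSONL record carries a numeric `score` field (field names may differ across GEPA versions):

```sh
# Count candidate prompts on the Pareto frontier.
wc -l < pareto_frontier.jsonl

# Print the highest-scoring record (assumes a numeric "score" field).
jq -s 'max_by(.score)' pareto_frontier.jsonl
```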
## Quality Rules

- Do not optimize on top of broken reward logic.
- For weak deterministic checks, fix rubric quality before GEPA tuning.
- Keep model, sampling, and dataset conditions stable during baseline-vs-GEPA comparison.
- Report limitations directly when feature gaps block requested optimization.
## Deliverable

Return:

- Baseline metrics.
- Optimized metrics.
- Prompt diff summary.
- Recommendation to adopt, iterate, or stop.
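
For the prompt diff summary, a plain unified diff against the baseline usually suffices. A sketch, assuming the baseline system prompt was saved to a local file (`baseline_prompt.txt` is a hypothetical name; GEPA itself only writes `best_prompt.txt`):

```sh
# Unified diff: baseline system prompt vs. GEPA's best candidate.
diff -u baseline_prompt.txt best_prompt.txt
```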