com um clique
synthesize-task
// Synthesize a new general-agent task family from a seed task, evolving through difficulty tiers with empirical pass-rate gating. Use when asked to create new tasks or grow the task set.
// Synthesize a new general-agent task family from a seed task, evolving through difficulty tiers with empirical pass-rate gating. Use when asked to create new tasks or grow the task set.
Push all local environments to the Prime Intellect Environments Hub. Use when environments are out of sync and need to be published.
Search the web via the Exa API. Accepts up to 10 queries in parallel. Returns titles, URLs, and highlighted snippets from each result.
Search Google via the Serper API. Accepts up to 10 queries in parallel. Returns titles, URLs, snippets, and knowledge-graph data.
Replace a unique string in a file. old_str must appear exactly once in the file.
Diagnose and repair-or-remove broken tasks under environments/general_agent/tasks. Use after `general-agent validate` reports [FAIL] entries (post-synthesis, after a refactor, or on a periodic sweep).
Fetch a webpage via Exa and return an LLM-generated summary. Accepts an optional query that steers what the summary focuses on.
| name | synthesize-task |
| description | Synthesize a new general-agent task family from a seed task, evolving through difficulty tiers with empirical pass-rate gating. Use when asked to create new tasks or grow the task set. |
Create a new task family for the general-agent environment. A task family is a sequence of task instances (tier 0 → tier N) in the same task family, where each tier is a superset of the previous: higher tiers may extend the DB schema with new entities and relationships, add new tools, and introduce harder constraints. Tier i+1 should be strictly more complex than tier i.
The overarching goal is to evolve a trivial task into an ultra-challenging one. Tier 0 should be dead simple (1-3 tool calls, small DB). By tier 4, the task should push even the strongest models to near-zero pass rates through a combination of large-scale data, layered constraints, and multi-step reasoning.
Each task instance is a directory under tasks/. Naming convention: <task>_t<tier> (e.g. hotel_booking_t0, hotel_booking_t1). The tier 0 seed is always <task>_t0.
<task>_t0/
├── task.toml # [metadata] name, description, tier, parent, difficulty_methods, [[pass_rates]]
├── instruction.md # agent prompt (what to do)
├── db.json # initial database state (Pydantic-serialized)
├── tools.py # TaskDB(DB), TaskTools(Tools), verify(db) — schema + tools + verifier
├── gold.json # gold tool-call chain: [["tool_name", {kwargs}], ...]
└── gen_db.py # (tier 2+ only) script that generates db.json — kept for reproducibility
The key abstractions (defined in general_agent/tools.py):
db.json. The agent reads/writes this through tools.@tool-decorated methods that read/write self.db.tools.py that checks whether the task goal is satisfied. Returns 1.0 on success, 0.0 on failure. This is REQUIRED — it allows scoring alternative valid solutions beyond the exact gold path.self (with self.db) pre-bound.Example task tools.py:
from general_agent.tools import DB, Tools, tool
from pydantic import BaseModel
class Order(BaseModel):
id: str
customer: str
total: float
status: str = "pending"
class TaskDB(DB):
orders: list[Order] = []
class TaskTools(Tools):
db: TaskDB
@tool
def get_order(self, order_id: str) -> dict:
"""Look up an order by ID.
Args:
order_id: The order ID.
"""
for o in self.db.orders:
if o.id == order_id:
return o.model_dump()
raise ValueError(f"Order {order_id} not found")
@tool
def cancel_order(self, order_id: str) -> str:
"""Cancel an order.
Args:
order_id: The order ID to cancel.
"""
for o in self.db.orders:
if o.id == order_id:
o.status = "cancelled"
return f"Order {order_id} cancelled"
raise ValueError(f"Order {order_id} not found")
def verify(db: TaskDB) -> float:
"""Check whether the task goal is satisfied.
REQUIRED for every task. Must return 1.0 on success, 0.0 on failure.
Should check the goal semantically, not just match the gold solution exactly.
For example: "order ORD-001 is cancelled" rather than "DB matches gold hash".
"""
order = next((o for o in db.orders if o.id == "ORD-001"), None)
if order is None:
return 0.0
return 1.0 if order.status == "cancelled" else 0.0
IMPORTANT: Every tools.py MUST define a verify(db: TaskDB) -> float function. This function checks the task goal semantically — it should accept ANY valid solution, not just the exact gold path. The rubric uses verify(db) as a fallback when the agent's solution differs from the gold but is still correct.
First, list every existing family so you don't collide with one:
general-agent list
Pick a family root name (the prefix before _t<N>, e.g. food_truck, hotel_booking). The following are FORBIDDEN — every root that already appears in general-agent list is taken, and the rubric rejects any rollout whose manifest collides with an existing family. Your root name MUST differ from every existing one — not just a suffix variant (food_truck_2, food_truck_v2), a different root word entirely. If unsure, pick a more specific niche (crepe_stand instead of food_truck).
Good tasks have:
Commit to the name before doing any other work. Write the chosen root name to a file so the harness can audit it:
mkdir -p /workspace/.synthesizer
echo "<root>" > /workspace/.synthesizer/family_name.txt
GATE 0a: the chosen root does not collide with any existing family. This must pass before you write any code:
root=$(cat /workspace/.synthesizer/family_name.txt)
if general-agent list | awk '{print $1}' | grep -E "^${root}(_t[0-9]+)?$" >/dev/null; then
echo "COLLISION: '${root}' is already a family — pick a different root" >&2
exit 1
fi
echo "OK: ${root} is available"
If this fails, pick a new root, overwrite family_name.txt, and re-run the check. Do NOT proceed to writing tools.py until it passes.
Now design the Pydantic DB schema and write tools.py with TaskDB(DB) and TaskTools(Tools).
GATE 0b: tools.py loads cleanly:
python -c "
from general_agent.utils import load_attr
from pathlib import Path
p = Path('tasks/<name>/tools.py')
assert load_attr(p, 'TaskDB') is not None, 'TaskDB not found'
assert load_attr(p, 'TaskTools') is not None, 'TaskTools not found'
assert load_attr(p, 'verify') is not None, 'verify function not found'
print('OK')
"
Create the simplest useful tier for this task:
db.json with realistic seed data (5-20 entities). At tier 0, hand-writing the JSON is fine.instruction.md — the task prompt, written as a natural user request (conversational tone, one or two paragraphs, no bullet points or numbered steps). The agent should figure out the steps from the description, not be told them explicitly.gold.json — the gold tool-call chaintask.toml with tier = 0The seed should require 1-3 tool calls and be easily solvable by any capable model.
IMPORTANT: Instructions must read like something a real user would type — not structured task specs. Bad: "Step 1: call get_recipe. Step 2: call check_pantry." Good: "I want to cook pasta tonight, can you check what ingredients I'm missing and add them to my shopping list?"
Dates must be unambiguous. If the gold solution passes a date string to a tool (e.g. date="2026-05-15"), the instruction must include the same explicit year (e.g. "for May 15, 2026" or "on 2026-05-15"). Never write dates as "May 15" / "May 15th" / "next Tuesday" and expect the agent to infer the year — solver models with pre-2026 pretraining cutoffs will default to 2025 (or whatever their cutoff implies) and the rollout will fail verify(db) despite otherwise solving the task. This was a corpus-wide failure mode in the GLM-5.1-FP8 RLM eval; ~30% of single-tier zero-pass-rate tasks were caused by it. Same rule applies to anything else that depends on "today's date" — make it explicit in the instruction.
GATE 1: Gold solution validates:
general-agent validate <name>
Run the task against the solver model with exactly 20 rollouts (-r 20) to get a reliable difficulty estimate:
vf-eval general-agent-solver-local --disable-env-server -c -1 -b <solver_base_url> -k <solver_api_key_var> -m <solver_model> -n 1 -r 20 -d -a '{"task":"<name>"}'
You MUST use -r 20. Fewer rollouts give unreliable estimates — do not reduce this number.
Note: This command can take a few minutes. Set the command timeout high enough to avoid premature termination.
GATE 2: avg reward must be ≥ 0.80. If not, the seed is too hard — simplify the instruction or DB, then re-run. Do NOT proceed to tier 1 until this gate passes.
This gate is critical. If the solver command fails (e.g. API errors), fix the issue and retry. Do NOT skip pass-rate gating — tasks without empirical difficulty validation are worthless.
Record the pass rate (the reward: avg line from the output) in task.toml
as an entry in the [[metadata.pass_rates]] array of tables:
[metadata]
name = "<task>_t0"
description = "..."
tier = 0
parent = ""
difficulty_methods = []
[[metadata.pass_rates]]
solver = "local" # solver type: local, opencode, rlm
model = "openai/gpt-5-mini"
k = 20 # number of rollouts (must be 20)
value = 0.95 # mean reward from the empirical pass-rate check
Each measurement is keyed on (solver, model, k); multiple entries are
allowed if the task has been measured under more than one solver/model.
For each tier k = 1, 2, 3, 4:
Copy the full previous tier directory to tasks/<task>_t<k>/ as a starting point. Use cp -rT tasks/<task>_t<k-1> tasks/<task>_t<k> — the -T flag treats the destination as a literal name and prevents the footgun where cp -r src dst/ nests src inside an already-existing dst/ (producing e.g. tasks/<task>_t4/<task>_t3/…). After copying, verify with ls tasks/<task>_t<k>/ — expect exactly db.json, gold.json, instruction.md, task.toml, tools.py and no nested tier directory.
Extend the task — each tier should be a superset of the previous. You are encouraged to:
Add at least one constraint from this menu (pick the most natural for the task):
instruction.md — the kind a real user would make when typing quickly (e.g. "restraunt" for "restaurant", "teh" for "the", swapped letters, missing words). The agent must parse intent despite surface noise. Don't overdo it — 2-5 typos per instruction is enough to add friction without making it unreadable.Update instruction.md — weave the new constraint into the natural language request. Do NOT list steps or mention tool names. The instruction should read like a user message, not a spec.
Update db.json if needed (more entities, edge cases).
db.json programmatically using a Python script (gen_db.py) placed in the task directory. The DB should contain hundreds or thousands of entities to force the agent to search, filter, and reason over large datasets. Use random.seed(42) for reproducibility. The script writes db.json to the same directory. Keep gen_db.py in the task directory — it is not part of the task runtime, but it must be committed for reproducibility so anyone can re-generate the DB. Example: a tier-3 hotel booking task might generate 500 hotels across 50 cities, 2000 room listings, and 300 guest profiles, so the agent must filter and cross-reference rather than eyeball the data.Write gold.json — the gold solution MUST require strictly more tool calls than tier k-1
Write task.toml, recording which difficulty method(s) you used and the empirical pass rate. The difficulty_methods and [[metadata.pass_rates]] entries are required metadata for every tier (including tier 0). They document how each tier was made harder and how hard it actually is empirically.
[metadata]
name = "<task>_t<k>"
description = "..."
tier = <k>
parent = "<task>_t<k-1>" # or "<task>_t0" for k=1
difficulty_methods = ["cross_entity_coupling", "conditional_rules"] # mechanisms used to increase difficulty vs parent
[[metadata.pass_rates]]
solver = "local"
model = "openai/gpt-5-mini"
k = 20
value = 0.55 # avg@20 from empirical pass-rate check (GATE 3b)
Diversity requirement: across the entire task family (tiers 0-4), you must use at least 5 unique difficulty methods. Don't reuse the same method in consecutive tiers. The full set is: multi_step_reasoning, conditional_rules, cross_entity_coupling, stricter_thresholds, larger_db, ambiguity_resolution, tool_proliferation, schema_extension, noisy_instructions.
GATE 3a: Gold solution validates:
general-agent validate <task>_t<k>
GATE 3b: Empirical pass-rate is in the target band (exactly 50 rollouts):
vf-eval general-agent-solver-local --disable-env-server -c -1 -b <solver_base_url> -k <solver_api_key_var> -m <solver_model> -n 1 -r 20 -d -a '{"task":"<task>_t<k>"}'
Target bands per tier:
If pass rate is above the band: constraint too weak — tighten it or add another. If pass rate is below the band: constraint too hard — relax it or simplify the DB.
Do NOT proceed to tier k+1 until tier k's gate passes. Do NOT skip this check — if the solver command fails, fix and retry. A task family where the solver gets 100% on all tiers is useless for training.
Lint and type-check all created task files:
ruff check tasks/<task>_t*/tools.py
ruff format --check tasks/<task>_t*/tools.py
Fix any issues before proceeding.
Run all tiers for the task you just created:
general-agent validate <task>
vf-eval general-agent-solver-local --disable-env-server -c -1 -b <solver_base_url> -k <solver_api_key_var> -m <solver_model> -n -1 -r 1 -d -v -a '{"task":"<task>"}'
Verify:
This is a high-tier trip-planning task with cross-entity coupling, conditional budget rules, and multi-day constraints:
I'm planning a three-day trip starting from Hangzhou, and I need help creating an itinerary from October 1st to October 3rd, 2025. A few important requirements: I don't want to repeat any cities, hotels, attractions, or restaurants during the entire trip. Also, please make sure that every hotel, restaurant, and attraction you recommend is actually located in the city where I'll be staying that day. One more thing about the second day - I'm trying to be smart about my budget. If I end up booking a luxury hotel that costs 800 CNY or more per night, then I need to be more careful with other expenses: my total spending on both restaurants (lunch and dinner) should stay under 350 CNY, both restaurants should be rated at least 4.0 stars, and the afternoon attraction ticket needs to be less than 120 CNY. If the hotel on day 2 is in the mid-to-high range (500-800 CNY), then I have a bit more flexibility - I just need to make sure at least one of my restaurant choices is rated 4.0 or higher, and the attraction ticket should be below 180 CNY. For more affordable hotels (200-500 CNY range), I only need to ensure that at least one restaurant has a rating of 3.2 or above. Can you help me put together this itinerary?
Notice how this single prompt layers multiple difficulty levers:
A tier-0 seed for this task would be: "Book a hotel in Hangzhou for October 1st." Each subsequent tier adds one more constraint.