com um clique
cleanup-tasks
// Diagnose and repair-or-remove broken tasks under environments/general_agent/tasks. Use after `general-agent validate` reports [FAIL] entries (post-synthesis, after a refactor, or on a periodic sweep).
// Diagnose and repair-or-remove broken tasks under environments/general_agent/tasks. Use after `general-agent validate` reports [FAIL] entries (post-synthesis, after a refactor, or on a periodic sweep).
Push all local environments to the Prime Intellect Environments Hub. Use when environments are out of sync and need to be published.
Search the web via the Exa API. Accepts up to 10 queries in parallel. Returns titles, URLs, and highlighted snippets from each result.
Search Google via the Serper API. Accepts up to 10 queries in parallel. Returns titles, URLs, snippets, and knowledge-graph data.
Replace a unique string in a file. old_str must appear exactly once in the file.
Synthesize a new general-agent task family from a seed task, evolving through difficulty tiers with empirical pass-rate gating. Use when asked to create new tasks or grow the task set.
Fetch a webpage via Exa and return an LLM-generated summary. Accepts an optional query that steers what the summary focuses on.
| name | cleanup-tasks |
| description | Diagnose and repair-or-remove broken tasks under environments/general_agent/tasks. Use after `general-agent validate` reports [FAIL] entries (post-synthesis, after a refactor, or on a periodic sweep). |
general-agent validate catches every task whose gold.json can't be replayed on its db.json, or whose verify(gold_db) != 1.0. Most failures are shallow data bugs that a targeted fix can salvage. A minority need deeper rewrites — delete those and move on.
# 1. Snapshot the current failures
uv run general-agent validate 2>&1 | grep '\[FAIL\]' > /tmp/fails.txt
uv run general-agent validate --fail-only > /tmp/fails_only.txt
wc -l /tmp/fails_only.txt
# 2. Group by error class (most common → rarest)
grep '\[FAIL\]' /tmp/fails.txt \
| awk -F'— \\[FAIL\\] ' '{print $2}' \
| sort | uniq -c | sort -rn
# 3. Run the diagnose helper on the failing set — prints one line per task
# with gold trace length, verify score, and the first missing ref (if any)
uv run python environments/general_agent/skills/cleanup-tasks/diagnose.py $(cat /tmp/fails_only.txt)
# 4. Triage: fix the shallow ones, delete the rest (see heuristics below)
# 5. Re-validate — should be empty
uv run general-agent validate --fail-only
gold replay error: Expecting property name enclosed in double quotesJSON syntax error in db.json. Always fixable. Run python3 -m json.tool < tasks/<name>/db.json; the tool points at the bad line. Usually a missing opening quote (id": "X" instead of "id": "X") or a stray comma.
gold replay error: <Entity> <id> not foundgold.json references an entity not in db.json. Three sub-cases:
# Copy an existing entry as a template and add the missing id
db['fish'].append({"id": "fish-dwarf-gourami", **realistic_fields})
Good when ≤2 entities need to be added and the schema is well understood (one existing record covers it).TEAM-PHX while db has TEAM-001..TEAM-016. Always delete; rewriting either side is a full rescope.gold replay error: Unknown tool: <name>Gold was written against an older/different tools.py. Delete unless the missing tool is a rename of something still present (rare).
gold solution did not change DBGold is a no-op. Always a task-authoring bug. Delete.
verify(gold_db) = X, expected 1.0The gold replays cleanly but the task's own verify() function rejects the resulting state.
embassy_t4 had target_application_ids = ['VA-001','VA-002','VA-003','VA-004'] but the instruction + gold only cover three applicants; remove VA-004.[WARN] missing verify()Not counted as failing by default, but these tasks can only be scored via DB-hash match (which requires the agent to replay the gold exactly). Either add a verify() to tools.py or accept the limitation — they won't block --fail-only.
Task definitions are recoverable via git log -- environments/general_agent/tasks/<name> — deletion is not permanent if someone later wants to repair one.
| Symptom | Fix |
|---|---|
| Bad quote in db.json | Edit the one line |
| ≤2 entity ids missing | Append realistic records to db.json using an existing one as template |
| Stale target id list | Trim to match what gold covers (usually matches the instruction) |
| Budget < gold trace total | Bump budget in db.json |
After triage, if you want to drop whole families when any tier fails:
uv run general-agent validate --fail-only \
| awk -F'_t[0-9]+' '{print $1}' | sort -u \
| while read fam; do
rm -rf environments/general_agent/tasks/"$fam"_t*
done
But this is aggressive — typical rate is ~3.5× the tasks lost vs. deleting only the failing tiers.
DBAssertRubric.db_hash / verify) and logs a warning per bad task.~/.cache/general-agent/rlm-skills/<task_name>/ to force regeneration — that cache is task-tool-agnostic and doesn't need invalidation when you edit db.json.