iterate
Generalized recursive iteration loop. Runs parallel sub-agents against a target, scores deterministically, diagnoses instruction gaps, applies fixes, and recurses until the stop condition is met or max depth is reached.
| name | iterate |
| description | Generalized recursive iteration loop. Runs parallel sub-agents against a target, scores deterministically, diagnoses instruction gaps, applies fixes, and recurses until the stop condition is met or max depth is reached. |
| argument-hint | <target> [config] [--max-depth N] [--instances N] [--keep] |
Read @.claude/reference/iteration/recursive-training.md first. You are Level 0.
This skill is a generalized iteration engine. It works with ANY target that has an EVAL.md and a verify.sh producing RESULT: PASS or RESULT: FAIL.
iterate(target, config, stop_condition, depth=0, max_depth=N):
    results = run_parallel(target, config, instances)
    if stop_condition(results) or depth >= max_depth:
        return results  # BASE CASE
    findings = diagnose(results)
    config' = apply_fixes(config, findings)
    return iterate(target, config', stop_condition, depth+1, max_depth)
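The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the skill's actual implementation: the `Results` and `Config` aliases are hypothetical stand-ins, the four callables are injected so the loop can be exercised without real sandboxes, and the `instances` argument is assumed to be folded into `run_parallel`.

```python
from typing import Callable, Dict, List

Results = List[str]      # assumption: one "PASS"/"FAIL" verdict per instance
Config = Dict[str, str]  # hypothetical stand-in for a .claude/ config variant


def iterate(
    target: str,
    config: Config,
    run_parallel: Callable[[str, Config], Results],
    stop_condition: Callable[[Results], bool],
    diagnose: Callable[[Results], List[str]],
    apply_fixes: Callable[[Config, List[str]], Config],
    depth: int = 0,
    max_depth: int = 5,
) -> Results:
    """Recurse until the stop condition holds or max_depth is reached."""
    results = run_parallel(target, config)
    if stop_condition(results) or depth >= max_depth:
        return results  # base case
    findings = diagnose(results)
    new_config = apply_fixes(config, findings)
    return iterate(
        target, new_config, run_parallel, stop_condition,
        diagnose, apply_fixes, depth + 1, max_depth,
    )
```

Because the base case is checked after the run, a call with `max_depth=5` performs at most six runs (depths 0 through 5), matching the pseudocode.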
Inputs:
- target: a directory in evals/ with EVAL.md, prompt.md, verify.sh
- config: a .claude/ directory variant in evals/<target>/configs/
- instances: how many parallel sandboxes to run (default 3)
- max_depth: maximum iterations before forced stop (default 5)

Stop condition (deterministic):
All instances produce RESULT: PASS in the same iteration.
This is the base case. No subjective judgment — pass/fail comes from verify.sh.
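As an illustration of how deterministic this check can be, the sketch below derives the verdict purely from verify.sh output. The function names are hypothetical, and two behaviours are assumptions rather than documented facts: the last `RESULT:` line wins if several appear, and output with no `RESULT:` line counts as not passing.

```python
import re
from typing import Iterable, Optional

# Matches a line that is exactly "RESULT: PASS" or "RESULT: FAIL".
RESULT_LINE = re.compile(r"^RESULT: (PASS|FAIL)\s*$", re.MULTILINE)


def parse_result(verify_output: str) -> Optional[str]:
    """Return "PASS" or "FAIL" from the last RESULT line, or None if absent."""
    matches = RESULT_LINE.findall(verify_output)
    return matches[-1] if matches else None


def all_pass(outputs: Iterable[str]) -> bool:
    """True only if there is at least one output and every one reports PASS."""
    verdicts = [parse_result(output) for output in outputs]
    return bool(verdicts) and all(v == "PASS" for v in verdicts)
```

Treating an empty set of outputs as a failure avoids declaring convergence when no instance actually ran.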
Output:
The improved target/.claude/ directory and a results summary.
evals/. Read EVAL.md to confirm.
Do NOT launch agents until the user answers.
Use the existing tools at each step — do not reimplement.
session_id = uuid4().hex[:8]  # Correlates all iterations in this /iterate invocation
depth = 0
RECURSE:
if depth >= max_depth:
    EMIT: apc_log("INFO", "iteration_session_complete", "Max depth reached",
        {"session_id": session_id, "final_depth": depth, "converged": false,
         "total_cost_usd": <sum from all runs>, "total_duration_ms": <wall clock>})
    RETURN with summary: "Max depth reached. Best results: ..."
EMIT: apc_log("INFO", "iteration_started", "Starting iteration",
    {"depth": depth, "max_depth": max_depth, "target": target,
     "config": config, "session_id": session_id, "instances": N})
1. CLEAN
python3 scripts/cleanup.py
2. PREPARE STIMULI (if any)
Stimuli get injected via --stimuli-dir.
3. LAUNCH
python3 scripts/parallel.py <target> <config> \
--instances N [--stimuli-dir <path>] [--keep]
Collect run IDs from stdout.
For A/B testing configs within iteration:
--configs baseline,experimental
For model comparison within iteration:
--models haiku,sonnet
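If the launch is driven from Python rather than typed by hand, the basic command line above could be assembled like this. `build_launch_command` is a hypothetical helper, not part of the repository. It covers only the flags shown in the basic form and leaves out `--configs` and `--models`, because how those combine with the positional config argument is not specified here.

```python
from typing import List, Optional


def build_launch_command(
    target: str,
    config: str,
    instances: int = 3,
    stimuli_dir: Optional[str] = None,
    keep: bool = False,
) -> List[str]:
    """Assemble the argv list for scripts/parallel.py without running it."""
    cmd = [
        "python3", "scripts/parallel.py", target, config,
        "--instances", str(instances),
    ]
    if stimuli_dir is not None:
        cmd += ["--stimuli-dir", stimuli_dir]
    if keep:
        cmd.append("--keep")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)`. The run IDs would then be read from the captured stdout, whose exact format is not described in this document.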
4. MONITOR (every 60s)
python3 scripts/dashboard.py <run_id> --summary
Build a monitoring table. Stop early if an agent is stuck.
5. SCORE
python3 scripts/report.py --score <run_id>
python3 scripts/report.py <run_id1> <run_id2> ... --group-by config
For two-run comparison:
python3 scripts/report.py --compare <id1> <id2>
6. REGRESSION CHECK
If this is not the first iteration, compare each new run against
the corresponding run from the previous iteration:
/compare <prev_iteration_run_id> <this_iteration_run_id>
The /compare sub-agent reads both runs' events, transcripts, and
produced artifacts, and writes a markdown summary of what changed.
Read its output to determine whether this iteration improved on
the last one or regressed.
7. STOP CONDITION CHECK
if all instances PASS and no regressions:
    EMIT: apc_log("INFO", "iteration_complete", "Converged",
        {"depth": depth, "session_id": session_id, "converged": true, "pass_rate": "N/N"})
    EMIT: apc_log("INFO", "iteration_session_complete", "Session converged",
        {"session_id": session_id, "final_depth": depth, "converged": true,
         "total_cost_usd": <sum>, "total_duration_ms": <wall clock>})
    RETURN with summary: "Converged at depth {depth}."
8. OBSERVE (mandatory before diagnosis)
For EACH failing instance:
a. Read tool trace: extract tool sequence from events.jsonl
b. Read produced code: ls results/<run_id>/produced/
c. Read verify.sh output: check verification_output event in events.jsonl
d. If config snapshot exists: python3 scripts/dashboard.py --diff <prev_id> <this_id>
If observation doesn't reveal the cause, escalate:
e. Diff produced code against the known-good solution (if available)
to find where the agent diverged
f. Isolate variables — re-run the same challenge with a different
prompt variant or stripped config to identify which factor matters
g. Check token usage — if the agent used most of its budget,
instructions may be getting compacted away
h. Find the first wrong decision in the tool trace and work forward —
what did the agent do instead of what it should have done?
Do NOT skip this step. Diagnosis without observation leads to wrong fixes.
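Step (a) of the observation pass could be sketched as below. The schema of events.jsonl is not documented here, so the field names are assumptions: each line is taken to be a JSON object, tool calls are assumed to carry `"type": "tool_use"`, and the tool name is assumed to live under `"tool"`. Both are parameters so they can be adjusted to the real format.

```python
import json
from typing import Iterable, List


def tool_sequence(
    lines: Iterable[str],
    event_type: str = "tool_use",  # assumed event type for tool calls
    name_field: str = "tool",      # assumed field holding the tool name
) -> List[str]:
    """Return tool names in order of appearance, skipping unusable lines."""
    sequence: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated or corrupt line
        if (
            isinstance(event, dict)
            and event.get("type") == event_type
            and name_field in event
        ):
            sequence.append(str(event[name_field]))
    return sequence
```

Passing an open file object works directly, since iterating a file yields its lines. Reading the sequence from the start makes it easier to spot the first wrong decision described in step (h).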
8.5 REVIEW (when known-good fix exists)
If the challenge has a fix.diff file, launch the reviewer agent:
Agent(reviewer, "Review run {run_id}:
results_dir=evals/{eval}/results/{run_id}
fix_diff=evals/{eval}/challenges/{challenge}/fix.diff
prompt=evals/{eval}/challenges/{challenge}/prompt.md")
The reviewer compares the agent's approach to the known-good fix and
classifies the debugging failure pattern. Its JSON output feeds directly
into the findings table in step 9.
If no fix.diff exists, skip this step and diagnose manually from observation.
9. DIAGNOSE (gate: findings table required)
Produce a findings table with one row per failure. Every row MUST have:
| Instance | What failed | Evidence (file:line or event) | Fix | Level | Component |
A row without evidence in the "Evidence" column is not diagnosed — go back to OBSERVE.
If a reviewer report exists, use its output to populate the table.
If uncertain about Level classification, ask the human.
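The evidence gate can be made mechanical. The sketch below is one possible shape, with a hypothetical `Finding` dataclass whose fields mirror the six columns above. It refuses to render the table while any row lacks evidence.

```python
from dataclasses import dataclass
from typing import List

HEADER = "| Instance | What failed | Evidence (file:line or event) | Fix | Level | Component |"


@dataclass
class Finding:
    instance: str
    what_failed: str
    evidence: str
    fix: str
    level: str
    component: str


def undiagnosed(findings: List[Finding]) -> List[Finding]:
    """Rows with an empty Evidence column are not diagnosed yet."""
    return [f for f in findings if not f.evidence.strip()]


def render_table(findings: List[Finding]) -> str:
    """Render the findings table, or raise if any row lacks evidence."""
    missing = undiagnosed(findings)
    if missing:
        names = ", ".join(f.instance for f in missing)
        raise ValueError(f"no evidence for: {names}; go back to OBSERVE")
    rows = [HEADER, "|---|---|---|---|---|---|"]
    for f in findings:
        rows.append(
            f"| {f.instance} | {f.what_failed} | {f.evidence} "
            f"| {f.fix} | {f.level} | {f.component} |"
        )
    return "\n".join(rows)
```

Raising instead of silently dropping the row keeps the gate strict: a failure without evidence blocks the diagnosis rather than disappearing from it.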
**Component classification:** For each fix, consult @.claude/reference/components/decision-tree.md
to determine the right component. Walk the tree from the top:
- Is it mechanical (same answer every time)? → Permission or Hook
- Does it need enforcement (cannot be ignored)? → settings.json deny rule
- Does it need custom logic? → PreToolUse hook
- Does it require judgment? → CLAUDE.md, rule, skill, or reference
The Component column must name the specific target: `CLAUDE.md`, `rules/<name>.md`,
`settings.json (permissions.deny)`, `hook (PreToolUse)`, `skill`, etc.
Do not default to CLAUDE.md — let the tree decide.
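A simplified reading of the four questions above, written as a function. This is only an illustration: the authoritative tree lives in the referenced decision-tree.md. Treating the enforcement and custom-logic questions as sub-branches of the mechanical case, checked in the order listed, is an assumption.

```python
def classify_component(
    mechanical: bool,
    needs_enforcement: bool = False,
    needs_custom_logic: bool = False,
) -> str:
    """Walk the questions top-down and return the first matching component."""
    if mechanical:
        if needs_enforcement:
            return "settings.json (permissions.deny)"
        if needs_custom_logic:
            return "hook (PreToolUse)"
        return "permission or hook"
    # Not mechanical: the fix requires judgment.
    return "CLAUDE.md, rule, skill, or reference"
```

The judgment branch deliberately returns the whole family rather than a single file, since choosing among CLAUDE.md, a rule, a skill, or a reference still needs the full tree.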
EMIT: apc_log("INFO", "iteration_diagnosed", "Diagnosis complete",
{"depth": depth, "session_id": session_id,
"findings_count": <N>, "findings_summary": "brief description"})
10. APPLY FIXES
Level 2 → target's .claude/
Level 0 → agent-spec (selective)
After each fix, consistency-check all .claude/ files at that level.
Apply each fix to the component specified in the findings table.
EMIT: apc_log("INFO", "iteration_fixed", "Fixes applied",
{"depth": depth, "session_id": session_id,
"files_changed": ["path/to/file1", "path/to/file2"]})
EMIT: apc_log("INFO", "iteration_complete", "Iteration done",
{"depth": depth, "session_id": session_id, "converged": false,
"pass_rate": "M/N", "duration_ms": <elapsed>})
11. depth += 1
GOTO RECURSE
See @.claude/reference/iteration/recursive-training.md Guard 2 for the full table.
Two heuristics:
Every Level 2 fix must work for ANY stimulus, not just the one that failed. If a fix only helps for one specific input, it is overfitting.
Every Level 0 fix must work for ANY target. If it only helps one project, it belongs in target config.
After the loop terminates (converged or max depth), write results/tuning-handoff.md: