| name | autoresearch |
| description | Apply Karpathy's autoresearch loop (goal + mechanical fitness + mutable surface + keep-or-revert iteration) to ANY measurable workflow - code, content, sales, research, design, operations, not just ML or software. Use when the user asks to set up an overnight improvement loop, a keep-or-revert experiment workflow, iterative optimization, or asks "can I autoresearch this?". Includes a pre-loop triage that refuses fat-tailed, reflexive, or slow-feedback problems without adapting the mode. |
Autoresearch (generalized)
Karpathy's loop in one breath: pick a goal, define a mechanical fitness function, constrain a mutable surface, iterate keep-or-revert until budget or plateau. This skill ports it to any domain.
The hard-won lesson: most candidate workflows are not autoresearch-shaped. Running the loop on a fat-tailed, reflexive, or slow-feedback problem converges to a local optimum of the wrong thing - very efficiently. This skill's main job is to catch that before the loop starts. Saying "this isn't autoresearch-shaped, here's why, here's what to do instead" is the correct output when a candidate fails triage.
When this skill applies
Invoke when the user:
- wants to scaffold or run an iterative improvement loop on any workflow
- mentions Karpathy, autoresearch, keep-or-revert, overnight optimization, Tobi's Liquid, pi-autoresearch, "the Karpathy loop"
- asks "can I apply autoresearch to X"
- has a measurable goal and wants autonomous iteration
- says "overnight agent," "experiment loop," "fitness function," "val_bpb"
Do NOT invoke for:
- one-shot optimization (no iteration needed - just do the thing)
- pure subjective quality with no path to a mechanical metric and no willingness to build one
- safety-critical / irreversible actions where keep-or-revert cannot be automated
- architecture or strategy questions (use the council skill instead - "use autoresearch for plumbing, never for architecture")
Simplicity principles (Karpathy, verbatim)
Baked into every stage. Non-negotiable. Quoted directly from Karpathy's program.md:
-
One atomic change per iteration. Not five. Not "a batch of related tweaks." One. Revert is trivial; diagnosis is tractable.
-
Small mutable surface. Karpathy's original had exactly ONE file mutable (train.py) and one file read-only (prepare.py). If your surface is "the whole codebase" or "the whole prompt," the loop cannot converge. Scope down before looping.
-
Simple fitness function. One command. One number. Fixed corpus. Karpathy's was grep "^val_bpb:" run.log. If your fitness needs a 200-line harness, your fitness is wrong.
-
Simple loop driver. ~80 lines of bash, or one prompt that "LOOPs FOREVER" in a single session. Not a framework. Not microservices. Karpathy's entire system is train.py + prepare.py + results.tsv + program.md - four files.
-
The simplicity criterion (Karpathy's exact rule for keep-or-revert decisions):
"All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it. Conversely, removing something and getting equal or better results is a great outcome - that's a simplification win. When evaluating whether to keep a change, weigh the complexity cost against the improvement magnitude. A 0.001 val_bpb improvement that adds 20 lines of hacky code? Probably not worth it. A 0.001 val_bpb improvement from deleting code? Definitely keep. An improvement of ~0 but much simpler code? Keep."
Decision table:
| Fitness delta | Complexity delta | Decision |
|---|
| Small improvement | Added ugly code | Revert |
| Small improvement | Deleted code | Keep (simplification win) |
| ~0 | Much simpler | Keep (deletion alone is valuable) |
| Big improvement | Added code | Keep (paid for by the gain) |
| Big improvement | Deleted code | Definitely keep |
This is NOT a rule about change magnitude ("be bold"). It is a rule about the trade-off between complexity cost and fitness benefit. Simplifications always win, even at zero fitness improvement.
-
When stuck, escalate ambition - not iteration count. Karpathy: "If you run out of ideas, think harder - read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes." If timid tweaks plateau, the answer is BOLDER changes, not more timid ones. See references/anti-patterns.md #13.
-
Hard iteration cap + autonomous execution. The loop driver enforces a max iteration count (default 50) and a dollar budget. Within those limits, the loop runs AUTONOMOUSLY - never pausing to ask permission. Karpathy, verbatim:
"Once the experiment loop has begun, do NOT pause to ask the human if you should continue. Do NOT ask 'should I keep going?' or 'is this a good stopping point?'. The human might be asleep. You are autonomous. The loop runs until the human interrupts you, period."
The cap is a termination condition, not an "ask the human" event. The loop stops when: cap hit, budget exhausted, plateau detected (10 iterations without significance-threshold improvement), guard violation, or manual interrupt. Nothing else.
These are why Karpathy's autoresearch.py is ~630 lines, not 6300. Complexity is the enemy of convergence.
Five stages, in order
- Intake - what, how measured, what can change, what can't regress, budget
- Triage - is this problem autoresearch-shaped? (see
references/triage-checklist.md)
- Fitness design - build or validate the fitness function (see
references/fitness-design.md)
- Mode selection - pure / barbell / via negativa / inverted / human-in-loop (see
references/modes.md)
- Loop + post-mortem - scaffold, run, diagnose plateau if reached (see
references/anti-patterns.md when diagnosing)
Skip stages only when the user explicitly says "I've done that, start the loop."
Reference files map:
references/triage-checklist.md - Stage 2 (is this problem autoresearch-shaped)
references/fitness-design.md - Stage 3 (the 7 requirements a fitness function must meet)
references/modes.md - Stage 4 (pick one: pure / barbell / via negativa / inverted / human-in-loop)
references/anti-patterns.md - Stages 3 + 5 + 6 (14 named failure modes)
references/plateau-ideation.md - Stage 5/6 (what to propose when the loop plateaus)
Load each reference file on demand during the stage that needs it. Do not preload.
Stage 1: Intake
Ask all five in one structured message, not one at a time:
- Goal - what should be better after this runs? State as "before → after."
- Candidate metric - how would you know it's better? A number is ideal; a deterministic rubric is the fallback.
- Mutable surface - what files, prompts, config, copy, or parameters can the agent change? Be specific.
- Guard - what must not regress? Safety tests, brand rules, legal constraints, downstream dependencies.
- Budget - iterations, wall-clock, or dollars.
If the user cannot answer 2 or 3, continue anyway. Triage will surface it and Stage 3 will force a resolution.
Stage 2: Triage
Load references/triage-checklist.md. Walk the candidate through five dimensions and score each green / yellow / red:
- Feedback latency
- Metric mechanicality
- Tail shape (thin vs fat-tailed)
- Sample size vs noise
- Surface locality (atomic diff vs system redesign)
Decision rules:
- All green → pure optimizer mode, proceed to Stage 3
- Any yellow → mode adaptation required, proceed to Stage 3 with concerns noted
- Any red → either adapt the mode or refuse the loop and offer alternatives (council, human-in-loop, fix the upstream bottleneck first, smaller better-shaped sub-problem)
Refusal is a feature. Do not start the loop to be agreeable.
Stage 3: Fitness design
Load references/fitness-design.md. The fitness function is where autoresearch dies when it dies - not in the loop mechanics.
Required before the loop can start:
- Fitness is a command (or deterministic chain) that outputs a number
- It runs on a fixed corpus with fixed seeds (stationary across iterations)
- Baseline has been recorded to
baseline.json
- Significance threshold defined (what delta is real vs noise)
- Sanity check done: score known-good and known-bad by hand, confirm the metric agrees
- Guard metrics defined: what must not regress
If 1-4 are not met, the skill's answer is: "fix the fitness function first, then come back." This is the single highest-leverage intervention. Do not skip it to be helpful.
Stage 4: Mode selection
Load references/modes.md. Pick exactly one:
| Mode | When |
|---|
| Pure optimizer | Small mutable surface, fast feedback, clean metric, thin-tailed |
| Barbell | Fat-tailed, reflexive domain - 85% immutable proven baseline + 15% wild experiments |
| Via negativa | Fragile system - ask "what to kill" not "what to add" |
| Inverted | Imbalanced classes - optimize rejection of bad, not selection of good |
| Human-in-loop | Slow feedback or irreversible actions - agent proposes, human decides |
Record the mode in goal.md. It controls what the loop is allowed to propose.
Stage 5: Loop + post-mortem
Scaffold (once)
Create in the user's working directory (or a subdirectory they pick):
goal.md - filled from templates/goal.md: goal / metric / surface / guard / mode / budget
- The fitness command (script, notebook, evaluator - user's choice, not the skill's)
experiments/ directory for per-iteration logs
baseline.json with the starting metric value and sanity check results
loop-driver.sh and iteration-prompt.md from templates/ (for autonomous runs)
Drive the loop (autonomously)
Claude Code's interactive mode pauses between turns. To run the loop continuously without a human pressing enter each iteration, drive it from OUTSIDE the Claude Code session via a script. Each iteration is one fresh claude -p (non-interactive print mode) invocation that reads state from goal.md + experiments/, proposes one atomic change, runs fitness, commits or reverts, appends to journal.md, and exits.
Three options, in order of simplicity:
-
Bash + claude -p (recommended, simplest). Use templates/loop-driver.sh and templates/iteration-prompt.md. Run:
./loop-driver.sh --max-iter 50 --max-budget-usd 10
Uses Claude Code's -p / --print mode, --permission-mode acceptEdits for tool autonomy, --max-budget-usd for a built-in dollar cap. State lives in git + files, not session memory. Each iteration is a fresh session reading the journal.
-
Python + Anthropic SDK - what Karpathy's original autoresearch.py does. Use when you need programmatic state tracking, richer plateau detection, or parallel experiments. More code, more control.
-
Interactive / manual - user approves each iteration in a single Claude Code session. OK for N ≤ 5. Tedious above that.
Hard caps are mandatory. Every driver MUST enforce:
- Max iterations - default 50, hard cap. Raise only with a written reason in
goal.md.
- Dollar budget - via
claude -p --max-budget-usd.
- Wall-clock budget - via
timeout or driver-level check.
- Plateau detection - default 10 iterations without significance-threshold improvement → stop.
- Guard violation - immediate stop, no retries.
Any trigger writes experiments/STOP with a one-line reason and exits. Stage 6 post-mortem reads this file and produces the plateau diagnosis.
What does NOT work: Claude Code hooks (Stop, PostToolUse, UserPromptSubmit, SubagentStop). Hooks REACT to Claude's actions - they cannot inject new prompts that drive iteration. The loop driver MUST live outside the Claude Code session.
Run the loop
- Read state: git log,
experiments/, last N lessons learned
- Propose ONE focused change, respecting the mode
- Commit the proposal with an
experiment: prefix BEFORE running (so git can revert)
- Run the fitness command
- Compare to baseline (or best-so-far) with the significance threshold
- Keep (advance baseline) or revert (
git revert)
- Append a one-line lesson to
experiments/journal.md
- Check termination: budget exhausted, plateau detected (10 iterations without improvement), guard violated → Stage 6
Non-negotiable principles:
- One change per iteration
- Mechanical verification only (no "looks good")
- Automatic rollback on regression or guard failure
- Git is memory
Post-mortem (when the loop stops)
Produce:
- Final metric vs baseline, with significance
- Top 3 changes that stuck, top 3 changes that were reverted
- Plateau diagnosis if plateaued - consult
references/anti-patterns.md and name one:
- Wrong search space → change the mutable surface
- Wrong metric → fitness does not correlate with the real goal
- Broken measurement → sample size, stationarity, or confounded labels
- Genuine local optimum → switch mode to barbell
- Out of budget → more iterations would help
Before declaring plateau, run the ideation workflow in references/plateau-ideation.md. This is mandatory, not optional. Specifically:
- Mine the last 10 reverts for pattern (axis + failure mode for each). Tradeoff reverts (primary lifted, guard broke) are especially rich - they reveal axes with slack that need compensating moves elsewhere.
- Run the taxonomy coverage check. If fewer than ~6 of 8 axes have been touched with substantive changes, the loop is NOT plateaued - it has exhausted one corner of the search space. Propose on an untouched axis and continue.
- Only when coverage is satisfied AND revert mining surfaces no untouched direction: invoke
/council with the three structured prompts from plateau-ideation.md Move 3.
- Only when all three are exhausted: declare plateau and pick a cause above.
"I ran out of ideas" is anti-pattern #13 (timid tweaks). The plateau-ideation workflow is how the skill converts that into a concrete next move instead of a stop.
The plateau diagnosis IS the deliverable when the loop doesn't improve. Foreground it, don't bury it.
Escalation: invoke /council on a plateaued loop
When the loop plateaus and you own the mutable surface, the mode, and the fitness function, the failure is usually invisible from inside the loop - wrong sample size, unverified compliance, a metric that doesn't discriminate, confounded labels, or a mutable surface that no longer holds leverage. These are exactly the findings a multi-perspective review delivers cheaply in one pass.
Before switching modes or mutable surfaces, spend a single /council call. Canonical prompt:
Goal, fitness, baseline, significance threshold, mode, last 10 journal lines - the loop has plateaued. What is the loop missing that an outsider would notice?
A council that agrees with your reading tells you the loop is genuinely stuck on the right problem. A council that disagrees usually names the specific failure mode from references/anti-patterns.md - which is exactly the plateau diagnosis you were about to struggle to produce alone.
This is different from anti-pattern #10 ("use council for architecture, not tactics"). Council here is a second-opinion tool on the loop itself, not an architecture substitute. On a real autoresearch run, a plateau-council review produced the two highest-leverage fixes (enlarge sample size, verify compliance) that the loop couldn't see from inside.
Style rules for this skill
- Declarative: tell the user WHAT to check, not exact commands. The agent picks the command.
- Refuse gracefully and suggest alternatives when a candidate fails triage
- Keep the loop boring - creativity lives in triage and mode selection, not loop mechanics
- Git is the experiment log;
journal.md is the narrative
- Use plain
- hyphens, not em dashes