| name | autoresearch |
| description | Run an iterative, measured optimization loop — try an idea, benchmark it, keep what improves the metric, revert what doesn't, repeat — leaving an auditable trail of commits. Use this when the user wants to ITERATIVELY optimize a measurable target and expects to try many attempts: speeding up a test suite or build, shrinking a bundle/binary, cutting latency or memory, tuning hyperparameters to lower a loss, raising a Lighthouse/benchmark score. Strong fit when the metric is NOISY (needs repeated runs to trust), correctness must be preserved while optimizing, or a reviewable record of what was tried and kept matters. Trigger for "run autoresearch", "optimize X in a loop", "set up experiments to make X faster", "keep trying changes until the benchmark improves", "tune this until the loss is low", or "measure it properly, don't guess". Less suited to one-shot fixes you already know — see the "When NOT to use this" note before reaching for it on trivial changes. |
Autoresearch
Autonomous experiment loop: try an idea, measure it, keep what improves the
metric, revert what doesn't, and keep going. Inspired by the autoresearch
pattern — hill-climbing a benchmark through many small, measured changes.
Modeled on pi-autoresearch by davebcn87 (itself inspired by
karpathy/autoresearch), reimplemented as a Claude skill. Credit to the
original authors.
The model is simple and the discipline is the whole point: never change code
without measuring the effect, and never keep a change that didn't help. Every
kept experiment becomes one commit on a single branch, so git log is a clean,
reviewable record of what actually moved the number.
You drive this loop yourself with three bundled scripts that handle the fiddly,
error-prone bookkeeping (timing, metric parsing, git commit/revert, the run log,
confidence scoring) so every iteration is consistent and the result is auditable.
What this buys you, and what it costs. The payoff is process: a reproducible
benchmark, a commit per real improvement, a record of what was tried and discarded,
and a confidence signal that distinguishes real gains from noise. The cost is real
too — the disciplined loop (many runs, writing the trail, re-running to confirm noisy
wins) takes more wall-clock and tokens than just making an edit. That trade is worth
it when you'll try many ideas, when the metric is noisy, when correctness must hold,
or when someone will review the history. It's not worth it for a change you already
know is right.
When NOT to use this
Be honest with yourself before starting — a heavyweight loop on a trivial task is waste.
Skip autoresearch (just make the change directly) when:
- You already know the fix. A one-shot edit you're confident in doesn't need a
measured loop; make it, verify once, done.
- The metric is trivial to eyeball and correctness is obvious. If "is it faster?"
is a single obvious yes, the bookkeeping adds nothing.
- No reviewable history is needed and you'll try only one or two things.
- There's no stable way to measure the target (the metric can't be made
repeatable, or a single run takes so long that iterating is impractical). Consider
whether a cheaper proxy metric exists first; if not, this loop isn't the right tool.
Reach for it when the work is genuinely iterative and measurement is the point.
The scripts (use these, don't reinvent them)
| Script | What it does |
|---|---|
| scripts/ar_run.sh | Runs the benchmark (./autoresearch.sh), times it, parses METRIC name=value lines, runs optional correctness checks. |
| scripts/ar_log.py | Records a run: commits on keep, reverts code on discard/crash/checks_failed (keeping autoresearch.*), appends to autoresearch.jsonl, computes confidence. |
| scripts/ar_status.py | Prints a dashboard from the log: baseline vs best, % gain, confidence, recent runs. Use it to reorient. |
Full semantics, JSONL format, the confidence formula, and resume instructions are
in references/mechanics.md — read it if anything below is ambiguous.
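To make the METRIC name=value contract concrete, here is a minimal Python sketch of how such lines could be pulled out of benchmark output. It is an illustration only: the function name parse_metrics and the regular expression are assumptions, and the bundled scripts/ar_run.sh may accept a stricter or looser syntax (references/mechanics.md is the authority).

```python
import re

# Hypothetical pattern: "METRIC <name>=<number>" on a line of its own.
# The real parser in scripts/ar_run.sh may differ.
_METRIC_RE = re.compile(
    r"^METRIC\s+([A-Za-z_][\w.-]*)=(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$"
)


def parse_metrics(output: str) -> dict[str, float]:
    """Collect METRIC name=value lines from benchmark output.

    Non-matching lines are ignored; if a name repeats, the last value wins.
    """
    metrics: dict[str, float] = {}
    for line in output.splitlines():
        match = _METRIC_RE.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


if __name__ == "__main__":
    sample = "running suite...\nMETRIC seconds=12.84\nMETRIC peak_mb=310\ndone\n"
    print(parse_metrics(sample))
```

A benchmark that prints ordinary log output alongside its METRIC lines would still parse cleanly under this sketch, because unrelated lines are skipped rather than treated as errors.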
Setup (do this once, then loop)
- Pin down the target. Ask, or infer from context:
  - Goal: what are we optimizing? (e.g. "speed up the test suite")
  - Command: what runs the workload? (e.g. pnpm test)
  - Metric + direction: what number, and is lower or higher better? (e.g. seconds, lower)
  - Files in scope: what may be changed?
  - Constraints: what must stay true? (tests pass, no new deps, identical behavior)
  Don't over-interrogate. If the user said "make the tests faster", you already
  have goal, command, and metric; confirm the rest only if it's genuinely unclear.
- Branch from a clean tree. First ensure the working tree is clean
(git status --porcelain is empty) — the loop's revert throws away uncommitted
changes to non-autoresearch.* files, so commit or stash the user's work first.
Then git checkout -b autoresearch/<goal-slug>-<date> to isolate the experiment
commits. (Not a git repo? git init and commit the starting state first — the loop
needs git to keep/revert, and without it a failed experiment cannot be rolled back.)
- Understand the workload before touching it. Read the source files in scope.
The best optimizations come from understanding what the code actually does, not
from random variation. This reading is not optional overhead — it's where the
ideas come from.
- Write autoresearch.md and autoresearch.sh. Copy the templates from
assets/ and fill them in. autoresearch.md is the heart of the session —
invest in it, because a fresh agent (or you after a context reset) resumes from
it alone. autoresearch.sh must print METRIC <primary>=<value> and be FAST
(every second is multiplied by hundreds of runs; for sub-5s noisy benchmarks,
run a few times and report the median so confidence is meaningful early).
Commit both.
- Initialize and take a baseline.
python3 scripts/ar_log.py --init --name "<session>" --metric-name <primary> --unit <ms|s|kb|...> --direction <lower|higher>
bash scripts/ar_run.sh
python3 scripts/ar_log.py --status keep --metric <baseline_value> --desc "baseline" --asi '{"hypothesis":"baseline measurement"}'
The baseline is the number every later run is judged against.
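For the sub-5s noisy case mentioned in the setup steps, the "run a few times and report the median" idea can be sketched in Python. This is a hypothetical helper that autoresearch.sh could call, not one of the bundled scripts: the function names, the default of 5 repeats, the metric name seconds, and the placeholder workload are all assumptions.

```python
import statistics
import subprocess
import sys
import time


def time_once(cmd: list[str]) -> float:
    """Run cmd once and return wall-clock seconds; raise if it exits non-zero."""
    start = time.perf_counter()
    subprocess.run(
        cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start


def median_seconds(cmd: list[str], repeats: int = 5) -> float:
    """Time cmd `repeats` times and return the median, which damps outliers."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    return statistics.median(time_once(cmd) for _ in range(repeats))


if __name__ == "__main__":
    # Placeholder workload; substitute the real command (e.g. the test runner).
    workload = [sys.executable, "-c", "sum(range(100000))"]
    print(f"METRIC seconds={median_seconds(workload):.4f}")
```

The median is used rather than the mean because a single slow run (a cold cache, a background process) shifts the mean but usually leaves the median untouched. Keep the repeat count small: the source's warning that every second is multiplied across many runs still applies.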
The loop
Repeat until a stop condition (see below). One iteration:
- Pick the most promising idea. From your understanding of the workload, the
autoresearch.ideas.md backlog, or the next_action_hint of recent runs.
- Make the change for ONE hypothesis. One idea per experiment, so the
measurement attributes cleanly. "One hypothesis" — not "one line": a single
coherent idea often spans several edits (swap a data structure → touch its
constructor, call sites, and serialization together). That's fine; it's still one
experiment. What to avoid is bundling unrelated ideas into one run, because then
a mixed result tells you nothing about which idea helped.
- Measure:
bash scripts/ar_run.sh
- Decide and record with scripts/ar_log.py:
  - Primary metric improved → --status keep (it gets committed).
  - Worse or unchanged → --status discard (code auto-reverts).
  - Benchmark failed to run → --status crash --metric 0.
  - Benchmark ran but checks failed → --status checks_failed.
  - Always pass --asi with at least {"hypothesis":"..."}. On discard/crash
    add rollback_reason and next_action_hint. After a revert the code is gone,
    so this annotation is the only memory that survives; it keeps future
    iterations from re-walking the same dead end.
- Periodically update autoresearch.md's "What's Been Tried" with curated wins
  and dead ends; it is more useful than the raw log.
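The decide-and-record step above can be sketched as a small Python helper. The status strings and the --asi keys (hypothesis, rollback_reason, next_action_hint) come from the text; the function names and signatures are hypothetical, and the sketch only chooses a status and builds the annotation, it does not call scripts/ar_log.py.

```python
import json
from typing import Optional


def choose_status(ran_ok: bool, checks_ok: bool, new_value: float,
                  best_value: float, direction: str) -> str:
    """Map one measured run to the --status value for scripts/ar_log.py."""
    if direction not in ("lower", "higher"):
        raise ValueError("direction must be 'lower' or 'higher'")
    if not ran_ok:
        return "crash"
    if not checks_ok:
        return "checks_failed"
    if direction == "lower":
        improved = new_value < best_value
    else:
        improved = new_value > best_value
    # "Worse or unchanged" both map to discard, so the comparison is strict.
    return "keep" if improved else "discard"


def build_asi(hypothesis: str, rollback_reason: Optional[str] = None,
              next_action_hint: Optional[str] = None) -> str:
    """Build the JSON string passed to --asi, omitting unset fields."""
    asi = {"hypothesis": hypothesis}
    if rollback_reason is not None:
        asi["rollback_reason"] = rollback_reason
    if next_action_hint is not None:
        asi["next_action_hint"] = next_action_hint
    return json.dumps(asi)


if __name__ == "__main__":
    print(choose_status(True, True, new_value=11.9, best_value=12.8, direction="lower"))
    print(build_asi("cache parsed config", "no measurable change", "profile I/O instead"))
```

A strict comparison treats any improvement as a keep, however small. On a noisy metric that is where the confidence score comes in: a marginal "keep" is worth re-running before it is trusted.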
Decision principles
- The primary metric is king. Secondary metrics are monitors; they almost never
  override a real primary win. Only discard a genuine improvement if a secondary
  degraded catastrophically, and explain why in --desc.
- Simpler is better. Removing code for equal performance? Keep it. Ugly
complexity for a tiny gain? Probably discard.
- Watch confidence. After 3+ runs the logger prints a confidence score (best
  improvement as a multiple of the noise floor). A score below 1.0× means the
  "win" is within noise, so re-run to confirm before trusting it. The score is
  advisory; you decide.
- Don't thrash. Reverting variations of the same idea repeatedly? Stop and try
something structurally different. Re-read the source, study the profiling output,
reason about what the machine is actually doing.
- Crashes: fix if trivial, else log and move on. Don't over-invest in a dead path.
- Park big ideas. Promising but expensive optimizations you won't do now →
  append a bullet to autoresearch.ideas.md. Don't lose them.
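To give a feel for "best improvement as a multiple of the noise floor", here is one way such a score could be computed. This is not the formula scripts/ar_log.py uses (that is documented in references/mechanics.md and may differ). The sketch assumes the noise floor is the median absolute deviation of repeated measurements of unchanged code, and every name in it is hypothetical.

```python
import statistics


def noise_floor(samples: list[float]) -> float:
    """Median absolute deviation of repeated measurements of the same code.

    Assumption: this stands in for the logger's own noise estimate.
    """
    center = statistics.median(samples)
    return statistics.median(abs(x - center) for x in samples)


def confidence(baseline: float, best: float, samples: list[float]) -> float:
    """Best improvement expressed as a multiple of the noise floor."""
    gain = abs(baseline - best)
    floor = noise_floor(samples)
    if floor == 0:
        # Perfectly repeatable measurements: any gain is outside the noise.
        return float("inf") if gain > 0 else 0.0
    return gain / floor


if __name__ == "__main__":
    reruns = [10.0, 12.0, 8.0, 10.0, 11.0]  # same code, measured five times
    print(confidence(baseline=10.0, best=7.0, samples=reruns))  # 3.0
    print(confidence(baseline=10.0, best=9.5, samples=reruns))  # 0.5
```

Under this sketch a gain of 3.0 against a noise floor of 1.0 scores 3.0×, while a gain of 0.5 scores 0.5× and falls inside the noise, which is the case the text says to re-run before trusting.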
Don't game the benchmark
The point is real improvement, and the loop only works if the metric stays honest.
The danger is structural: autoresearch.sh defines the metric, it's preserved
through every revert, and you're allowed to edit it mid-loop — so it's the easiest
thing to accidentally (or lazily) corrupt into reporting a fake win. Guard it
deliberately:
- Pin the workload. Fix the inputs the benchmark runs on (a fixed N, a specific
  input file). If you must change autoresearch.sh, log that change as its own
  experiment with a clear description. Never silently alter what's measured
  between runs, or your deltas become meaningless.
- Don't special-case the benchmark's inputs or weaken what it checks to move the
number. Don't peek at a known-answer and hardcode toward it — the goal is to
discover the win through measurement, not to reproduce an answer you looked up.
- If you ever feel tempted to make the number move without making the thing genuinely
better, stop and rethink. A fake win silently wastes every iteration built on top of
it, and the backpressure checks (if any) won't catch a metric that's lying.
Stop conditions
This loop is meant to run autonomously — don't stop after one experiment to ask
"should I continue?". Keep iterating until one of:
- The user gave a budget (e.g. "try 20 experiments" or maxIterations) and it's hit.
- You've genuinely exhausted promising ideas (the backlog is empty and recent runs
are all marginal-or-worse with high confidence).
- The user interrupts.
When you stop, run scripts/ar_status.py and report: baseline → best, % improvement,
how many experiments, what worked, what didn't, and what's left in the ideas backlog.
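The "% improvement" in that report depends on the metric's direction, which is easy to get backwards. A minimal sketch of the arithmetic follows; the function name is hypothetical, and scripts/ar_status.py may round or present the figure differently.

```python
def percent_improvement(baseline: float, best: float, direction: str) -> float:
    """Signed % gain relative to the baseline; positive means better."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    if direction == "lower":
        change = baseline - best
    elif direction == "higher":
        change = best - baseline
    else:
        raise ValueError("direction must be 'lower' or 'higher'")
    return change / abs(baseline) * 100.0


if __name__ == "__main__":
    # Test suite went from 20 s to 15 s (lower is better).
    print(percent_improvement(20.0, 15.0, "lower"))    # 25.0
    # Benchmark score went from 80 to 100 (higher is better).
    print(percent_improvement(80.0, 100.0, "higher"))  # 25.0
```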
How the autonomy actually works (be honest about it)
This is a resumable manual loop, not a background daemon. "Autonomously" means:
within a turn, you keep running iterations instead of stopping to ask permission. You
are the loop driver. There is no scheduler that wakes you up on its own.
What makes it survive interruptions is the persisted state, not magic: every result
is in autoresearch.jsonl and every keep is a commit, so a fresh agent can rebuild
full context from disk. To run hands-off across context limits, wrap the loop:
- Claude Code /loop: re-enters the skill each cycle. Give it a resume prompt like:
/loop continue the autoresearch session in this repo: read autoresearch.md, run scripts/ar_status.py, then run the next experiment
- ralph-loop plugin — a Stop hook re-feeds a prompt until a completion phrase. Use
the same resume instruction as the prompt.
Either way the resume steps below are what each cycle does.
Resuming an existing session
If autoresearch.jsonl already exists, don't start over — rebuild context from disk:
- Read autoresearch.md (curated context: objective, scope, what's been tried).
- Run scripts/ar_status.py for the run history, baseline, best, and confidence.
- Check autoresearch.ideas.md if present; prune stale entries.
- Skim git log to see the kept commits.
Recover from a mid-iteration crash first. The risky state is "I edited code but
hadn't logged the run yet" — the working tree is dirty and there's no matching jsonl
entry. Before continuing, check git status: if there are uncommitted non-autoresearch.*
changes with no corresponding final run in the log, you don't know if they help. The
safe move is to measure them as a fresh experiment (scripts/ar_run.sh →
scripts/ar_log.py) so they're either kept-and-committed or reverted — never carry an
unmeasured change forward. Then continue the loop normally.
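The git status check described above can be sketched as a filter over `git status --porcelain` output. This is a rough illustration under stated assumptions: session files are recognized by a basename starting with "autoresearch." (the bundled scripts may match differently), the function names are hypothetical, and paths containing unusual characters, which git quotes and escapes, are only minimally handled.

```python
import subprocess
from pathlib import PurePosixPath


def unmeasured_paths(porcelain: str) -> list[str]:
    """Return changed paths from `git status --porcelain` output,
    skipping session files whose basename starts with 'autoresearch.'."""
    paths = []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue
        path = line[3:]  # two status characters, one space, then the path
        if " -> " in path:  # rename entries look like "old -> new"
            path = path.split(" -> ", 1)[1]
        path = path.strip('"')
        if not PurePosixPath(path).name.startswith("autoresearch."):
            paths.append(path)
    return paths


def git_porcelain() -> str:
    """Read the working-tree status from git in the current directory."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    sample = " M src/app.py\n?? autoresearch.ideas.md\nM  autoresearch.jsonl\n"
    print(unmeasured_paths(sample))
```

If the returned list is non-empty and the log has no matching final run, those are the edits to measure as a fresh experiment before continuing. An empty list means only session files changed and the loop can resume directly.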