
autoresearch

Last updated: 6 June 2026, 22:40

Run an iterative, measured optimization loop — try an idea, benchmark it, keep what improves the metric, revert what doesn't, repeat — leaving an auditable trail of commits. Use this when the user wants to ITERATIVELY optimize a measurable target and expects to try many attempts: speeding up a test suite or build, shrinking a bundle/binary, cutting latency or memory, tuning hyperparameters to lower a loss, raising a Lighthouse/benchmark score. Strong fit when the metric is NOISY (needs repeated runs to trust), correctness must be preserved while optimizing, or a reviewable record of what was tried and kept matters. Trigger for "run autoresearch", "optimize X in a loop", "set up experiments to make X faster", "keep trying changes until the benchmark improves", "tune this until the loss is low", or "measure it properly, don't guess". Less suited to one-shot fixes you already know — see the "When NOT to use this" note before reaching for it on trivial changes.

Source: agusmdev/autoresearch-skill (GitHub repository by agusmdev)

SKILL.md

name: autoresearch

Autoresearch

Autonomous experiment loop: try an idea, measure it, keep what improves the metric, revert what doesn't, and keep going. Inspired by the autoresearch pattern — hill-climbing a benchmark through many small, measured changes.

Modeled on pi-autoresearch by davebcn87 (itself inspired by karpathy/autoresearch), reimplemented as a Claude skill. Credit to the original authors.

The model is simple and the discipline is the whole point: never change code without measuring the effect, and never keep a change that didn't help. Every kept experiment becomes one commit on a single branch, so git log is a clean, reviewable record of what actually moved the number.

You drive this loop yourself with three bundled scripts that handle the fiddly, error-prone bookkeeping (timing, metric parsing, git commit/revert, the run log, confidence scoring) so every iteration is consistent and the result is auditable.

What this buys you, and what it costs. The payoff is process: a reproducible benchmark, a commit per real improvement, a record of what was tried and discarded, and a confidence signal that distinguishes real gains from noise. The cost is real too — the disciplined loop (many runs, writing the trail, re-running to confirm noisy wins) takes more wall-clock and tokens than just making an edit. That trade is worth it when you'll try many ideas, when the metric is noisy, when correctness must hold, or when someone will review the history. It's not worth it for a change you already know is right.

When NOT to use this

Be honest with yourself before starting — a heavyweight loop on a trivial task is waste. Skip autoresearch (just make the change directly) when:

- You already know the fix. A one-shot edit you're confident in doesn't need a measured loop; make it, verify once, done.
- The metric is trivial to eyeball and correctness is obvious. If "is it faster?" is a single obvious yes, the bookkeeping adds nothing.
- No reviewable history is needed and you'll try only one or two things.
- There's no stable way to measure the target (the metric can't be made repeatable, or a single run takes so long that iterating is impractical). Consider whether a cheaper proxy metric exists first; if not, this loop isn't the right tool.

Reach for it when the work is genuinely iterative and measurement is the point.

The scripts (use these, don't reinvent them)

| Script | What it does |
| --- | --- |
| `scripts/ar_run.sh` | Runs the benchmark (`./autoresearch.sh`), times it, parses `METRIC name=value` lines, runs optional correctness checks. |
| `scripts/ar_log.py` | Records a run: commits on `keep`, reverts code on `discard`/`crash`/`checks_failed` (keeping `autoresearch.*`), appends to `autoresearch.jsonl`, computes confidence. |
| `scripts/ar_status.py` | Prints a dashboard from the log: baseline vs best, % gain, confidence, recent runs. Use it to reorient. |

Full semantics, JSONL format, the confidence formula, and resume instructions are in references/mechanics.md — read it if anything below is ambiguous.
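To make the `METRIC name=value` contract concrete, here is a minimal Python sketch of how such lines could be pulled out of benchmark output. It is illustrative only: the bundled `scripts/ar_run.sh` is a shell script whose real parsing rules are not shown in this document, and the function name `parse_metrics`, the accepted name characters, and the "last value wins" behavior are all assumptions.

```python
import re

# Assumed line shape: "METRIC <name>=<number>", one metric per line.
# The real parser in scripts/ar_run.sh may accept a different grammar.
METRIC_RE = re.compile(
    r"^METRIC\s+([A-Za-z_][\w.-]*)=([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)$"
)


def parse_metrics(output: str) -> dict[str, float]:
    """Collect METRIC name=value pairs from benchmark output.

    Non-matching lines (build noise, test output) are ignored.
    If a name appears twice, the later value wins (an assumption).
    """
    metrics: dict[str, float] = {}
    for line in output.splitlines():
        match = METRIC_RE.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics
```

Keeping the format this strict means ordinary log output that happens to mention a metric cannot be mistaken for a measurement.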

Setup (do this once, then loop)

1. Pin down the target. Ask, or infer from context:
   - Goal: what are we optimizing? (e.g. "speed up the test suite")
   - Command: what runs the workload? (e.g. `pnpm test`)
   - Metric + direction: what number, and is lower or higher better? (e.g. seconds, lower)
   - Files in scope: what may be changed?
   - Constraints: what must stay true? (tests pass, no new deps, identical behavior)

   Don't over-interrogate. If the user said "make the tests faster", you already have goal, command, and metric; confirm the rest only if it's genuinely unclear.
2. Branch from a clean tree. First ensure the working tree is clean (`git status --porcelain` prints nothing), because the loop's revert throws away uncommitted changes to non-`autoresearch.*` files; commit or stash the user's work first. Then run `git checkout -b autoresearch/<goal-slug>-<date>` to isolate the experiment commits. (Not a git repo? Run `git init` and commit the starting state first. The loop needs git to keep and revert, and without it a failed experiment cannot be rolled back.)
3. Understand the workload before touching it. Read the source files in scope. The best optimizations come from understanding what the code actually does, not from random variation. This reading is not optional overhead; it's where the ideas come from.
4. Write `autoresearch.md` and `autoresearch.sh`. Copy the templates from `assets/` and fill them in. `autoresearch.md` is the heart of the session: invest in it, because a fresh agent (or you after a context reset) resumes from it alone. `autoresearch.sh` must print `METRIC <primary>=<value>` and be FAST (every second is multiplied by hundreds of runs; for noisy benchmarks that finish in under 5 s, run them a few times and report the median so confidence is meaningful early). Commit both.

5. Initialize and take a baseline.

python3 scripts/ar_log.py --init --name "<session>" --metric-name <primary> --unit <ms|s|kb|...> --direction <lower|higher>
bash scripts/ar_run.sh
python3 scripts/ar_log.py --status keep --metric <baseline_value> --desc "baseline" --asi '{"hypothesis":"baseline measurement"}'

The baseline is the number every later run is judged against.
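The setup asks `autoresearch.sh` to report a median for short, noisy benchmarks. One way to do that is to have the shell script call a small Python helper like the sketch below. The helper names (`time_command`, `median_metric_line`), the metric name `total_s`, and the default of five repeats are hypothetical choices, not part of the skill.

```python
import statistics
import subprocess
import time
from typing import Callable


def time_command(cmd: list[str]) -> float:
    """Run cmd once and return wall-clock seconds; raises if it exits non-zero."""
    start = time.perf_counter()
    subprocess.run(
        cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start


def median_metric_line(
    name: str, run_once: Callable[[], float], repeats: int = 5
) -> str:
    """Take `repeats` samples and format their median as a METRIC line."""
    samples = [run_once() for _ in range(repeats)]
    return f"METRIC {name}={statistics.median(samples):.3f}"
```

A script invoked from `autoresearch.sh` could then end with `print(median_metric_line("total_s", lambda: time_command(["pnpm", "test"])))`. The median is used rather than the mean so that a single slow outlier run does not shift the reported number.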

The loop

Repeat until a stop condition (see below). One iteration:

1. Pick the most promising idea. Draw on your understanding of the workload, the `autoresearch.ideas.md` backlog, or the `next_action_hint` of recent runs.
2. Make the change for ONE hypothesis. One idea per experiment, so the measurement attributes cleanly. "One hypothesis" does not mean "one line": a single coherent idea often spans several edits (swapping a data structure means touching its constructor, call sites, and serialization together). That's fine; it's still one experiment. What to avoid is bundling unrelated ideas into one run, because then a mixed result tells you nothing about which idea helped.
3. Measure: `bash scripts/ar_run.sh`
4. Decide and record with `scripts/ar_log.py`:
   - Primary metric improved → `--status keep` (it gets committed).
   - Worse or unchanged → `--status discard` (code auto-reverts).
   - Benchmark failed to run → `--status crash --metric 0`.
   - Benchmark ran but checks failed → `--status checks_failed`.
   - Always pass `--asi` with at least `{"hypothesis":"..."}`. On discard/crash add `rollback_reason` and `next_action_hint`. After a revert the code is gone, and this annotation is the only memory that survives, so future iterations don't re-walk the same dead end.
5. Periodically update the "What's Been Tried" section of `autoresearch.md` with curated wins and dead ends; it is more useful than the raw log.
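The four outcomes in the decide step can be written down as a small function. This is only a sketch of the rule as stated above: the name `decide_status` and its parameters are hypothetical, it compares against the best kept value so far (an assumption), and in practice the final call is yours, informed by confidence and secondary metrics.

```python
from typing import Optional


def decide_status(
    new_value: Optional[float],
    best_so_far: float,
    direction: str,
    ran_ok: bool = True,
    checks_ok: bool = True,
) -> str:
    """Map one measured run to the --status value to log."""
    if direction not in ("lower", "higher"):
        raise ValueError("direction must be 'lower' or 'higher'")
    if not ran_ok or new_value is None:
        return "crash"  # benchmark failed to run
    if not checks_ok:
        return "checks_failed"  # benchmark ran, correctness checks did not pass
    if direction == "lower":
        improved = new_value < best_so_far
    else:
        improved = new_value > best_so_far
    # "Unchanged" counts as discard: a change that did not help is not kept.
    return "keep" if improved else "discard"
```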

Decision principles

- The primary metric is king. Secondary metrics are monitors; they almost never override a real primary win. Only discard a genuine improvement if a secondary degraded catastrophically, and explain why in `--desc`.
- Simpler is better. Removing code for equal performance? Keep it. Ugly complexity for a tiny gain? Probably discard.
- Watch confidence. After 3+ runs the logger prints a confidence score (best improvement as a multiple of the noise floor). <1.0× means the "win" is within noise, so re-run to confirm before trusting it. It's advisory; you decide.
- Don't thrash. Reverting variations of the same idea repeatedly? Stop and try something structurally different. Re-read the source, study the profiling output, reason about what the machine is actually doing.
- Crashes: fix if trivial, else log and move on. Don't over-invest in a dead path.
- Park big ideas. Promising but expensive optimizations you won't do now → append a bullet to `autoresearch.ideas.md`. Don't lose them.
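To give a feel for "best improvement as a multiple of the noise floor", here is one plausible way to compute such a score. It is an assumption, not the logger's formula: the real definition lives in references/mechanics.md and may differ. This sketch takes the noise floor to be the median absolute deviation (MAD) of all recorded values, and the function name `confidence` is hypothetical.

```python
import statistics
from typing import Optional


def confidence(
    values: list[float], baseline: float, direction: str
) -> Optional[float]:
    """Best gain over baseline divided by an assumed noise floor (MAD).

    Returns None when there are fewer than 3 runs or no measurable spread.
    """
    if len(values) < 3:
        return None
    center = statistics.median(values)
    noise_floor = statistics.median([abs(v - center) for v in values])
    if noise_floor == 0:
        return None
    if direction == "lower":
        gain = baseline - min(values)
    else:
        gain = max(values) - baseline
    # A regression has no "improvement" to be confident about.
    return max(gain, 0.0) / noise_floor
```

Under this definition, runs of 100, 98, 102 and 90 against a baseline of 100 (lower is better) have a noise floor of 2 and a best gain of 10, so the score is 5.0×. A gain of 1 on the same spread would score 0.5×, which is the "within noise, re-run before trusting" case.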

Don't game the benchmark

The point is real improvement, and the loop only works if the metric stays honest. The danger is structural: autoresearch.sh defines the metric, it's preserved through every revert, and you're allowed to edit it mid-loop — so it's the easiest thing to accidentally (or lazily) corrupt into reporting a fake win. Guard it deliberately:

- Pin the workload. Fix the inputs the benchmark runs on (a fixed N, a specific input file). If you must change `autoresearch.sh`, log that change as its own experiment with a clear description. Never silently alter what's measured between runs, or your deltas become meaningless.
- Don't special-case the benchmark's inputs or weaken what it checks to move the number. Don't peek at a known answer and hardcode toward it; the goal is to discover the win through measurement, not to reproduce an answer you looked up.
- If you ever feel tempted to make the number move without making the thing genuinely better, stop and rethink. A fake win silently wastes every iteration built on top of it, and the backpressure checks (if any) won't catch a metric that's lying.
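One cheap way to keep the workload pinned is to record a checksum of the benchmark's input once and refuse to measure if it ever changes. The sketch below is an optional guard, not something the bundled scripts are known to do; the function names `sha256_hex` and `input_is_pinned` are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the benchmark input bytes."""
    return hashlib.sha256(data).hexdigest()


def input_is_pinned(data: bytes, expected_hex: str) -> bool:
    """True when the input still matches the digest recorded at baseline time."""
    return sha256_hex(data) == expected_hex.lower()
```

At baseline time you would store `sha256_hex(Path("bench_input.bin").read_bytes())` (the file name is a placeholder) in `autoresearch.md`, and have the benchmark exit non-zero before printing any `METRIC` line when `input_is_pinned` returns False. That turns a silently changed workload into a visible crash instead of a misleading delta.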

Stop conditions

This loop is meant to run autonomously — don't stop after one experiment to ask "should I continue?". Keep iterating until one of:

- The user gave a budget (e.g. "try 20 experiments" or maxIterations) and it's hit.
- You've genuinely exhausted promising ideas (the backlog is empty and recent runs are all marginal-or-worse with high confidence).
- The user interrupts.

When you stop, run scripts/ar_status.py and report: baseline → best, % improvement, how many experiments, what worked, what didn't, and what's left in the ideas backlog.
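For the final report, the "% improvement" figure depends on the metric's direction. `scripts/ar_status.py` prints this for you; the sketch below only spells out the arithmetic, assuming the gain is expressed relative to the baseline. The function name `percent_improvement` is hypothetical.

```python
def percent_improvement(baseline: float, best: float, direction: str) -> float:
    """Gain of best over baseline, as a percentage of the baseline.

    Positive means better, whichever direction counts as better.
    """
    if direction not in ("lower", "higher"):
        raise ValueError("direction must be 'lower' or 'higher'")
    if baseline == 0:
        raise ValueError("baseline of 0 has no meaningful percentage")
    delta = baseline - best if direction == "lower" else best - baseline
    return delta / abs(baseline) * 100.0
```

For example, a test suite that went from 12.0 s to 9.0 s (lower is better) improved by 25%, and a score that rose from 80 to 100 (higher is better) also improved by 25%.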

How the autonomy actually works (be honest about it)

This is a resumable manual loop, not a background daemon. "Autonomously" means: within a turn, you keep running iterations instead of stopping to ask permission. You are the loop driver. There is no scheduler that wakes you up on its own.

What makes it survive interruptions is the persisted state, not magic: every result is in autoresearch.jsonl and every keep is a commit, so a fresh agent can rebuild full context from disk. To run hands-off across context limits, wrap the loop:

- Claude Code /loop: re-enters the skill each cycle. Give it a resume prompt like: /loop continue the autoresearch session in this repo: read autoresearch.md, run scripts/ar_status.py, then run the next experiment
- ralph-loop plugin: a Stop hook re-feeds a prompt until a completion phrase. Use the same resume instruction as the prompt.

Either way the resume steps below are what each cycle does.

Resuming an existing session

If autoresearch.jsonl already exists, don't start over — rebuild context from disk:

1. Read autoresearch.md (curated context: objective, scope, what's been tried).
2. Run scripts/ar_status.py for the run history, baseline, best, and confidence.
3. Check autoresearch.ideas.md if present; prune stale entries.
4. Skim git log to see the kept commits.

Recover from a mid-iteration crash first. The risky state is "I edited code but hadn't logged the run yet" — the working tree is dirty and there's no matching jsonl entry. Before continuing, check git status: if there are uncommitted non-autoresearch.* changes with no corresponding final run in the log, you don't know if they help. The safe move is to measure them as a fresh experiment (scripts/ar_run.sh → scripts/ar_log.py) so they're either kept-and-committed or reverted — never carry an unmeasured change forward. Then continue the loop normally.
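The crash-recovery check can be sketched as a filter over `git status --porcelain` output. This is an illustration under stated assumptions: the function name `unmeasured_changes` is hypothetical, "`autoresearch.*` files" is interpreted as any file whose base name starts with `autoresearch.`, and paths that git quotes (those containing special characters) are not unquoted.

```python
from pathlib import PurePosixPath


def unmeasured_changes(porcelain: str) -> list[str]:
    """List changed paths that are not autoresearch.* session files.

    Expects `git status --porcelain` (v1) output: two status characters,
    a space, then the path. Renames appear as "old -> new".
    """
    paths: list[str] = []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue  # blank or malformed line
        path = line[3:]
        if " -> " in path:
            path = path.split(" -> ", 1)[1]  # keep the rename target
        if not PurePosixPath(path).name.startswith("autoresearch."):
            paths.append(path)
    return paths
```

You could feed it `subprocess.run(["git", "status", "--porcelain"], capture_output=True, text=True, check=True).stdout`. A non-empty result with no matching final entry in `autoresearch.jsonl` is the signal to measure those edits as a fresh experiment before doing anything else.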