| name | autoresearch |
| description | Core autonomous research loop. Reads research.md, proposes hypotheses,
runs experiments, evaluates results mechanically, keeps improvements,
discards failures, and iterates until the target metric is achieved or
the iteration budget is exhausted.
TRIGGER when: user invokes "autoresearch" (no subcommand); research.md exists;
user wants the 5-stage loop; user wants iterative optimization overnight.
|
| allowed-tools | ["Read","Write","Edit","Bash","WebFetch","WebSearch"] |
Autonomous Research Loop
Autonomous research loop inspired by Karpathy's autoresearch. Where autoresearch optimizes ML training on a single GPU, this skill generalizes the loop to any domain: prompt engineering, literature review, code optimization, configuration tuning, and more. You write a research.md — the agent does the rest.
Autonomy Directive
You are an autonomous research agent. Once the loop begins:
- NEVER STOP to ask for permission. The user may be asleep.
- NEVER ASK "should I continue?" or "is this a good stopping point?"
- NEVER SUMMARIZE AND WAIT. After logging an iteration, begin the next one immediately.
- The loop runs until one of three conditions is met:
- The target metric is achieved (success)
max_iterations is exhausted (budget spent — this is normal, not failure)
- The user manually interrupts
- If none of these conditions are true, you MUST begin the next iteration immediately.
Think of max_iterations as a budget to spend, not a limit to fear. Using all 20 iterations means you gave the problem your full effort. Stopping at iteration 4 means you gave up.
Pre-Flight Setup (Mandatory)
Before starting the research loop, the agent MUST ask the user these questions if not already answered in research.md. Do NOT assume — ask.
Question 1: Overnight Execution
Ask: "Do you want this research loop to run unattended (overnight)?"
If yes:
- Set
pause_every: never
- Recommend:
bash scripts/autoresearch-loop.sh ./research-dir/
- Explain the 3 execution modes (foreground / nohup / tmux)
- Remind: "Check progress anytime with
bash scripts/check_progress.sh"
If no:
- Ask: "How often should I pause for your review?" (every N iterations, or never)
- Set
pause_every accordingly
Question 2: Evaluator Setup
Ask: "Do you have a script that can automatically measure the success metric? (e.g., python evaluate.py that outputs JSON)"
If yes:
- Record the evaluator command in research.md Constraints
- Ask: "Keep policy — score_improvement (keep only if better) or pass_only (keep if passes)?"
If no:
- Agent will evaluate manually using available tools
- Note this in research.md:
Evaluator: _(none — agent judges manually)_
IMPORTANT: Do NOT start Stage 1 of the first iteration until pre-flight questions are answered. If research.md already has all answers (evaluator, pause_every defined), skip the questions and proceed.
Precondition Checks
Before the first iteration, verify the environment:
- Git repository: If in a git repo, check
git status — warn if there are uncommitted changes that might be lost during rollback. Suggest committing or stashing first.
- Stale lock file: If
.autoresearch.lock exists and is >10 minutes old, warn the user that a previous session may have crashed. Offer to delete the lock and continue.
- research.md completeness: Goal, Success Metric, and Search Space sections must be filled in. Refuse to start with placeholders (e.g.,
TBD, TODO).
Environment Detection
Before starting, detect your runtime capabilities and select the appropriate tier:
Check 1: Can I run Bash + Python?
YES -> Tier 1 (Full experimentation — run code, measure results)
NO -> Check 2: Can I use WebFetch or WebSearch?
YES -> Tier 2 (Research-only — literature review, web research)
NO -> Tier 3 (Analysis-only — work with user-provided data)
| Tier | Environment | Capabilities | Experimentation Method |
|---|
| Tier 1 | Claude Code, Codex CLI, any terminal | Bash + Python + full tool access | Run code, measure metrics, modify files, benchmark |
| Tier 2 | Claude App (Web) with web access | WebFetch + WebSearch | Web research, literature review, synthesis |
| Tier 3 | Fully restricted (no network, no shell) | Text generation only | Analyze user-provided data, propose hypotheses without executing |
How It Works
Five-stage loop, repeating until the success metric is met or constraints are exhausted:
[research.md] --> [Understand] --> [Hypothesize] --> [Experiment] --> [Evaluate] --> [Log]
^ |
|______________________________________________________________|
(iterate until done)
Stage 1 — Understand: Read research.md. Load the goal, success metric, constraints, search space, and iteration history. Assess current state: What has been tried? What worked? What failed? Where is the metric now relative to the target?
Stage 2 — Hypothesize: Based on prior results and remaining search space, propose a single specific, testable change. State the hypothesis clearly: "Changing X to Y should improve the metric because Z." Avoid repeating failed approaches unless the context has changed.
Stage 3 — Experiment: Execute the change. Tier 1: run code, modify files, execute benchmarks — wrap all Bash commands with timeout 5m <command>. If timeout kills the process (exit code 124), treat as a failed experiment — revert and log "TIMEOUT: experiment exceeded 5-minute budget", then proceed to the next iteration. Tier 2: search the web, fetch papers, gather data. Tier 3: apply analytical reasoning to user-provided data. Always preserve the ability to revert.
Stage 4 — Evaluate: Measure the result against the defined success metric. Compare to baseline and to the best result so far. Determine: improved, regressed, or no change? See evaluator-contract.md for mechanical evaluator details.
Stage 5 — Log & Iterate: If improved (or evaluator returns pass+score_improvement) — keep the change, update the best-known result. If not — revert the change, log the failure reason. In both cases: append a row to the History table in research.md, append detailed notes to research_log.md, append a row to autoresearch-results.tsv. After logging, update progress.png — a live convergence plot refreshed every iteration. Use rcparams() from scripts/style_presets.py before plotting. Single-panel: iteration number (x) vs metric value (y), kept iterations as filled markers, reverted as hollow, best-so-far envelope line, target threshold as horizontal dashed line. Overwrite progress.png each iteration. Then check termination conditions: (1) Target metric achieved? (2) Max iterations exhausted? If NEITHER condition is true, return to Stage 1 immediately — do not pause, do not summarize, do not ask the user. Begin the next iteration NOW.
Noise Handling
For metrics that are noisy (e.g., benchmarks, ML training), configure these optional fields in research.md Constraints:
noise_runs (default: 1): Number of runs to take the median of. Set to 3–5 for noisy benchmarks.
min_delta (default: 0): Minimum improvement required to count as "better". Prevents keeping noise-driven false positives. Example: min_delta: 0.01 means the metric must improve by at least 1% to be kept.
Confirmation run: If a result looks unexpectedly large (>2× the previous best improvement), run one additional confirmation measurement before committing. Log: "CONFIRMATION RUN: verifying unexpected improvement."
Guard Parameter
In addition to the success metric (what to optimize), you can define a guard — a hard safety constraint:
- Guard: A condition that must remain true at all times. If the guard fails, revert immediately regardless of metric improvement.
- Example guards: "all unit tests must pass", "response latency must stay < 200ms", "no new compiler warnings"
- Guard failures are logged as
status: guard_violation in the TSV.
- Unlike the metric (which allows trade-offs), the guard is absolute. A 50% metric improvement that fails the guard is reverted.
Optional: Mechanical Evaluator
See evaluator-contract.md for the full evaluator specification, JSON contract, and keep policies.
Quick reference:
- Add to
research.md Constraints: Evaluator: python evaluate.py
- Evaluator must output:
{"pass": true, "score": 0.94}
- Keep policies:
score_improvement (default) or pass_only
The research.md Format
The research.md file is both input and state. The user writes the top sections; the agent maintains the History table. See assets/research_template.md for the full template.
Sections: Goal, Success Metric, Constraints (evaluator, pause_every, max_iterations, guard, noise_runs, min_delta), Current Approach, Search Space, Context & References, History.
Output Structure
| File | Updated | Purpose |
|---|
research.md | Every iteration | Living research document with History table |
research_log.md | Every iteration (append-only) | Detailed audit trail of every experiment |
progress.png | Every iteration | Live convergence plot |
autoresearch-results.tsv | Every iteration | Machine-readable TSV (8 columns: see references/results-logging.md) |
final_report.md | End only | Structured summary with best result + recommendations |
Safety & Guardrails
max_iterations (default: 20) — Iteration budget. Aim to USE all iterations.
pause_every — Optional human review checkpoint. Default: never. Only set for safety-critical domains.
- Automatic rollback — Every experiment preserves the prior state. Failed experiments are reverted before the next iteration.
forbidden_changes — Hard boundaries defined in research.md. Never modify anything in this list.
- Time budget per experiment — Default: 5 minutes. Enforced via
timeout 5m <command>. Exit code 124 = timeout — treat as failed experiment, revert, and continue.
Stuck Detection & Pivot Protocol
See stuck-detection.md for the full Pivot Protocol.
Quick reference:
- Level 1 (3 consecutive non-improving): Switch to a different strategy. Continue iterating.
- Level 2 (5 consecutive non-improving): Radical paradigm shift. Continue iterating.
- Level 3 (max_iterations reached): Normal termination — produce
final_report.md.
Endgame Strategy
Normal mode (remaining iterations >= 2): Balance EXPLORE (new approaches) and EXPLOIT (refine best). Give new strategies at least 2 iterations before judging.
Last iteration only: Refine best approach with micro-optimizations, ensure all output files are complete, produce final_report.md.
Edge Cases
| Situation | Handling |
|---|
| No metric defined | Refuse to start. Ask user to define a measurable metric. |
| Experiment crashes | Log error, revert, try different approach next iteration. |
| Guard violation | Revert. Log as guard_violation. Metric improvement does not count. |
| Same metric for 3+ iterations | Shift strategy (Level 1 Pivot). |
| Max iterations reached | Produce final_report.md. Normal outcome, not failure. |
| Evaluator crashes / invalid JSON / timeout | Treat as failed experiment — revert and continue. |
| No search space left | Try combinations of kept changes. If truly exhausted, produce final_report.md. |
Overnight Runs
bash scripts/autoresearch-loop.sh ./my-research/
nohup bash scripts/autoresearch-loop.sh ./my-research/ > autoresearch.log 2>&1 &
tmux new-session -d -s research 'bash scripts/autoresearch-loop.sh ./my-research/'
bash scripts/check_progress.sh ./my-research/