autoresearch

Updated April 5, 2026 at 16:58

Autonomous codebase improvement loop inspired by Karpathy's autoresearch. Iteratively improves code quality, test coverage, performance, and architecture using multi-metric evaluation with diminishing returns detection. USE WHEN user wants to iteratively improve a codebase, run autonomous code improvement, or apply the autoresearch pattern.

Source: dhanesh/autoresearch

SKILL.md

/autoresearch — Autonomous Codebase Improvement Loop

You are executing the autoresearch skill — an autonomous improve-evaluate-iterate loop that converges on measurable codebase improvements. Follow this protocol EXACTLY.

Arguments

Parse the following from the ARGUMENTS string:

- [scope]: file or directory path(s) to improve (default: auto-discover)
- --profile <name>: preset profile, one of quality, performance, coverage (default: interactive discovery)
- --max-iterations <n>: override max iterations (default: 20)
- --time-box <seconds>: override per-iteration time box (default: 120)
- --resume: resume a previous run from .autoresearch/state.json
- --dry-run: run discovery only, show what would be evaluated, don't loop
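
As a minimal sketch of how this argument list could be parsed, the following uses Python's standard `argparse`. The function name `parse_arguments` is hypothetical, and treating an empty scope list as "auto-discover" is an assumption; the skill itself does not prescribe a parser.

```python
import argparse
import shlex


def parse_arguments(arguments: str) -> argparse.Namespace:
    """Parse the raw ARGUMENTS string into the options listed above."""
    parser = argparse.ArgumentParser(prog="/autoresearch")
    # An empty scope list stands for "auto-discover" (assumption).
    parser.add_argument("scope", nargs="*", default=[])
    parser.add_argument(
        "--profile",
        choices=["quality", "performance", "coverage"],
        default=None,  # None stands for interactive discovery
    )
    parser.add_argument("--max-iterations", type=int, default=20)
    parser.add_argument("--time-box", type=int, default=120)
    parser.add_argument("--resume", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    return parser.parse_args(shlex.split(arguments))


args = parse_arguments("src/ --profile quality --max-iterations 5")
print(args.scope, args.profile, args.max_iterations, args.time_box)
```

Unspecified options fall back to the documented defaults (20 iterations, 120-second time box).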

Phase 1: Constraint Discovery

Step 1.1: Codebase Introspection

Analyze the target codebase to detect its tooling:

1. Use Glob to find: package.json, tsconfig.json, pyproject.toml, Cargo.toml, go.mod
2. Use Glob to find: .eslintrc*, eslint.config.*, biome.json, .flake8, ruff.toml
3. Use Glob to find: vitest.config.*, jest.config.*, pytest.ini, conftest.py
4. Read detected config files to understand the setup
5. Count files in scope using Glob

Build a CodebaseProfile from findings.
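
The introspection steps above can be sketched with `pathlib` globbing. The skill names a `CodebaseProfile` but does not define it, so the field names, the `introspect` function, and the choice to search only the project root are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path

# Glob patterns taken from the three detection steps above.
MANIFESTS = ["package.json", "tsconfig.json", "pyproject.toml", "Cargo.toml", "go.mod"]
LINTERS = [".eslintrc*", "eslint.config.*", "biome.json", ".flake8", "ruff.toml"]
TEST_CONFIGS = ["vitest.config.*", "jest.config.*", "pytest.ini", "conftest.py"]


@dataclass
class CodebaseProfile:
    """Hypothetical shape for the detected tooling."""
    manifests: list[str] = field(default_factory=list)
    linters: list[str] = field(default_factory=list)
    test_configs: list[str] = field(default_factory=list)
    file_count: int = 0


def _find(root: Path, patterns: list[str]) -> list[str]:
    # Collect matching file names in the project root, de-duplicated and sorted.
    return sorted({p.name for pattern in patterns for p in root.glob(pattern)})


def introspect(root: Path, scope: str = ".") -> CodebaseProfile:
    """Detect tooling at the root and count regular files under the scope."""
    scope_dir = root / scope
    return CodebaseProfile(
        manifests=_find(root, MANIFESTS),
        linters=_find(root, LINTERS),
        test_configs=_find(root, TEST_CONFIGS),
        file_count=sum(1 for p in scope_dir.rglob("*") if p.is_file()),
    )
```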

Step 1.2: Profile Selection or Interactive Discovery

If --profile is specified: Load the preset from profiles/<name>.json. Replace any auto-detect commands in the preset with the actual commands for the detected tooling.

If no profile is specified: Use AskUserQuestion for hybrid discovery:

- Present the codebase analysis findings
- Propose default constraints based on detected tools
- Ask the user to accept, reject, or modify each proposed constraint
- Ask whether they want to add custom evaluation commands
- Ask for a time box preference (suggest one based on codebase size)
- Ask for an iteration cap preference

CRITICAL: Every accepted constraint MUST have a runnable command. Validate by executing each command once with a 10-second timeout. Report any that fail.

Step 1.3: Timing Calibration

Run each evaluation command once and measure duration:

time <command>

If total eval time exceeds 40% of the time box:

- Show the timing breakdown to the user
- Ask which mechanisms to keep via AskUserQuestion
- Drop the slowest mechanisms that don't fit
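
The 40% budget check can be illustrated with a small helper. This is a sketch of only the automatic fallback (dropping the slowest mechanisms until the rest fit); it leaves out the user prompt, and the name `fit_to_budget` and its parameters are hypothetical.

```python
def fit_to_budget(
    durations: dict[str, float],
    time_box_s: float,
    max_share: float = 0.4,
) -> list[str]:
    """Return the mechanisms whose total eval time fits within
    max_share of the time box, dropping the slowest ones first."""
    budget = time_box_s * max_share
    kept = dict(durations)
    while kept and sum(kept.values()) > budget:
        slowest = max(kept, key=kept.get)
        del kept[slowest]
    return list(kept)


# With a 120 s time box the evaluation budget is 48 s.
print(fit_to_budget({"lint": 5.0, "tests": 30.0, "e2e": 60.0}, time_box_s=120))
```

In this example the 60-second mechanism is dropped because the three together take 95 seconds, while the remaining two take 35 seconds and fit.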

Step 1.4: Command Registration

For each accepted constraint:

- Record the exact command string
- Compute its SHA-256 hash: printf '%s' "<command>" | shasum -a 256
- Store the command and its hash in the constraint config

These commands are now LOCKED. During the loop you MUST NOT modify a registered command, and you MUST NOT generate or execute any command outside this registered set.
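
A minimal sketch of the registration and later verification, assuming the hash is taken over the UTF-8 bytes of the command string with no trailing newline (which is what the shell pipeline above produces). The `CommandRegistry` class and its method names are hypothetical.

```python
import hashlib


def command_hash(command: str) -> str:
    """SHA-256 hex digest of the exact command string."""
    return hashlib.sha256(command.encode("utf-8")).hexdigest()


class CommandRegistry:
    """Holds the locked set of evaluation commands and their hashes."""

    def __init__(self) -> None:
        self._hashes: dict[str, str] = {}

    def register(self, name: str, command: str) -> str:
        digest = command_hash(command)
        self._hashes[name] = digest
        return digest

    def verify(self, name: str, command: str) -> bool:
        # False for unknown names and for any command whose text has changed.
        return self._hashes.get(name) == command_hash(command)


registry = CommandRegistry()
registry.register("eval-tests", "bun test")
print(registry.verify("eval-tests", "bun test"))          # unchanged command
print(registry.verify("eval-tests", "bun test --watch"))  # modified command
```

Calling `verify` before every execution gives a cheap check that the command about to run is byte-for-byte the one the user accepted.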

Step 1.5: Pre-Flight Permissions

Build the permission manifest and obtain all permissions BEFORE entering the loop:

1. Build permission manifest from finalized constraints + scope:
   - Bash permissions: one per registered eval command
   - Write permissions: one per scope path + .autoresearch/
   - Git permissions: checkout, add, commit, checkout --, diff, status

2. Group permissions by purpose and display the Pre-Flight Checklist:
   "Pre-Flight Permission Checklist:
    Evaluation (2 permissions):
      - bash: bun run lint (required, for eval-lint)
      - bash: bun test (required, for eval-tests)
    Improvement (2 permissions):
      - write: src/ (required, for loop)
      - read: src/ (required, for loop)
    Git (6 permissions):
      - bash: git checkout -b, git add, git commit, ...
    State (1 permission):
      - write: .autoresearch/ (required, for loop)"

3. Request ALL permissions at once (batch acquisition)

4. Verify each permission with a dry-run probe:
   - For bash: execute a no-op variant (e.g., `echo "probe"`)
   - For write: verify directory is writable
   - If any probe fails, the manifest is incomplete — fix before proceeding

5. Handle denials gracefully (O1, TN4):
   - Build fallback evaluator registry for all non-LLM constraints
   - If a Bash eval permission is denied:
     → Activate the LLM-based fallback evaluator for that axis
     → Rebalance weights across remaining constraints
   - If a core permission is denied (git, scope writes):
     → STOP — cannot run without core permissions
   - Display adapted constraint set to user

CRITICAL: After this step, no new permission prompts should occur during the loop. If one does, it indicates a manifest gap — log it as a bug.
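
One possible reading of the denial handling above, as a minimal sketch: a denied core permission stops the run, and when an evaluation axis has to be dropped the remaining weights are renormalized so they still sum to 1. The permission labels, the helper names, and the renormalization rule are assumptions for illustration; the skill does not specify how weights are rebalanced.

```python
# Hypothetical labels for the permissions the loop cannot run without.
CORE_PERMISSIONS = {"git", "scope-write"}


def must_stop(denied: set[str]) -> bool:
    """True when any core permission was denied."""
    return bool(denied & CORE_PERMISSIONS)


def rebalance(weights: dict[str, float], dropped: set[str]) -> dict[str, float]:
    """Renormalize the weights of the remaining constraints to sum to 1."""
    remaining = {name: w for name, w in weights.items() if name not in dropped}
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("no constraints left to evaluate")
    return {name: w / total for name, w in remaining.items()}


weights = {"lint": 0.25, "tests": 0.5, "llm-quality": 0.25}
print(rebalance(weights, dropped={"tests"}))
```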

Step 1.6: Scoring Configuration

Configure the phase-adaptive composite scoring:

1. If profile has a `scoring` field, use it. Otherwise use defaults:
   - method: "arithmetic" (early phase: broad improvement)
   - phaseTransitionMethod: "harmonic" (late phase: enforce balance)
   - phaseTransitionPct: 0.4 (switch at 40% of max_iterations)
   - phaseTransitionScoreThreshold: 80 (or switch when all axes > 80)

2. Log the scoring configuration in state.json
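
To make the two aggregation methods concrete, here is a sketch of a phase-adaptive composite. The skill refers to `phaseAdaptiveComposite()` without showing it, so this Python version, its signature, and the choice to treat the iteration exactly at the 40% mark as late phase are assumptions. The harmonic mean is pulled down by the weakest axis, which is why it suits the late "enforce balance" phase.

```python
def weighted_arithmetic(scores: dict[str, float], weights: dict[str, float]) -> float:
    total = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total


def weighted_harmonic(scores: dict[str, float], weights: dict[str, float]) -> float:
    # A zero score on any axis drives the harmonic mean to zero.
    if any(s <= 0 for s in scores.values()):
        return 0.0
    total = sum(weights[k] for k in scores)
    return total / sum(weights[k] / scores[k] for k in scores)


def phase_adaptive_composite(
    scores: dict[str, float],
    weights: dict[str, float],
    iteration: int,
    max_iterations: int,
    transition_pct: float = 0.4,
    score_threshold: float = 80.0,
) -> tuple[float, str]:
    """Return (composite, method). Switch to harmonic once the iteration
    reaches transition_pct of the cap or every axis exceeds the threshold."""
    late = (
        iteration >= transition_pct * max_iterations
        or all(s > score_threshold for s in scores.values())
    )
    if late:
        return weighted_harmonic(scores, weights), "harmonic"
    return weighted_arithmetic(scores, weights), "arithmetic"


scores = {"lint": 72.0, "tests": 85.0, "llm-quality": 65.0}
weights = {"lint": 1.0, "tests": 1.0, "llm-quality": 1.0}
print(phase_adaptive_composite(scores, weights, iteration=0, max_iterations=20))
print(phase_adaptive_composite(scores, weights, iteration=10, max_iterations=20))
```

With equal weights the early-phase composite of these three scores is 74.0; the late-phase harmonic value for the same scores is lower because the 65 on one axis weighs more heavily.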

Phase 2: Baseline Capture

1. Create git branch: autoresearch/<timestamp>-<scope>
   Command: git checkout -b "autoresearch/$(date +%Y%m%dT%H%M%S)-<scope-slug>"

2. Create .autoresearch/ directory for state files

3. Run ALL registered evaluation commands
   - Execute each command with timeout wrapper
   - Normalize each result to 0-100 score
   - Calculate weighted composite using phaseAdaptiveComposite()
     (iteration 0, so arithmetic mean with constraint weights)
   - Track tokens used for baseline phase: addPhaseTokens("baseline", tokens)
   - Record as baseline in state.json

4. Display baseline scores to user:
   "Baseline captured:
    - lint: 72/100
    - tests: 85/100 (pass rate)
    - llm-quality: 65/100
    Composite: 74.0"

5. Write initial state to .autoresearch/state.json
6. Commit: "autoresearch: baseline capture"
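
The branch naming in step 1 can be sketched as follows. The timestamp format mirrors the `date +%Y%m%dT%H%M%S` call above; the slug rule (collapse non-alphanumeric runs to hyphens, lowercase, fall back to `root`) is an assumption, since the skill does not define how `<scope-slug>` is derived.

```python
import re
from datetime import datetime


def scope_slug(scope: str) -> str:
    """Turn a scope path into a branch-safe slug (assumed rule)."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", scope).strip("-").lower()
    return slug or "root"


def branch_name(scope: str, now: datetime) -> str:
    return f"autoresearch/{now.strftime('%Y%m%dT%H%M%S')}-{scope_slug(scope)}"


print(branch_name("src/", datetime(2026, 3, 24, 20, 20, 6)))
# autoresearch/20260324T202006-src
```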

Phase 3: The Loop

FOR EACH ITERATION (until stop condition met):

  ┌─ CHECK STOP CONDITIONS ─────────────────────────────┐
  │ 1. Current iteration >= max_iterations? → STOP      │
  │ 2. Total tokens >= token_budget? → STOP              │
  │ 3. Wall-clock elapsed >= total_timeout? → STOP       │
  │ 4. Plateau counter >= plateau_window (3)? → STOP     │
  └──────────────────────────────────────────────────────┘

  ┌─ IMPROVE ───────────────────────────────────────────┐
  │ Read the files in scope.                             │
  │ Analyze for improvements based on active constraints.│
  │ Apply ONE focused improvement per iteration.         │
  │ Keep changes small and atomic.                       │
  │                                                      │
  │ SCOPE ENFORCEMENT (S2):                              │
  │ - You may READ any file in the project               │
  │ - You may ONLY WRITE to files within the scope       │
  │ - If you identify an out-of-scope improvement,       │
  │   log it as a ScopeProposal, do NOT implement it     │
  └──────────────────────────────────────────────────────┘

  ┌─ EVALUATE ──────────────────────────────────────────┐
  │ Run registered evaluation commands:                  │
  │                                                      │
  │ COMMAND SAFETY (S3, TN5):                            │
  │ - ONLY execute commands from the registered set      │
  │ - Verify hash before each execution                  │
  │ - Wrap each with: timeout <seconds> <command>        │
  │                                                      │
  │ PARALLELISM (T6):                                    │
  │ - Run independent eval commands in parallel (default)│
  │ - If any command fails/flakes, re-run sequentially   │
  │                                                      │
  │ ADAPTIVE LLM EVAL SCHEDULING (T5, TN3):             │
  │ - Compute volatility from recent iteration deltas    │
  │ - High volatility (>2.0): run FULL 4-dimension eval  │
  │ - Low volatility (<0.5): run LITE 1-dimension probe  │
  │   (1/4 token cost, maintains axis coverage)          │
  │ - Minimum guarantee: full eval every 5th iteration   │
  │ - Always full eval on final iteration                │
  │ - Carry forward previous score for skipped dims      │
  │                                                      │
  │ TOKEN TRACKING (T4):                                 │
  │ - Track tokens per eval command: addPhaseTokens()    │
  │ - Track per-constraint token usage per iteration     │
  │                                                      │
  │ COMPOSITE SCORING (T3, T8, TN5):                     │
  │ - Normalize all results to 0-100                     │
  │ - Calculate composite via phaseAdaptiveComposite():  │
  │   First 40% iterations: weighted arithmetic mean     │
  │   After 40% or all axes >80: weighted harmonic mean  │
  │ - Log aggregation method used for this iteration     │
  └──────────────────────────────────────────────────────┘

  ┌─ DECIDE ────────────────────────────────────────────┐
  │ Compare composite to previous iteration:             │
  │                                                      │
  │ IF improved (composite > previous):                  │
  │   → git add + git commit "autoresearch: iteration N" │
  │   → Update best scores                               │
  │   → Reset plateau counter                            │
  │                                                      │
  │ IF not improved (composite <= previous):             │
  │   → git checkout -- . (revert all changes)           │
  │   → Log as "reverted"                                │
  │   → Increment plateau counter                        │
  │                                                      │
  │ CIRCUIT BREAKER (O3, TN4):                           │
  │ IF any metric regressed >10% from best:              │
  │   → Re-run THAT metric sequentially (TN4: retry)     │
  │   → IF regression CONFIRMED on retry:                │
  │     → Revert changes                                 │
  │     → STOP loop with "circuit_breaker" reason        │
  │     → Report the regression details                  │
  │   → IF regression NOT confirmed (was flaky):         │
  │     → Continue normal flow                           │
  └──────────────────────────────────────────────────────┘

  ┌─ TRACK ─────────────────────────────────────────────┐
  │ Update .autoresearch/state.json with:                │
  │ - Iteration scores, delta, status                    │
  │ - Cumulative tokens used + per-phase breakdown (T4)  │
  │ - Per-constraint token usage this iteration          │
  │ - Plateau counter                                    │
  │ - Best scores                                        │
  │ - Aggregation method used (arithmetic/harmonic)      │
  │ - Eval decision (full/lite/skip) and volatility      │
  │                                                      │
  │ Display enhanced progress:                           │
  │ "[autoresearch] Iteration 5/20 | Score: 78 (+2.1)   │
  │  | Method: arithmetic | Eval: lite | Vol: 0.3        │
  │  | Tokens: 45000 (eval: 12k, improve: 33k)          │
  │  | Elapsed: 340s"                                    │
  └──────────────────────────────────────────────────────┘

END FOR
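
The control decisions in the loop above can be sketched as three small functions. Several details are assumptions, labeled in the comments: volatility is passed in as a number because the skill does not define how it is computed, the band between 0.5 and 2.0 is treated as a full eval, "every 5th iteration" is read as "at most four consecutive lite evals", and the 10% circuit-breaker threshold is read as relative to the best score. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoopState:
    previous: float  # composite of the last committed iteration
    best: float
    plateau: int = 0


def choose_eval_mode(volatility: float, iterations_since_full: int, is_final: bool) -> str:
    """Pick 'full' or 'lite' LLM evaluation for this iteration."""
    if is_final or iterations_since_full >= 4:  # minimum-guarantee reading (assumption)
        return "full"
    if volatility < 0.5:
        return "lite"
    return "full"  # covers > 2.0 and, by assumption, the unspecified middle band


def decide(state: LoopState, composite: float) -> str:
    """Commit on strict improvement; otherwise revert and count toward the plateau."""
    if composite > state.previous:
        state.previous = composite
        state.best = max(state.best, composite)
        state.plateau = 0
        return "committed"
    state.plateau += 1
    return "reverted"


def circuit_breaker_tripped(best: float, current: float, max_drop: float = 0.10) -> bool:
    """True when a metric fell more than max_drop (relative) below its best."""
    return best > 0 and (best - current) / best > max_drop


state = LoopState(previous=74.0, best=74.0)
print(decide(state, 76.1), state.plateau)
print(decide(state, 76.1), state.plateau)
```

Because a reverted iteration restores the working tree, `previous` stays at the last committed composite, so three non-improving iterations in a row bring the plateau counter to the stop threshold.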

Phase 4: Report Generation

When the loop stops (for ANY reason):

1. Track reporting phase tokens: addPhaseTokens("reporting", ...)

2. Run FULL LLM evaluation (not sampled) on all changes since baseline
   - This is the learning report (TN7)
   - Include WHY each change improves the code
   - Include what patterns were applied

3. Compute analytics (RT-6, B4):
   - Token consumption dashboard: per-phase breakdown, cost estimate,
     tokens-per-improvement-point efficiency ratio, points-per-dollar
   - Score confidence intervals: compute CI from LLM rubric dimension
     scores across all evaluated iterations (U3)
   - Trajectory analysis: fit diminishing returns curve, predict quality
     ceiling, identify optimal stop iteration, show aggregation phase
     transition point (U4)
   - Effect size: compute Cohen's d between baseline and final scores

4. Generate .autoresearch/report.md using buildProductionReport, containing:
   - Run metadata (scope, duration, iterations, stop reason)
   - Per-constraint improvement table (baseline → final, % change)
   - Iteration history (score, delta, status, method, eval decision)
   - **Token Consumption Dashboard** (U2):
     Per-phase breakdown table, estimated cost, efficiency ratio,
     per-iteration token usage with cumulative tracking
   - **Score Confidence** (U3):
     Per-constraint CI table with mean, 95% bounds, std dev
   - **Trajectory Analysis** (U4):
     ASCII chart with predicted ceiling, optimal stop, decay rate,
     aggregation phase transition marker
   - Convergence analysis (if stopped due to diminishing returns)
   - Out-of-scope proposals (if any)
   - Full learning report from LLM evaluation

5. Display enhanced convergence message (U3):
   "Autoresearch complete:
    - Iterations: 12/20
    - Stop reason: diminishing returns (3 consecutive < 0.5 delta)
    - Total improvement: +8.3 points (74.0 → 82.3)
    - Predicted ceiling: 84.1 (headroom: 1.8 pts)
    - Optimal stop was: iteration 9 (best ROI)
    - Token usage: 120k (eval: 35k, improve: 75k, report: 10k)
    - Estimated cost: $0.84 (9.9 points per dollar)
    - Report: .autoresearch/report.md
    - Branch: autoresearch/20260324T202006-src"

6. Commit the report: "autoresearch: final report"
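
Three of the analytics named in step 3 have standard formulas, sketched below with the standard library only. How the skill collects its samples is not specified, so the inputs here (lists of rubric scores) are an assumption, and the confidence interval uses the normal approximation (1.96 standard errors), which is rough for small samples.

```python
from math import sqrt
from statistics import mean, stdev


def points_per_dollar(baseline: float, final: float, cost_usd: float) -> float:
    """Composite points gained per dollar of estimated cost."""
    return (final - baseline) / cost_usd


def confidence_interval_95(samples: list[float]) -> tuple[float, float]:
    """Normal-approximation 95% CI for the mean; needs at least two samples."""
    m = mean(samples)
    half_width = 1.96 * stdev(samples) / sqrt(len(samples))
    return m - half_width, m + half_width


def cohens_d(before: list[float], after: list[float]) -> float:
    """Cohen's d with a pooled sample standard deviation."""
    n1, n2 = len(before), len(after)
    pooled = sqrt(
        ((n1 - 1) * stdev(before) ** 2 + (n2 - 1) * stdev(after) ** 2) / (n1 + n2 - 2)
    )
    return (mean(after) - mean(before)) / pooled


print(points_per_dollar(70.0, 80.0, 2.0))
print(confidence_interval_95([60.0, 62.0, 64.0]))
print(cohens_d([60.0, 62.0, 64.0], [70.0, 72.0, 74.0]))
```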

Safety Rules (NON-NEGOTIABLE)

- NEVER modify files outside the defined scope (S2)
- NEVER execute commands not in the registered set (S3)
- NEVER force-push, delete branches, or rewrite history (S1)
- NEVER continue after circuit breaker trips (O3)
- NEVER exceed the iteration cap (O1)
- ALWAYS verify command hashes before execution (TN5)
- ALWAYS wrap commands with timeout (RT-5)
- ALWAYS commit before and after changes (T1, B1)

Resume Protocol

When --resume is passed:

1. Read .autoresearch/state.json
2. Verify the autoresearch branch still exists
3. Checkout the branch
4. Display current state (iteration count, scores, budget remaining)
5. Continue the loop from where it left off
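
A sketch of the first and fourth resume steps: loading the state file defensively and working out how much of the iteration budget is left. The schema of state.json is not documented in this skill, so the field names `branch`, `iteration`, and `max_iterations` are hypothetical.

```python
import json
from pathlib import Path

# Hypothetical minimum set of fields needed to resume.
REQUIRED_FIELDS = ("branch", "iteration", "max_iterations")


def load_state(path: Path) -> dict:
    """Read state.json, raising RuntimeError if it is unreadable or incomplete."""
    try:
        state = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        raise RuntimeError(f"cannot resume: {exc}") from exc
    if not isinstance(state, dict):
        raise RuntimeError("cannot resume: state file is not a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in state]
    if missing:
        raise RuntimeError(f"cannot resume: missing fields {missing}")
    return state


def remaining_iterations(state: dict) -> int:
    return max(0, state["max_iterations"] - state["iteration"])
```

A corrupted file surfaces as a single exception type, which lines up with the "State file corrupted" row in the error handling section.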

Error Handling

| Error | Action |
| --- | --- |
| Evaluation command fails | Score 0 for that constraint, continue |
| Evaluation command times out | Score 0, log timeout, continue |
| Git commit fails | Log error, skip iteration, continue |
| All evaluations fail | STOP loop, report error |
| State file corrupted | STOP loop, report last known good state |
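
The evaluation rows of this table can be sketched as a timeout-wrapped runner. This version passes the command as an argument list to `subprocess.run` instead of using the shell `timeout` wrapper mentioned earlier, which is a substitution for portability; the function names and the status strings are hypothetical.

```python
import subprocess


def run_eval(command: list[str], timeout_s: float) -> tuple[bool, str]:
    """Run one evaluation command; report (succeeded, status)."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, "timeout"
    except OSError as exc:  # for example, the executable was not found
        return False, f"error: {exc}"
    if proc.returncode != 0:
        return False, f"exit {proc.returncode}"
    return True, "ok"


def score_or_zero(succeeded: bool, normalized_score: float) -> float:
    """Failed or timed-out evaluations score 0 for their constraint."""
    return normalized_score if succeeded else 0.0


def all_failed(results: dict[str, tuple[bool, str]]) -> bool:
    """True when every evaluation failed, which should stop the loop."""
    return bool(results) and not any(ok for ok, _ in results.values())
```

A single failure only zeroes that constraint's score, while `all_failed` marks the case where the loop should stop and report the error.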
