| name | autoresearch |
| description | Autonomously optimize any Codex skill or agent system by running it repeatedly, scoring outputs against evals (binary for rules + comparative for quality), mutating any owned artifact — the skill's prompt, reference assets, and executable artifacts (scripts, agent/subagent definitions like `agents/openai.yaml`, MCP servers, hooks, harness code) — and keeping improvements. Based on Karpathy's autoresearch methodology. Use this skill whenever the user mentions optimizing a skill, improving a skill or agent, running autoresearch, making a skill or agent better, self-improving a skill, benchmarking a skill, evaluating a skill, running evals on a skill, optimizing an agent system, or any request to iteratively test and refine a skill or agent — even if they don't use the word "autoresearch" explicitly. Also trigger on 스킬 개선, 스킬 최적화, 스킬 벤치마크, 스킬 평가, 에이전트 개선, 에이전트 최적화. Outputs an improved target skill file (and any mutated executable artifacts), a results log, a changelog, and a research log of meaningful direction shifts. |
| metadata | {"short-description":"Eval-driven optimization for Codex skills and pipelines"} |
Autoresearch for Skills
Most skills work about 70% of the time. The remaining 30% is where vague instructions, weak examples, and brittle rules show up. The fix is not "rewrite it from scratch." The fix is to run the skill repeatedly, score the outputs, mutate the skill, and keep only what measurably helps.
This skill adapts Andrej Karpathy's autoresearch methodology to Codex skills and agent systems. Karpathy mutated ML training code; here we mutate every artifact a skill owns: SKILL.md prose, reference assets in references/, and executable artifacts the skill invokes (scripts, agent/subagent definitions like agents/openai.yaml, MCP servers, hooks, harness code).
The Core Job
Take an existing skill, define what good output looks like, then run a loop that:
- Generates outputs from the skill using test inputs
- Scores each output against eval criteria
- Mutates any owned artifact — SKILL.md prose (L1), reference assets in references/ (L2a), executable artifacts the skill invokes such as scripts, agent/subagent definitions, MCP servers, hooks, harness code (L2b), or eval criteria (L3)
- Keeps improvements and discards regressions
- Repeats until a stop condition is reached
Expected outputs for the target skill path:
- improved target skill file
- results.tsv
- changelog.md
- research-log.json
- dashboard.html
Before Starting: Gather the Experiment Contract
Do not block on a perfect spec, but do establish a minimum viable experiment contract.
- Target skill(s) — exact path to the target SKILL.md; for a pipeline, list all skills in order
- Pipeline mode — single skill or multi-skill pipeline; default is single
- Owned executable artifacts — list all scripts, tool implementations, agent definitions (e.g., agents/openai.yaml), MCP servers, hooks, and harness files the skill invokes. Default: in-scope as L2b mutation candidates unless the user excludes them. If the skill is pure prompt + static references, the list is empty and L2b is unused. See references/mutation-guide.md.
- Test inputs — 3-5 prompts or scenarios; if missing, draft a starter set and state the assumption
- Eval criteria — 3-6 binary checks plus 1-2 comparative quality checks where possible
- Runs per experiment — default: 3 for light skills, 1-2 for heavy skills, 5 only when cheap
- Budget cap — default: 5 total experiments in one Codex turn unless the user wants more
- Termination conditions — default: stop at budget cap or after 95%+ binary pass rate for 3 consecutive accepted experiments
- Human review mode — default: review baseline plus the first meaningful keep; use skip only when the user explicitly wants unattended mode
- Execution mode — default: sequential in the current agent; use subagents only if the user explicitly asks for delegation or parallel agent work
- Run harness — define the exact repeatable command or workflow that constitutes "running the skill"
- Versioning mode — default: git-assisted when a clean local git workflow is practical, otherwise file-checkpoint
If the user provides an evals.json, use that instead of drafting items 4-5.
Execution mode rules:
- Use sequential by default
- Do not spawn subagents unless the user explicitly authorized delegation
- Treat "run the skill" as an explicit harness, not a vague conversational attempt
- If fresh runs are unavailable, continue sequentially and note context contamination risk
- Do not assume git is available or appropriate; decide versioning mode explicitly before baseline
Step 1: Read the Target Skill
Before changing anything:
- Read the target skill file at the exact path captured in the experiment contract
- Read linked files in references/ (L2a candidates) and helpers in scripts/ (L2b candidates)
- Read every executable artifact the skill invokes — scripts, tool implementations, agent/subagent definitions, MCP servers, hook scripts (L2b candidates). Use the list captured in the experiment contract item 3 as the starting set; verify by tracing every command and tool call the skill performs
- Identify the core job, workflow, output format, and anti-patterns
- Note any existing quality checks already embedded in the skill
- If the target is a Codex skill and agents/openai.yaml exists, read it and verify the UI metadata still matches the skill — note that agents/openai.yaml is an L2b mutation candidate
Step 2: Design the Eval Suite
Use the eval guidance in references/eval-guide.md.
Three eval types are allowed:
- Binary evals — pass/fail rule compliance
- Comparative evals — win/tie/loss on subjective quality dimensions
- Fidelity evals — pipeline stage consistency in multi-skill mode
Scoring:
- Binary: pass = 1, fail = 0
- Comparative: win = 1, tie = 0.5, loss = 0
- Fidelity: pass = 1, fail = 0
max_score = total assertions x runs per experiment
Use the highest-determinism eval you can. LLM-as-judge is acceptable only when the rubric is explicit enough for repeatable scoring.
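The scoring arithmetic above can be sketched in Python. The result-record shape used here is an assumption for illustration, not a fixed schema:

```python
def score_experiment(results, runs):
    """Aggregate eval results into (score, max_score).

    results: list of (eval_type, outcome) pairs across all runs, where
    outcome is "pass"/"fail" for binary and fidelity evals, and
    "win"/"tie"/"loss" for comparative evals. Illustrative data shape.
    """
    points = {"pass": 1.0, "fail": 0.0, "win": 1.0, "tie": 0.5, "loss": 0.0}
    score = sum(points[outcome] for _, outcome in results)
    assertions_per_run = len(results) // runs  # assumes identical assertions each run
    max_score = assertions_per_run * runs      # = total assertions x runs per experiment
    return score, max_score
```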
Eval Determinism Hierarchy
When designing evals, prefer the highest-determinism check available.
Tier 1 - Deterministic checks
- regex, required section presence, file existence
- JSON/YAML parse success
- character-count or item-count bounds
Tier 2 - Structural validation
- heading hierarchy
- table shape consistency
- code block formatting or schema-level structure
Tier 3 - LLM-as-judge
- tone, usefulness, completeness, quality, or other subjective criteria that cannot be checked programmatically
Target at least 50% of the eval suite to be Tier 1-2. If most evals are Tier 3, the loop becomes too noisy to trust.
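As a sketch, Tier 1 checks are plain code with no LLM involved. The specific rules below (a Summary heading, a three-bullet minimum, a length cap) are hypothetical examples, not required evals:

```python
import json
import re

def tier1_checks(output_text):
    """Run a few Tier 1 (deterministic) binary evals on one skill output."""
    return {
        # required section presence
        "has_summary_heading": bool(re.search(r"^## Summary$", output_text, re.M)),
        # item-count bound: at least three bullet points
        "min_three_bullets": len(re.findall(r"^- ", output_text, re.M)) >= 3,
        # character-count bound
        "under_4000_chars": len(output_text) <= 4000,
    }

def json_block_parses(block_text):
    """Tier 1 check: does an extracted JSON block parse?"""
    try:
        json.loads(block_text)
        return True
    except ValueError:
        return False
```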
Eval Quality Check
Before locking the eval suite, run this 3-question test on each eval:
- Would two different reviewers likely score the same output the same way?
- Could the skill game this check without becoming genuinely better?
- Does this check capture something the user actually cares about?
If any answer is weak, tighten or replace the eval.
Step 3: Define the Run Harness
Each experiment needs a repeatable harness that executes the skill and collects outputs.
Acceptable harnesses:
- A local script or command that runs the workflow end to end
- A bounded manual protocol with fixed prompt, fixed output path, and deterministic artifact capture
- A delegated subagent task only when the user explicitly approved delegation
Before baseline, write the harness into run-harness.md inside the run folder so future experiments are comparable.
Also create the live dashboard before running experiments. Follow references/dashboard-guide.md and create dashboard.html as a self-contained file that is updated by inlining the latest results.json data after each experiment.
If you cannot define a trustworthy harness, stop calling it autoresearch and switch to "skill rewrite + manual review" mode.
See references/execution-guide.md.
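A minimal script harness might look like this sketch; the command template is a placeholder for whatever repeatable command the experiment contract defines:

```python
import subprocess
from pathlib import Path

def run_harness(command_template, prompts, run_dir):
    """Execute a fixed harness command once per test prompt and capture output.

    command_template: a list like ["some-cli", "run", "{prompt}"] — hypothetical;
    substitute the exact command from run-harness.md.
    Writes each result to <run_dir>/<prompt-id>/output.txt so experiments
    stay comparable.
    """
    for prompt_id, prompt in prompts.items():
        out_dir = Path(run_dir) / prompt_id
        out_dir.mkdir(parents=True, exist_ok=True)
        cmd = [part.format(prompt=prompt) for part in command_template]
        result = subprocess.run(cmd, capture_output=True, text=True)
        (out_dir / "output.txt").write_text(result.stdout)
```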
Step 4: Establish Baseline
If autoresearch-[skill-name]/ already exists, skip baseline creation and go to Resuming a Previous Run.
Baseline is experiment #0.
- Create autoresearch-[skill-name]/ and runs/baseline/
- Create results.json, results.tsv, changelog.md, research-log.json, dashboard.html, and run-harness.md
- Back up the original skill as <target-skill-filename>.baseline in the run folder
- Run the skill as-is with the selected test inputs
- Copy all outputs into runs/baseline/<prompt-id>/
- Score every output against every eval
- Record the baseline score
- If versioning mode is git-assisted, create a git branch: autoresearch/[skill-name] (add -N suffix if needed)
- If versioning mode is git-assisted and the run folder should stay untracked, add the autoresearch folder to .gitignore only when that is safe for the repo
- Persist the baseline snapshot:
  - git-assisted: commit the baseline skill files explicitly by path, for example git add <target-skill-path> .gitignore && git commit -m "autoresearch: baseline ([score]/[max])"
  - file-checkpoint: record baseline hashes in run-harness.md and keep the copied .baseline file as the restore source
After baseline, choose one mode explicitly:
- interactive: report baseline and wait for approval
- unattended: continue until budget or stop condition is hit
Default is interactive unless the user explicitly requested unattended looping.
Step 5: Human Review Phase
Skip this only when the user explicitly set human review mode to skip.
In Codex, human review is usually bounded:
- review baseline
- review the first meaningful keep
- expand to 2-3 reviewed experiments only when the skill has strong subjective quality dimensions
For each reviewed experiment:
- Analyze failures and form one clear hypothesis
- Make one bounded change
- Commit only the files mutated in that experiment
- Run the experiment and score it
- Present before/after score plus 2-3 representative outputs
- Ask whether the direction feels right or whether the evals miss something
- Log human insight as [HUMAN INSIGHT] in changelog.md
- Keep or discard using the rollback rules from step 6
- Mark the result as human-reviewed
Only switch to unattended mode if the user explicitly approves it.
Step 6: Run the Mutation Loop
Autonomy in Codex is batch-based, not open-ended.
Run unattended only when all of these are true:
- the user explicitly asked for unattended auto mode
- the run harness is reliable
- rollback is safe for touched files
- budget cap and stop conditions are written down
Otherwise run a bounded batch and report results.
Loop steps:
- Analyze failing evals and inspect real failing outputs
- Form one mutation hypothesis at the right level (L1 prompt rules / L2a reference assets / L2b executable artifacts the skill invokes / L3 eval calibration — see references/mutation-guide.md). If the failure stems from execution (wrong format, build error, tool returns garbage, subagent malformed result), choose L2b directly — do not paper over execution bugs with more prompt rules.
- Checkpoint only the files you plan to touch
- Make the bounded change
  - L2b mutations explicitly include changes to agents/openai.yaml, helper scripts in scripts/, MCP server code, hook scripts, and any subagent definitions. If a Codex skill's user-facing purpose changes, update agents/openai.yaml too.
- Persist the mutation:
  - git-assisted: git add <mutated-files> && git commit -m "autoresearch: [description]"
  - file-checkpoint: save explicit pre-mutation copies or hashes for each touched file inside the run folder before evaluation
- Run the experiment and save every produced artifact under runs/exp-N/<prompt-id>/
- Score the outputs
- Decide KEEP or DISCARD
- Log the result
- If this changed direction meaningfully, record it in research-log.json
- Repeat until the batch budget is exhausted or a stop condition is hit
Mutation Safety Rules
- Each mutation is checkpointed before evaluation
- KEEP -> the accepted version becomes the new baseline
- DISCARD -> use a non-destructive rollback that matches the selected versioning mode:
  - git-assisted: git reset --soft HEAD~1, then restore only the checkpointed files to their pre-experiment contents by explicit path
  - file-checkpoint: restore only the touched files from the saved pre-mutation copies or the latest accepted baseline copies
- Never use git reset --hard or any broad revert that can destroy unrelated user changes
- If the repo was already dirty, record which files were pre-modified and exclude unrelated work from rollback
- If git is unavailable, unwanted, or unsafe for the current repo, stay in file-checkpoint mode for the whole run
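A minimal file-checkpoint sketch, assuming the touched files have unique basenames (a simplification; a real run folder might key copies by full path):

```python
import hashlib
import shutil
from pathlib import Path

def checkpoint(files, checkpoint_dir):
    """Save pre-mutation copies and hashes for exactly the files to be touched."""
    ckpt = Path(checkpoint_dir)
    ckpt.mkdir(parents=True, exist_ok=True)
    hashes = {}
    for f in map(Path, files):
        shutil.copy2(f, ckpt / f.name)  # assumes unique basenames across files
        hashes[str(f)] = hashlib.sha256(f.read_bytes()).hexdigest()
    return hashes

def rollback(files, checkpoint_dir):
    """Restore only the checkpointed files; unrelated work is never touched."""
    for f in map(Path, files):
        shutil.copy2(Path(checkpoint_dir) / f.name, f)
```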
KEEP vs DISCARD Rules
Record skill_lines for every experiment with wc -l <target-skill-path>.
Use these defaults unless the user defined a stricter rule:
| Score change | Prompt size change | Default decision |
|---|---|---|
| Improved meaningfully | Any reasonable increase | KEEP |
| Improved marginally | Large prompt increase | DISCARD unless the gain fixes an important failure |
| Flat | Shorter or simpler prompt | KEEP |
| Flat | Longer or more complex prompt | DISCARD |
| Worse | Any | DISCARD |
Additional guardrails:
- If an eval that previously passed now fails, treat that as a regression and strongly prefer DISCARD even if the total score rose
- When two versions score the same, prefer the shorter and simpler one
- If the mutation changes user-facing behavior, compare representative outputs before keeping it
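The decision table above can be sketched as a function. The "meaningful" and "large increase" thresholds here are illustrative defaults, not fixed values:

```python
def decide(score_delta, line_delta, regressed_evals, meaningful=0.05):
    """Apply the default KEEP/DISCARD table (thresholds are illustrative).

    score_delta: score change vs the current baseline, as a fraction of max
    line_delta: change in skill line count from wc -l
    regressed_evals: names of evals that previously passed and now fail
    """
    if regressed_evals:
        return "DISCARD"  # a prior pass now fails: strongly prefer discard
    if score_delta < 0:
        return "DISCARD"  # worse, any size: discard
    if score_delta >= meaningful:
        return "KEEP"     # meaningful improvement, any reasonable increase
    if score_delta == 0:
        # flat: keep only if the prompt got shorter or simpler
        return "KEEP" if line_delta <= 0 else "DISCARD"
    # marginal improvement: discard on a large prompt increase
    return "KEEP" if line_delta <= 10 else "DISCARD"
```

The "gain fixes an important failure" override from the table stays a human call; the function encodes only the mechanical defaults.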
Deletion Experiments
Every 5th experiment, try a deletion mutation.
- Remove recently added rules that may not contribute to score
- If the score holds, keep the deletion
- If the target skill file grows past 200% of baseline size, record a bloat warning
Stop Conditions
Stop when any of these is true:
- budget cap reached
- 95%+ binary pass rate sustained for 3 consecutive accepted experiments
- user stops the run
- system resource or time limit reached
- the run harness is no longer trustworthy
Running out of ideas is not a reason to stop; change mutation level or revisit eval design.
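Two of these conditions are mechanically checkable. A sketch, assuming each experiment record carries a kept flag and a binary pass rate (an illustrative schema):

```python
def should_stop(experiments, budget_cap):
    """Check the budget cap and the 95%+ binary pass rate sustained for
    3 consecutive accepted experiments.

    experiments: list of dicts with "kept" (bool) and "binary_pass_rate" (0-1).
    """
    if len(experiments) >= budget_cap:
        return True
    accepted = [e for e in experiments if e["kept"]]
    last3 = accepted[-3:]
    return len(last3) == 3 and all(e["binary_pass_rate"] >= 0.95 for e in last3)
```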
Step 7: Logging Rules
Use references/logging-guide.md.
At minimum:
- results.tsv records experiment number, score, keep/discard, and concise rationale
- changelog.md records each mutation, per-eval movement, and human insights
- research-log.json stores direction shifts only, not every micro edit
- dashboard.html should visualize baseline-to-current progress without external runtime dependencies when possible
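Appending a results.tsv row can be as simple as this sketch; the exact column set is whatever the logging guide defines, and these four are the minimum listed above:

```python
import csv

def log_result(path, exp_num, score, max_score, decision, rationale):
    """Append one experiment row to results.tsv."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow([exp_num, f"{score}/{max_score}", decision, rationale])
```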
Step 8: Report Back
When the loop pauses or ends, present:
- Score summary: baseline -> final
- Total experiments tried
- Keep rate
- Top 3 helpful changes
- Human insights incorporated
- Remaining failure patterns
- Direction shifts
- Prompt size change
- Output file locations
- Accepted git history
Step 9: Next Steps
Autoresearch is a continuous improvement system.
- 1 week later: validate against real-world usage; if the skill still feels wrong, the evals are wrong
- When upgrading the model: resume from prior logs instead of starting blind
- When changing structure heavily: archive the old run and establish a new baseline
- Monthly: review discard patterns, deletion results, and stalled score trends
Step 10: False Positive Tracking
If eval scores are high but actual output quality is low, you have a false positive problem.
- collect 10+ real outputs
- review where evals failed to capture quality
- tighten the eval suite
- restart from a clean baseline if necessary
Output Structure
```
autoresearch-[skill-name]/
├── results.json
├── results.tsv
├── dashboard.html
├── changelog.md
├── research-log.json
├── <target-skill-filename>.baseline
├── run-harness.md
└── runs/
    ├── baseline/
    ├── exp-1/
    └── ...
```
Resuming a Previous Run
If autoresearch-[skill-name]/ already exists:
- Read changelog.md and research-log.json
- Read results.json to find the best score and next experiment number
- Read <target-skill-filename>.baseline
- Reconstruct the prior experiment contract from run-harness.md, including target path, eval suite, versioning mode, and termination settings
- Compare the current target files against the last accepted baseline using hashes, git history, or explicit file diff
- If the target skill, eval contract, or harness changed materially, do not resume blindly:
  - either archive the old run and start a fresh baseline
  - or document exactly why the old baseline is still comparable
- If versioning mode is git-assisted, re-checkout the autoresearch branch if needed
- Re-validate the run harness before resuming
- Continue from the next experiment number
If the model changed, read the prior research log and avoid re-running obviously poor directions.
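Reading the resume state can be sketched as follows, assuming results.json holds a list of per-experiment records with exp, score, and kept fields (an illustrative schema, not a fixed one):

```python
import json
from pathlib import Path

def resume_state(run_dir):
    """Return (best accepted score, next experiment number) from results.json."""
    records = json.loads((Path(run_dir) / "results.json").read_text())
    kept = [r for r in records if r["kept"]]
    best = max((r["score"] for r in kept), default=None)
    next_exp = max((r["exp"] for r in records), default=-1) + 1
    return best, next_exp
```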
Limitations
| Limitation | Mitigation |
|---|---|
| Evals can check structure better than true quality | Human review plus false-positive tracking |
| Strict evals can suppress creativity | Keep only core rules binary; use comparative evals for quality |
| AI can game evals | Write principle-level assertions, not brittle micro-rules |
| Sequential runs can leak context | Log contamination risk and prefer fresh runs when available |
| Cost grows quickly | Control runs-per-experiment and budget cap |
| Overfitting to prompts | Use diverse prompts and rotate periodically |
The Test
A good autoresearch run:
- established a real harness
- created a baseline before mutating
- used evals that produce stable scores
- saved artifacts per experiment
- improved score without hiding regressions
- used safe git ratcheting
- kept the skill simpler when possible
- recorded direction shifts
- improved actual output quality, not just compliance
If the skill passes evals but still feels worse in practice, the evals are the problem. Fix them and rerun.