| name | autoresearch |
| description | Autonomously optimize any Codex skill or agent system by running it repeatedly, scoring outputs against evals (binary for rules + comparative for quality), mutating any owned artifact — the skill's prompt, reference assets, and executable artifacts (scripts, agent/subagent definitions like `agents/openai.yaml`, MCP servers, hooks, harness code) — and keeping improvements. Based on Karpathy's autoresearch methodology. Use this skill whenever the user mentions optimizing a skill, improving a skill or agent, running autoresearch, making a skill or agent better, self-improving a skill, benchmarking a skill, evaluating a skill, running evals on a skill, optimizing an agent system, or any request to iteratively test and refine a skill or agent — even if they don't use the word "autoresearch" explicitly. Also trigger on 스킬 개선, 스킬 최적화, 스킬 벤치마크, 스킬 평가, 에이전트 개선, 에이전트 최적화. Outputs an improved target skill file (and any mutated executable artifacts), a results log, a changelog, and a research log of meaningful direction shifts. |
| metadata | {"short-description":"Eval-driven optimization for Codex skills and pipelines"} |
Autoresearch for Skills
Most skills work about 70% of the time. The remaining 30% is where vague instructions, weak examples, and brittle rules show up. The fix is not "rewrite it from scratch." The fix is to run the skill repeatedly, score the outputs, mutate the skill, and keep only what measurably helps.
This skill adapts Andrej Karpathy's autoresearch methodology to Codex skills and agent systems. Karpathy mutated ML training code; here we mutate every artifact a skill owns: SKILL.md prose, reference assets in references/, and executable artifacts the skill invokes (scripts, agent/subagent definitions like agents/openai.yaml, MCP servers, hooks, harness code).
The Core Job
Take an existing skill, define what good output looks like, then run a loop that:
- Generates outputs from the skill using test inputs
- Scores each output against eval criteria
- Mutates any owned artifact — SKILL.md prose (L1), reference assets in references/ (L2a), executable artifacts the skill invokes such as scripts, agent/subagent definitions, MCP servers, hooks, harness code (L2b), or eval criteria (L3)
- Keeps improvements and discards regressions
- Repeats until a stop condition is reached
Expected outputs for the target skill path:
- improved target skill file
- results.tsv
- changelog.md
- research-log.json
- dashboard.html
Before Starting: Gather the Experiment Contract
Do not block on a perfect spec, but do establish a minimum viable experiment contract.
- Target skill(s) — exact path to the target SKILL.md; for a pipeline, list all skills in order
- Pipeline mode — single skill or multi-skill pipeline; default is single
- Owned executable artifacts — list all scripts, tool implementations, agent definitions (e.g., agents/openai.yaml), MCP servers, hooks, and harness files the skill invokes. Default: in-scope as L2b mutation candidates unless the user excludes them. If the skill is pure prompt + static references, the list is empty and L2b is unused. See references/mutation-guide.md.
- Test inputs — 3-5 prompts or scenarios; if missing, draft a starter set and state the assumption
- Eval criteria — 3-6 binary checks plus 1-2 comparative quality checks where possible
- Runs per experiment — default: 3 for light skills, 1-2 for heavy skills, 5 only when cheap
- Budget cap — default: 5 total experiments in one Codex turn unless the user wants more
- Termination conditions — default: stop at budget cap or after 95%+ binary pass rate for 3 consecutive accepted experiments
- Human review mode — default: review baseline plus the first meaningful keep; use skip only when the user explicitly wants unattended mode
- Execution mode — default: sequential in the current agent; use subagents only if the user explicitly asks for delegation or parallel agent work
- Run harness — define the exact repeatable command or workflow that constitutes "running the skill"
- Versioning mode — default: git-assisted when a clean local git workflow is practical, otherwise file-checkpoint
If the user provides an evals.json, use that instead of drafting items 4-5.
Execution mode rules:
- Use sequential by default
- Do not spawn subagents unless the user explicitly authorized delegation
- Treat "run the skill" as an explicit harness, not a vague conversational attempt
- If fresh runs are unavailable, continue sequentially and note context contamination risk
- Do not assume git is available or appropriate; decide versioning mode explicitly before baseline
Step 1: Read the Target Skill
Before changing anything:
- Read the target skill file at the exact path captured in the experiment contract
- Read linked files in references/ (L2a candidates) and helpers in scripts/ (L2b candidates)
- Read every executable artifact the skill invokes — scripts, tool implementations, agent/subagent definitions, MCP servers, hook scripts (L2b candidates). Use the list captured in the experiment contract item 3 as the starting set; verify by tracing every command and tool call the skill performs
- Identify the core job, workflow, output format, and anti-patterns
- Note any existing quality checks already embedded in the skill
- If the target is a Codex skill and agents/openai.yaml exists, read it and verify the UI metadata still matches the skill — note that agents/openai.yaml is an L2b mutation candidate
Step 2: Design the Eval Suite
Use the eval guidance in references/eval-guide.md.
Three eval types are allowed:
- Binary evals — pass/fail rule compliance
- Comparative evals — win/tie/loss on subjective quality dimensions
- Fidelity evals — pipeline stage consistency in multi-skill mode
Scoring:
- Binary: pass = 1, fail = 0
- Comparative: win = 1, tie = 0.5, loss = 0
- Fidelity: pass = 1, fail = 0
max_score = total assertions x runs per experiment
Use the highest-determinism eval you can. LLM-as-judge is acceptable only when the rubric is explicit enough for repeatable scoring.
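The scoring arithmetic above can be sketched in Python. The result-record shape used here is an assumption for illustration, not a fixed schema:

```python
def score_experiment(results, runs):
    """Aggregate eval results into (score, max_score).

    results: list of (eval_type, outcome) pairs across all runs, where
    outcome is "pass"/"fail" for binary and fidelity evals, and
    "win"/"tie"/"loss" for comparative evals. Illustrative data shape.
    """
    points = {"pass": 1.0, "fail": 0.0, "win": 1.0, "tie": 0.5, "loss": 0.0}
    score = sum(points[outcome] for _, outcome in results)
    assertions_per_run = len(results) // runs  # assumes identical assertions each run
    max_score = assertions_per_run * runs      # = total assertions x runs per experiment
    return score, max_score
```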
Eval Determinism Hierarchy
When designing evals, prefer the highest-determinism check available.
Tier 1 - Deterministic checks
- regex, required section presence, file existence
- JSON/YAML parse success
- character-count or item-count bounds
Tier 2 - Structural validation
- heading hierarchy
- table shape consistency
- code block formatting or schema-level structure
Tier 3 - LLM-as-judge
- tone, usefulness, completeness, quality, or other subjective criteria that cannot be checked programmatically
Target at least 50% of the eval suite to be Tier 1-2. If most evals are Tier 3, the loop becomes too noisy to trust.
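As a sketch, Tier 1 checks are plain code with no LLM involved. The specific rules below (a Summary heading, a three-bullet minimum, a length cap) are hypothetical examples, not required evals:

```python
import json
import re

def tier1_checks(output_text):
    """Run a few Tier 1 (deterministic) binary evals on one skill output."""
    return {
        # required section presence
        "has_summary_heading": bool(re.search(r"^## Summary$", output_text, re.M)),
        # item-count bound: at least three bullet points
        "min_three_bullets": len(re.findall(r"^- ", output_text, re.M)) >= 3,
        # character-count bound
        "under_4000_chars": len(output_text) <= 4000,
    }

def json_block_parses(block_text):
    """Tier 1 check: does an extracted JSON block parse?"""
    try:
        json.loads(block_text)
        return True
    except ValueError:
        return False
```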
Eval Quality Check
Before locking the eval suite, run this 3-question test on each eval:
- Would two different reviewers likely score the same output the same way?
- Could the skill game this check without becoming genuinely better?
- Does this check capture something the user actually cares about?
If any answer is weak, tighten or replace the eval.
Step 3: Define the Run Harness
Each experiment needs a repeatable harness that executes the skill and collects outputs.
Acceptable harnesses:
- A local script or command that runs the workflow end to end
- A bounded manual protocol with fixed prompt, fixed output path, and deterministic artifact capture
- A delegated subagent task only when the user explicitly approved delegation
Before baseline, write the harness into run-harness.md inside the run folder so future experiments are comparable.
Also create the live dashboard before running experiments. Follow references/dashboard-guide.md and create dashboard.html as a self-contained file that is updated by inlining the latest results.json data after each experiment.
If you cannot define a trustworthy harness, stop calling it autoresearch and switch to "skill rewrite + manual review" mode.
See references/execution-guide.md.
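A minimal script harness might look like this sketch; the command template is a placeholder for whatever repeatable command the experiment contract defines:

```python
import subprocess
from pathlib import Path

def run_harness(command_template, prompts, run_dir):
    """Execute a fixed harness command once per test prompt and capture output.

    command_template: a list like ["some-cli", "run", "{prompt}"] — hypothetical;
    substitute the exact command from run-harness.md.
    Writes each result to <run_dir>/<prompt-id>/output.txt so experiments
    stay comparable.
    """
    for prompt_id, prompt in prompts.items():
        out_dir = Path(run_dir) / prompt_id
        out_dir.mkdir(parents=True, exist_ok=True)
        cmd = [part.format(prompt=prompt) for part in command_template]
        result = subprocess.run(cmd, capture_output=True, text=True)
        (out_dir / "output.txt").write_text(result.stdout)
```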
Step 4: Establish Baseline
If autoresearch-[skill-name]/ already exists, skip baseline creation and go to Resuming a Previous Run.
Baseline is experiment #0.
- Create autoresearch-[skill-name]/ and runs/baseline/
- Create results.json, results.tsv, changelog.md, research-log.json, dashboard.html, and run-harness.md
- Back up the original skill as <target-skill-filename>.baseline in the run folder
- Run the skill as-is with the selected test inputs
- Copy all outputs into runs/baseline/<prompt-id>/
- Score every output against every eval
- Record the baseline score
- If versioning mode is git-assisted, create a git branch: autoresearch/[skill-name] (add -N suffix if needed)
- If versioning mode is git-assisted and the run folder should stay untracked, add the autoresearch folder to .gitignore only when that is safe for the repo
- Persist the baseline snapshot:
  - git-assisted: commit the baseline skill files explicitly by path, for example git add <target-skill-path> .gitignore && git commit -m "autoresearch: baseline ([score]/[max])"
  - file-checkpoint: record baseline hashes in run-harness.md and keep the copied .baseline file as the restore source
After baseline, choose one mode explicitly:
- interactive: report baseline and wait for approval
- unattended: continue until budget or stop condition is hit
Default is interactive unless the user explicitly requested unattended looping.
Step 5: Human Review Phase
Skip this only when the user explicitly set human review mode to skip.
In Codex, human review is usually bounded:
- review baseline
- review the first meaningful keep
- expand to 2-3 reviewed experiments only when the skill has strong subjective quality dimensions
For each reviewed experiment:
- Analyze failures and form one clear hypothesis
- Make one bounded change
- Commit only the files mutated in that experiment
- Run the experiment and score it
- Present before/after score plus 2-3 representative outputs
- Ask whether the direction feels right or whether the evals miss something
- Log human insight as [HUMAN INSIGHT] in changelog.md
- Keep or discard using the rollback rules from step 6
- Mark the result as human-reviewed
Only switch to unattended mode if the user explicitly approves it.
Step 6: Run the Mutation Loop
Autonomy in Codex is batch-based, not open-ended.
Run unattended only when all of these are true:
- the user explicitly asked for unattended auto mode
- the run harness is reliable
- rollback is safe for touched files
- budget cap and stop conditions are written down
Otherwise run a bounded batch and report results.
Loop steps:
- Analyze failing evals and inspect real failing outputs
- Form one mutation hypothesis at the right level (L1 prompt rules / L2a reference assets / L2b executable artifacts the skill invokes / L3 eval calibration — see references/mutation-guide.md). If the failure stems from execution (wrong format, build error, tool returns garbage, subagent malformed result), choose L2b directly — do not paper over execution bugs with more prompt rules.
- Checkpoint only the files you plan to touch
- Make the bounded change
  - L2b mutations explicitly include changes to agents/openai.yaml, helper scripts in scripts/, MCP server code, hook scripts, and any subagent definitions. If a Codex skill's user-facing purpose changes, update agents/openai.yaml too.
- Persist the mutation:
  - git-assisted: git add <mutated-files> && git commit -m "autoresearch: [description]"
  - file-checkpoint: save explicit pre-mutation copies or hashes for each touched file inside the run folder before evaluation
- Run the experiment and save every produced artifact under runs/exp-N/<prompt-id>/
- Score the outputs
- Decide KEEP or DISCARD
- Log the result
- If this changed direction meaningfully, record it in research-log.json
- Repeat until the batch budget is exhausted or a stop condition is hit
Mutation Safety Rules
- Each mutation is checkpointed before evaluation
- KEEP -> the accepted version becomes the new baseline
- DISCARD -> use a non-destructive rollback that matches the selected versioning mode:
  - git-assisted: git reset --soft HEAD~1, then restore only the checkpointed files to their pre-experiment contents by explicit path
  - file-checkpoint: restore only the touched files from the saved pre-mutation copies or the latest accepted baseline copies
- Never use git reset --hard or any broad revert that can destroy unrelated user changes
- If the repo was already dirty, record which files were pre-modified and exclude unrelated work from rollback
- If git is unavailable, unwanted, or unsafe for the current repo, stay in file-checkpoint mode for the whole run
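A minimal file-checkpoint sketch, assuming the touched files have unique basenames (a simplification; a real run folder might key copies by full path):

```python
import hashlib
import shutil
from pathlib import Path

def checkpoint(files, checkpoint_dir):
    """Save pre-mutation copies and hashes for exactly the files to be touched."""
    ckpt = Path(checkpoint_dir)
    ckpt.mkdir(parents=True, exist_ok=True)
    hashes = {}
    for f in map(Path, files):
        shutil.copy2(f, ckpt / f.name)  # assumes unique basenames across files
        hashes[str(f)] = hashlib.sha256(f.read_bytes()).hexdigest()
    return hashes

def rollback(files, checkpoint_dir):
    """Restore only the checkpointed files; unrelated work is never touched."""
    for f in map(Path, files):
        shutil.copy2(Path(checkpoint_dir) / f.name, f)
```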
KEEP vs DISCARD Rules
Record skill_lines for every experiment with wc -l <target-skill-path>.
Use these defaults unless the user defined a stricter rule:
| Score change | Prompt size change | Default decision |
|---|---|---|
| Improved meaningfully | Any reasonable increase | KEEP |
| Improved marginally | Large prompt increase | DISCARD unless the gain fixes an important failure |
| Flat | Shorter or simpler prompt | KEEP |
| Flat | Longer or more complex prompt | DISCARD |
| Worse | Any | DISCARD |
Additional guardrails:
- If an eval that previously passed now fails, treat that as a regression and strongly prefer DISCARD even if the total score rose
- When two versions score the same, prefer the shorter and simpler one
- If the mutation changes user-facing behavior, compare representative outputs before keeping it
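The decision table above can be sketched as a function. The "meaningful" and "large increase" thresholds here are illustrative defaults, not fixed values:

```python
def decide(score_delta, line_delta, regressed_evals, meaningful=0.05):
    """Apply the default KEEP/DISCARD table (thresholds are illustrative).

    score_delta: score change vs the current baseline, as a fraction of max
    line_delta: change in skill line count from wc -l
    regressed_evals: names of evals that previously passed and now fail
    """
    if regressed_evals:
        return "DISCARD"  # a prior pass now fails: strongly prefer discard
    if score_delta < 0:
        return "DISCARD"  # worse, any size: discard
    if score_delta >= meaningful:
        return "KEEP"     # meaningful improvement, any reasonable increase
    if score_delta == 0:
        # flat: keep only if the prompt got shorter or simpler
        return "KEEP" if line_delta <= 0 else "DISCARD"
    # marginal improvement: discard on a large prompt increase
    return "KEEP" if line_delta <= 10 else "DISCARD"
```

The "gain fixes an important failure" override from the table stays a human call; the function encodes only the mechanical defaults.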
Deletion Experiments
Every 5th experiment, try a deletion mutation.
- Remove recently added rules that may not contribute to score
- If the score holds, keep the deletion
- If the target skill file grows past 200% of baseline size, record a bloat warning
Stop Conditions
Stop when any of these is true:
- budget cap reached
- 95%+ binary pass rate sustained for 3 consecutive accepted experiments
- user stops the run
- system resource or time limit reached
- the run harness is no longer trustworthy
Running out of ideas is not a reason to stop; change mutation level or revisit eval design.
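Two of these conditions are mechanically checkable. A sketch, assuming each experiment record carries a kept flag and a binary pass rate (an illustrative schema):

```python
def should_stop(experiments, budget_cap):
    """Check the budget cap and the 95%+ binary pass rate sustained for
    3 consecutive accepted experiments.

    experiments: list of dicts with "kept" (bool) and "binary_pass_rate" (0-1).
    """
    if len(experiments) >= budget_cap:
        return True
    accepted = [e for e in experiments if e["kept"]]
    last3 = accepted[-3:]
    return len(last3) == 3 and all(e["binary_pass_rate"] >= 0.95 for e in last3)
```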
Step 7: Logging Rules
Use references/logging-guide.md.
At minimum:
- results.tsv records experiment number, score, keep/discard, and concise rationale
- changelog.md records each mutation, per-eval movement, and human insights
- research-log.json stores direction shifts only, not every micro edit
- dashboard.html should visualize baseline-to-current progress without external runtime dependencies when possible
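Appending a results.tsv row can be as simple as this sketch; the exact column set is whatever the logging guide defines, and these four are the minimum listed above:

```python
import csv

def log_result(path, exp_num, score, max_score, decision, rationale):
    """Append one experiment row to results.tsv."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow([exp_num, f"{score}/{max_score}", decision, rationale])
```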
Step 8: Report Back
When the loop pauses or ends, present:
- Score summary: baseline -> final
- Total experiments tried
- Keep rate
- Top 3 helpful changes
- Human insights incorporated
- Remaining failure patterns
- Direction shifts
- Prompt size change
- Output file locations
- Accepted git history
Step 9: Next Steps
Autoresearch is a continuous improvement system.
- 1 week later: validate against real-world usage; if the skill still feels wrong, the evals are wrong
- When upgrading the model: resume from prior logs instead of starting blind
- When changing structure heavily: archive the old run and establish a new baseline
- Monthly: review discard patterns, deletion results, and stalled score trends
Step 10: False Positive Tracking
If eval scores are high but actual output quality is low, you have a false positive problem.
- collect 10+ real outputs
- review where evals failed to capture quality
- tighten the eval suite
- restart from a clean baseline if necessary
Output Structure
```
autoresearch-[skill-name]/
├── results.json
├── results.tsv
├── dashboard.html
├── changelog.md
├── research-log.json
├── <target-skill-filename>.baseline
├── run-harness.md
└── runs/
    ├── baseline/
    ├── exp-1/
    └── ...
```
Resuming a Previous Run
If autoresearch-[skill-name]/ already exists:
- Read changelog.md and research-log.json
- Read results.json to find the best score and next experiment number
- Read <target-skill-filename>.baseline
- Reconstruct the prior experiment contract from run-harness.md, including target path, eval suite, versioning mode, and termination settings
- Compare the current target files against the last accepted baseline using hashes, git history, or explicit file diff
- If the target skill, eval contract, or harness changed materially, do not resume blindly:
  - either archive the old run and start a fresh baseline
  - or document exactly why the old baseline is still comparable
- If versioning mode is git-assisted, re-checkout the autoresearch branch if needed
- Re-validate the run harness before resuming
- Continue from the next experiment number
If the model changed, read the prior research log and avoid re-running obviously poor directions.
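Reading the resume state can be sketched as follows, assuming results.json holds a list of per-experiment records with exp, score, and kept fields (an illustrative schema, not a fixed one):

```python
import json
from pathlib import Path

def resume_state(run_dir):
    """Return (best accepted score, next experiment number) from results.json."""
    records = json.loads((Path(run_dir) / "results.json").read_text())
    kept = [r for r in records if r["kept"]]
    best = max((r["score"] for r in kept), default=None)
    next_exp = max((r["exp"] for r in records), default=-1) + 1
    return best, next_exp
```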
Limitations
| Limitation | Mitigation |
|---|---|
| Evals can check structure better than true quality | Human review plus false-positive tracking |
| Strict evals can suppress creativity | Keep only core rules binary; use comparative evals for quality |
| AI can game evals | Write principle-level assertions, not brittle micro-rules |
| Sequential runs can leak context | Log contamination risk and prefer fresh runs when available |
| Cost grows quickly | Control runs-per-experiment and budget cap |
| Overfitting to prompts | Use diverse prompts and rotate periodically |
The Test
A good autoresearch run:
- established a real harness
- created a baseline before mutating
- used evals that produce stable scores
- saved artifacts per experiment
- improved score without hiding regressions
- used safe git ratcheting
- kept the skill simpler when possible
- recorded direction shifts
- improved actual output quality, not just compliance
If the skill passes evals but still feels worse in practice, the evals are the problem. Fix them and rerun.