| name | Optimize |
| description | Autonomous optimization loop — hill-climb any target. Code with metrics, or skills/prompts/agents with LLM-as-judge eval. USE WHEN optimize, autoresearch, hill climb, improve metric, reduce latency, improve performance, benchmark optimization, bundle size, page speed, autonomous improvement loop, optimize skill, optimize prompt, improve quality, eval mode. |
| disable-model-invocation | true |
| effort | medium |
/optimize — Autonomous Optimization v2
Run an autonomous optimization loop against any target. Two modes:
- Metric mode — code targets with a shell command that produces a number (the original)
- Eval mode — skills, prompts, agents, or any text target judged by LLM-as-judge binary evals
The agent modifies the target, measures the result, keeps improvements, discards failures, and repeats.
Inspired by Karpathy's autoresearch and extended with LLM-as-judge evaluation.
Invocation
Metric Mode (code targets)
/optimize --metric "lighthouse_score" --higher-is-better \
--measure "npx lighthouse http://localhost:3000 --output=json" \
--extract "jq '.categories.performance.score * 100' lighthouse.json" \
--files "src/**/*.tsx,src/**/*.css" \
--budget 120
/optimize --resume # Resume a previous optimization loop
/optimize --status # Show results summary from last/current run
Eval Mode (skill/prompt/agent targets)
/optimize --target "~/.claude/skills/ExtractWisdom"
/optimize --target "~/.claude/skills/Research/Workflows/QuickResearch.md"
/optimize --target "prompts/my-prompt.md"
/optimize --target "~/.claude/skills/ExtractWisdom" --max-experiments 20
In eval mode, the system automatically:
- Detects the target type (skill, prompt, agent, code, function)
- Reads the target to understand its purpose and constraints
- Generates 3-6 binary eval criteria and 3-5 test inputs
- Presents criteria + inputs for your approval before starting
- Runs the optimization loop using LLM-as-judge scoring
- Presents a recommendation (apply/reject/partial) when done
What Happens
This skill triggers the PAI Algorithm in mode: optimize:
- OBSERVE — Define or auto-detect the target, set eval_mode
- THINK — Analyze codebase/skill, generate hypothesis queue
- PLAN — Prioritize hypotheses by expected impact
- BUILD — Phase 0: TARGET ANALYSIS (see
optimize-loop.md)
- Detect target type, auto-generate eval criteria (eval mode), set up sandbox, baseline
- EXECUTE — The autonomous loop (
optimize-loop.md):
- Hypothesize → Modify target → Measure (metric or eval) → Keep/Revert → Repeat
- Metric mode: ~12 experiments/hour (at 5-min budget)
- Eval mode: ~6-8 experiments/hour (multi-run judging is slower)
- VERIFY — Phase 9: RECOMMEND — diff, summary, apply/reject/partial options
- LEARN — Phase 10: EXTRACT LEARNINGS — what worked, what didn't, structured insights
Arguments — Metric Mode
| Argument | Required | Default | Description |
|---|
--metric NAME | yes | | Human-readable metric name |
--measure COMMAND | yes | | Shell command that produces the metric |
--files GLOB | yes | | Files the agent may modify (comma-separated) |
--higher-is-better | | (default) | Higher metric values are better |
--lower-is-better | | | Lower metric values are better |
--extract COMMAND | | Last number in stdout | Extract metric from output |
--budget SECONDS | | 300 | Time budget per experiment |
--target VALUE | | none | Stop when metric reaches this value |
--max-experiments N | | none | Stop after N experiments |
--locked GLOB | | none | Files the agent must NOT modify |
--constraints TEXT | | none | Additional rules (e.g., "tests must pass") |
Arguments — Eval Mode
| Argument | Required | Default | Description |
|---|
--target PATH | yes | | Path to skill directory, prompt file, or agent definition |
--max-experiments N | | none | Stop after N experiments |
--runs N | | 3 | Runs per experiment (more = more reliable, slower) |
--criteria "Q1" "Q2" | | auto-generated | Override auto-generated eval criteria |
--inputs "I1" "I2" | | auto-generated | Override auto-generated test inputs |
--budget SECONDS | | 300 | Time budget per experiment |
Shared Arguments
| Argument | Description |
|---|
--resume | Resume a previous optimization run |
--status | Show results summary |
Algorithm Integration
When /optimize is invoked, the Algorithm enters with mode: optimize in the ISA frontmatter. The eval_mode is set based on arguments:
--measure provided → eval_mode: metric (git branch sandbox)
--target provided → eval_mode: eval (directory sandbox)
ISC criteria become guard rails — assertions that must hold true across ALL experiments. Guard rails must REMAIN satisfied perpetually. A violation triggers automatic revert regardless of score improvement.
Reference files:
~/.claude/PAI/ALGORITHM/optimize-loop.md — the full loop protocol
~/.claude/PAI/ALGORITHM/eval-guide.md — how to write good eval criteria
~/.claude/PAI/ALGORITHM/target-types.md — target detection and ISC generation
Examples
Metric Mode
Optimize page load time:
/optimize --metric "lighthouse_perf" --higher-is-better \
--measure "npx lighthouse http://localhost:3000 --output=json --output-path=lh.json" \
--extract "jq '.categories.performance.score * 100' lh.json" \
--files "src/**/*.tsx,src/**/*.css" \
--target 95 --budget 120
Optimize bundle size:
/optimize --metric "bundle_bytes" --lower-is-better \
--measure "bun run build 2>&1 && du -sb dist/ | cut -f1" \
--files "src/**/*.ts" \
--constraints "all tests must pass"
ML training (Karpathy-style):
/optimize --metric "val_bpb" --lower-is-better \
--measure "uv run train.py > run.log 2>&1 && grep '^val_bpb:' run.log | cut -d' ' -f2" \
--files "train.py" \
--locked "prepare.py" \
--budget 300
Eval Mode
Optimize a skill's Extract workflow:
/optimize --target "~/.claude/skills/ExtractWisdom" --max-experiments 15
Optimize a standalone prompt:
/optimize --target "prompts/summarize-article.md" --runs 5
Optimize with custom criteria:
/optimize --target "~/.claude/skills/Research/Workflows/QuickResearch.md" \
--criteria "Does the output contain specific facts with sources?" \
"Is the output structured with clear sections?" \
"Does the output avoid generic filler?" \
--inputs "research quantum computing breakthroughs 2025" \
"quick research on supply chain security" \
"find recent developments in AI agents"
Gotchas
- Hill-climbing can get stuck in local optima. If score plateaus, consider resetting with different initial conditions.
- Eval mode vs metric mode: Use metric mode for quantifiable targets (latency, size). Use eval mode for qualitative targets (skill quality, prompt effectiveness).
- Regression tolerance prevents catastrophic changes. Don't set it to 0 — some regression in secondary metrics is acceptable if primary metric improves significantly.