| name | skill-optimizer |
| version | 0.1.0 |
| description | Self-evolving skill optimization via SkillOpt-paper-grounded text-space optimizer. |
| triggers | ["optimize this skill","tune the skill against the benchmark","make the skill better","run skillopt","skillopt for"] |
| mutating | true |
| brain_first | exempt |
Skill Optimizer
Self-evolving skill optimization. Treats SKILL.md as the trainable parameters
of a frozen agent. Validation-gated, budget-capped, atomic-versioned.
Based on SkillOpt (arXiv 2605.23904, Microsoft Research, May 2026).
When to invoke this skill
The user wants to:
- Improve an existing skill's execution quality against a benchmark
- Bootstrap a benchmark file for a new skill
- Re-tune a skill after switching target models
Iron Law
- Validation gating is MANDATORY. Every candidate must clear median-of-3
- epsilon=0.05 margin against the sel-set before SKILL.md gets rewritten.
- Frontmatter mutation is FORBIDDEN. The optimizer only edits the body.
Routing surface (
triggers:, brain_first:) stays invariant.
- Bundled skills require explicit opt-in AND an independent held-out set.
Skills shipping with gbrain cannot be auto-mutated. To rewrite one in place
the user passes BOTH
--allow-mutate-bundled AND --held-out <path> with
at least 5 benchmark-disjoint tasks; without the held-out set the run
hard-refuses (exit 2). Drop --allow-mutate-bundled (or pass --no-mutate,
the default for the dream-cycle phase) to write proposed.md for review
instead — no held-out needed for review-only output.
- Bootstrap output requires human review. Both
--bootstrap-from-skill
and --bootstrap-from-routing write a sentinel; you must review + STRENGTHEN
the generated judges, delete the sentinel, and re-run with
--bootstrap-reviewed before optimization can use the file.
The pipeline
gbrain skillopt <skill-name> [flags]
│
├── Pre-flight gates
│ ├── working tree clean (or --force)
│ ├── benchmark valid + D_sel >= 5 (D17)
│ ├── cost preflight (D3) — refuses over --max-cost-usd
│ └── per-skill DB lock (D14)
│
├── Baseline eval on D_sel (sets best_sel_score)
│
├── for epoch in 1..N:
│ for step in 1..steps_per_epoch:
│ ├── forward pass: rollouts on D_train batch
│ ├── backward pass: reflect × 2 (failures + successes per D7)
│ ├── rank + clip via LR cosine schedule
│ ├── apply edits (body-only per D5, tagged result per D9)
│ ├── validation gate: median-of-3 + epsilon=0.05 (D12)
│ └── if accept: commit via D8 history-intent-first
│ │
│ └── slow update (D6) if no improvement this epoch
│
└── Final test eval on D_test → run receipt
Starting a benchmark from the skill itself (the common case)
The user will NOT hand-write a benchmark, and you shouldn't start from a blank
file either. When the user says "make skill X better" and
skills/X/skillopt-benchmark.jsonl doesn't exist, generate a starter from the
SKILL.md directly:
- Generate the starter. Run:
gbrain skillopt X --bootstrap-from-skill
One LLM call reads skills/X/SKILL.md, infers what the skill produces and what
"good" looks like, and writes ~15 tasks (each with rule judges) to
skills/X/skillopt-benchmark.jsonl plus a # BOOTSTRAP_PENDING_REVIEW
sentinel. No routing-eval.jsonl is needed. Tune the count with
--bootstrap-tasks N (max 50).
- Review AND STRENGTHEN the judges. This is YOUR job and it is load-bearing.
The generated rule checks are weak drafts — the model tends to emit generic
contains, loose max_chars, or invented headings. Read each task, fix soft
checks, add the must-haves the skill actually requires (real section names,
real length ceilings, min_citations where sources are expected,
tool_called/tool_not_called for tools the skill genuinely uses). A thin
benchmark optimizes for a thin definition of quality — do not rubber-stamp.
- Delete the sentinel line (
# BOOTSTRAP_PENDING_REVIEW, the last line).
- Run the optimizer with
--split 1:1:1:
gbrain skillopt X --bootstrap-reviewed --split 1:1:1
The 1:1:1 split is REQUIRED for a 15-task starter — the default 4:1:5 makes
the validation set floor(15/10)=1, below the D_sel >= 5 floor, and the
optimizer refuses with d_sel_too_small. (4:1:5 needs ~50 tasks.) Add
--dry-run first to preview cost.
Benchmark line shape (what the generator writes, one per line):
{"task_id":"x-001","task":"<user prompt>","judge":{"kind":"rule","checks":[{"op":"max_chars","arg":1800},{"op":"contains","arg":"agenda"}]}}
Rule-check vocabulary you'll strengthen with: contains, regex,
section_present, max_chars, min_citations, tool_called, tool_not_called.
Rule judges are deterministic and free, but shallow for skills whose quality is
sequencing, privacy, refusal boundaries, or file placement — for those, hand-add
richer checks (or an llm judge) during review.
Fallback — author freehand. If the generated starter is poor (rare, but
possible for very behavior-shaped skills), discard it and write the JSONL
yourself: read the SKILL.md, write ~15 realistic tasks covering the boring middle,
attach >=2 rule checks each, save to skills/X/skillopt-benchmark.jsonl, run with
--split 1:1:1. The human walkthrough lives at
docs/tutorials/improving-skills-with-skillopt.md.
Decision tree
| Situation | Action |
|---|
| Skill has no benchmark | gbrain skillopt foo --bootstrap-from-skill → review + strengthen the judges → delete sentinel → gbrain skillopt foo --bootstrap-reviewed --split 1:1:1 (see section above) |
Skill has a routing-eval.jsonl and you want a head start | gbrain skillopt foo --bootstrap-from-routing → review the generated tasks → --bootstrap-reviewed (routing tasks test dispatch; tighten them into quality tasks before trusting) |
| Iterating on an existing skill | gbrain skillopt foo --benchmark skills/foo/skillopt-benchmark.jsonl |
| Costly run, want preview | Add --dry-run |
| Bundled skill (skills/ in gbrain repo) | Default writes proposed.md; to commit in place add --allow-mutate-bundled AND --held-out <path> (>=5 benchmark-disjoint tasks) — else it hard-refuses |
| Want to review changes before applying | Add --no-mutate (writes proposed.md, no held-out needed) |
| Guard against benchmark overfitting | Add --held-out <path> — a candidate that beats the benchmark but regresses on the held-out set is refused |
| Mid-run crash | gbrain skillopt foo --resume <run-id> |
Output Format
When invoked, this skill produces:
- Updated
skills/<name>/SKILL.md (when mutation is allowed)
skills/<name>/skillopt/best.md — pointer copy of current best
skills/<name>/skillopt/versions/vNNNN_eN_sN.md — per-step snapshots
skills/<name>/skillopt/history.json — append-only run record
skills/<name>/skillopt/rejected.json — bounded LRU of rejected edits
~/.gbrain/audit/skillopt-YYYY-Www.jsonl — ISO-week-rotated audit trail
Anti-Patterns
- Don't bypass the validation gate. The median-of-3 + epsilon=0.05 is
load-bearing; without it, the optimizer accepts noise as improvement.
- Don't optimize bundled skills without
--allow-mutate-bundled AND
--held-out. They ship with gbrain and are load-bearing for downstream
agents. In-place mutation requires both flags (held-out >=5 benchmark-disjoint
tasks); without the held-out set the run hard-refuses and points you at
proposed.md.
- Don't use bootstrap output without strengthening it. Both
--bootstrap-from-skill and --bootstrap-from-routing have the optimizer
model invent success criteria — generic and weak by default. Review and
tighten the judges before SkillOpt optimizes against them, or it trains the
skill toward benchmark artifacts instead of real quality.
- Don't skip
--split 1:1:1 on a ~15-task starter. The default 4:1:5
split drops the validation set below the D_sel >= 5 floor and the run
aborts with d_sel_too_small.
Contract
runSkillOpt(opts) returns:
{
outcome: 'accepted' | 'no_improvement' | 'aborted' | 'errored',
receipt: {
run_id, skill_sha8, benchmark_sha8, models, cost,
baseline_sel_score, best_sel_score, // real measured baseline (no longer hardcoded 0)
baseline_test_score, test_score, // final held-out test-split eval
},
finalText: string,
mutatedSkillFile: boolean,
proposedPath?: string
}
Related skills
skillify — scaffolds a new skill (use BEFORE skillopt)
skillpack-check — audits skill conformance (item 13 surfaces skillopt status)
conventions/quality.md — output quality standards skillopt enforces via judges