| name | optimizing-skills |
| description | Disciplined, validation-gated revision of an EXISTING skill so each edit is a measured improvement rather than a guess. Use when editing, revising, or tuning a skill that already exists and there is evidence it underperforms (observed failures, drift, complaints) — invoke by name, or have versioning-skills / creating-skill defer to it before applying edits. Not for authoring a brand-new skill from scratch (use creating-skill) or one-off prose. |
| metadata | {"version":"0.2.0"} |
Optimizing Skills
Treat the skill document as the parameter under optimization: change it only
when the change demonstrably beats the version you already ship. This is
the discipline distilled from SkillOpt (microsoft/SkillOpt, arXiv:2605.23904) —
its training apparatus dropped, its reproducibility discipline kept. The point
is to stop editing skills on intuition and start editing them on evidence.
Core principle
A skill edit is only worth shipping if it strictly improves measured behavior
on a held-out check. Most edits that feel like improvements don't move the
needle, and some quietly regress. The gate below is what separates a real
improvement from a confident guess.
The gate — run it every revision
- Assemble a held-out check set. 3–8 representative tasks/prompts the skill
should handle well, and it must include the failure(s) that prompted this
revision. Keep the set fixed across the revision so before/after scores are
comparable.
- Hold two versions.
best = what you currently ship (never let it
silently degrade). candidate = best + your proposed edits.
- Score both on the check set. "Run" here = dispatch each check task to the
Agent tool (
subagent_type=general-purpose) with the skill version in
context, or evaluate by hand for small sets. Score per criterion, not one
collapsed pass/fail. When a task carries several criteria, the criterion that
decides accept/reject is the failure that prompted this revision; the
others are regression guards that must not get worse. Collapsing criteria
masks the win: in the down-skilling-v1.2.0 retro, the edit drove architectural
hallucination 60%→0% while an unrelated length criterion stayed 0/5 in both
arms — a single combined pass/fail scored that as a 0–0 tie and would have
rejected a large, real improvement.
- Accept only if
candidate strictly beats best on the triggering-failure
criterion, with no regression guard worse. Ties → reject, keep best. An edit
that doesn't move the needle does not ship.
When the skill's own output is compiled by an Agent (down-skilling and
creating-skill produce a prompt an author writes from the SKILL), score ≥2
author samples per version, or fix one author across both arms. A single
author sample per arm lets author capability dominate the edit effect: the same
down-skilling edit measured 95%→0% with one author pair and 60%→0% with another
— real either way, but n=1 cannot tell a real edit from a lucky author.
This two-tier best/candidate split is the heart of it: a working revision
can explore, but the shipped skill only ever ratchets upward.
Bounded edits (the "textual learning rate")
Cap edits per revision — default ~4 distinct add/replace/delete operations,
fewer as the skill matures. Large speculative rewrites drift and destroy your
ability to attribute a regression to a cause. If a revision wants more edits
than the budget, rank and keep the top ones (below) and let the rest wait.
Reflect: failures first, then successes
Separate the evidence before proposing edits:
- Failure reflection. Across the failing cases, find the common,
systematic pattern — not a one-off edge case. Propose edits that fix the
pattern. Failures take priority in any merge.
- Success reflection. Across cases that already work, find generalizable
patterns worth encoding so they survive future edits. Reinforce; don't
duplicate.
For both: edits must generalize (never hardcode task-specific values), and
must not duplicate content already in the skill — patch genuine gaps only.
Rank when over budget
When candidate edits exceed the budget, keep them in this priority order:
- Systematic impact — fixes a recurring failure across many cases, not one.
- Complementarity — fills a real gap rather than restating existing content.
- Generality — phrased as a durable principle, not tied to one task/entity.
- Actionability — concrete, followable guidance over vague advice.
Drop the rest. They can return next revision if still warranted.
Protect the hard-won core
If a skill has a battle-tested core that routine edits keep eroding, fence it
off and treat it as off-limits to fast edits. Revisit it only on a deliberate
longitudinal review: compare the same check tasks across several versions to
catch slow drift and regressions that single-edit review misses. (SkillOpt
fences this region with HTML-comment markers and only rewrites it at epoch
boundaries — the same idea, manual cadence.)
Carry memory across revisions
After a revision, record what you learned about editing this skill — which
kinds of edits helped, which were brittle, redundant, or harmful — via
remember() tagged with the skill name. Before the next revision, recall()
it. This is the compounding part: each revision starts smarter than the last,
the way SkillOpt's optimizer-side meta-skill conditions its future edits.
Edit mechanics
Edits are literal string operations (the Edit tool): the target text must
match exactly or the edit is a silent no-op. Keep targets unique and
verbatim. Prefer append / insert-after-heading / replace-exact / delete-exact,
and verify each edit landed before scoring.
When NOT to use this
- Authoring a brand-new skill from scratch → creating-skill.
- Tracking/rolling back versions during development → versioning-skills.
- One-off prose with no reuse → just write it.
Checklist
For the deeper "dispatch reflection/scoring to the Agent tool" recipe and the
adapted reflection/ranking prompt templates, see
references/skillopt-provenance.md.