skill-optimizer
Self-evolving skill optimization via SkillOpt-paper-grounded text-space optimizer.
| name | skill-optimizer |
| version | 0.1.0 |
| description | Self-evolving skill optimization via SkillOpt-paper-grounded text-space optimizer. |
| triggers | ["optimize this skill","tune the skill against the benchmark","make the skill better","run skillopt","skillopt for"] |
| mutating | true |
| brain_first | exempt |
Self-evolving skill optimization. Treats SKILL.md as the trainable parameters of a frozen agent. Validation-gated, budget-capped, atomic-versioned.
Based on SkillOpt (arXiv 2605.23904, Microsoft Research, May 2026).
The user wants to optimize a skill, tune it against its benchmark, or otherwise make it better (see the triggers above).

Edits are body-only (D5): the frontmatter (including triggers: and brain_first:) stays invariant.

Mutating a bundled skill in place requires --allow-mutate-bundled AND --held-out <path> with
at least 5 benchmark-disjoint tasks; without the held-out set the run
hard-refuses (exit 2). Drop --allow-mutate-bundled (or pass --no-mutate,
the default for the dream-cycle phase) to write proposed.md for review
instead; no held-out set is needed for review-only output.

--bootstrap-from-skill and --bootstrap-from-routing write a sentinel; you must review + STRENGTHEN
the generated judges, delete the sentinel, and re-run with
--bootstrap-reviewed before optimization can use the file.

gbrain skillopt <skill-name> [flags]
│
├── Pre-flight gates
│ ├── working tree clean (or --force)
│ ├── benchmark valid + D_sel >= 5 (D17)
│ ├── cost preflight (D3) — refuses over --max-cost-usd
│ └── per-skill DB lock (D14)
│
├── Baseline eval on D_sel (sets best_sel_score)
│
├── for epoch in 1..N:
│ for step in 1..steps_per_epoch:
│ ├── forward pass: rollouts on D_train batch
│ ├── backward pass: reflect × 2 (failures + successes per D7)
│ ├── rank + clip via LR cosine schedule
│ ├── apply edits (body-only per D5, tagged result per D9)
│ ├── validation gate: median-of-3 + epsilon=0.05 (D12)
│ └── if accept: commit via D8 history-intent-first
│ │
│ └── slow update (D6) if no improvement this epoch
│
└── Final test eval on D_test → run receipt
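The gate and clipping steps in the loop above can be sketched roughly as follows. This is a minimal illustration, not gbrain's implementation: the function names (`median`, `passesValidationGate`, `cosineEditBudget`, `rankAndClip`) are hypothetical, and it assumes that "median-of-3 + epsilon" means the median of three D_sel scores must beat the current best by at least 0.05, and that the "LR" is the number of edits applied per step.

```typescript
// Hypothetical helpers; the names and the exact acceptance rule are assumptions.

// Median of a non-empty list of scores.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of an empty list");
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Assumed gate: accept only if the median of the repeated D_sel scores
// beats the current best by at least epsilon.
function passesValidationGate(
  candidateScores: number[],
  bestSelScore: number,
  epsilon: number = 0.05
): boolean {
  return median(candidateScores) >= bestSelScore + epsilon;
}

// Assumed cosine decay of the per-step edit budget (the "learning rate"):
// maxEdits at step 0, minEdits at the last step. Steps are 0-indexed.
function cosineEditBudget(
  step: number,
  totalSteps: number,
  maxEdits: number,
  minEdits: number = 1
): number {
  const progress =
    totalSteps <= 1 ? 0 : Math.min(Math.max(step / (totalSteps - 1), 0), 1);
  const scale = 0.5 * (1 + Math.cos(Math.PI * progress));
  return Math.round(minEdits + (maxEdits - minEdits) * scale);
}

interface RankedEdit {
  id: string;
  score: number; // higher means the reflection step rated the edit more useful
}

// Rank edits best-first and keep only as many as the budget allows.
function rankAndClip(edits: RankedEdit[], budget: number): RankedEdit[] {
  return edits
    .slice()
    .sort((a, b) => b.score - a.score)
    .slice(0, Math.max(0, budget));
}
```

Taking the median of repeated evaluations damps single-run noise, and the epsilon margin keeps the optimizer from committing edits whose gain is within that noise.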
The user will NOT hand-write a benchmark, and you shouldn't start from a blank
file either. When the user says "make skill X better" and
skills/X/skillopt-benchmark.jsonl doesn't exist, generate a starter from the
SKILL.md directly:
gbrain skillopt X --bootstrap-from-skill
One LLM call reads skills/X/SKILL.md, infers what the skill produces and what
"good" looks like, and writes ~15 tasks (each with rule judges) to
skills/X/skillopt-benchmark.jsonl plus a # BOOTSTRAP_PENDING_REVIEW
sentinel. No routing-eval.jsonl is needed. Tune the count with
--bootstrap-tasks N (max 50).

Review and strengthen the judges. The generated ones are generic and weak by default: generic contains checks, loose max_chars, or invented headings. Read each task, fix soft
checks, add the must-haves the skill actually requires (real section names,
real length ceilings, min_citations where sources are expected,
tool_called/tool_not_called for tools the skill genuinely uses). A thin
benchmark optimizes for a thin definition of quality, so do not rubber-stamp.

Delete the sentinel (# BOOTSTRAP_PENDING_REVIEW, the last line).

Re-run with --bootstrap-reviewed and --split 1:1:1:
gbrain skillopt X --bootstrap-reviewed --split 1:1:1
The 1:1:1 split is REQUIRED for a 15-task starter — the default 4:1:5 makes
the validation set floor(15/10)=1, below the D_sel >= 5 floor, and the
optimizer refuses with d_sel_too_small. (4:1:5 needs ~50 tasks.) Add
--dry-run first to preview cost.

Benchmark line shape (what the generator writes, one per line):
{"task_id":"x-001","task":"<user prompt>","judge":{"kind":"rule","checks":[{"op":"max_chars","arg":1800},{"op":"contains","arg":"agenda"}]}}
Rule-check vocabulary you'll strengthen with: contains, regex,
section_present, max_chars, min_citations, tool_called, tool_not_called.
Rule judges are deterministic and free, but shallow for skills whose quality is
sequencing, privacy, refusal boundaries, or file placement — for those, hand-add
richer checks (or an llm judge) during review.
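To make the line shape and the rule vocabulary concrete, here is a small sketch of how a benchmark file could be parsed and judged. It is not gbrain's code: `parseBenchmark`, `runCheck`, and `judgeOutput` are hypothetical names, and the check semantics are assumptions (case-sensitive `contains`, `regex` as a JavaScript pattern, `max_chars` as an inclusive ceiling). The remaining ops are left unmodeled and reported as skipped, because the source does not define their behavior.

```typescript
// Hypothetical types mirroring the benchmark line shape shown above.
type RuleCheck = { op: string; arg: string | number };

interface BenchmarkTask {
  task_id: string;
  task: string;
  judge: { kind: string; checks?: RuleCheck[] };
}

const SENTINEL = "# BOOTSTRAP_PENDING_REVIEW";

// Parse benchmark JSONL text and report whether the review sentinel is still present.
function parseBenchmark(text: string): { tasks: BenchmarkTask[]; pendingReview: boolean } {
  const lines = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const pendingReview = lines.some((line) => line.indexOf(SENTINEL) === 0);
  const tasks = lines
    .filter((line) => line.charAt(0) !== "#") // skip the sentinel and other comment lines
    .map((line) => JSON.parse(line) as BenchmarkTask);
  return { tasks, pendingReview };
}

// Evaluate the checks this sketch models; return null for any other op.
function runCheck(check: RuleCheck, output: string): boolean | null {
  switch (check.op) {
    case "contains":
      return output.indexOf(String(check.arg)) !== -1; // assumed case-sensitive
    case "regex":
      return new RegExp(String(check.arg)).test(output);
    case "max_chars":
      return output.length <= Number(check.arg); // assumed inclusive ceiling
    default:
      return null;
  }
}

// A task passes when every modeled check passes; unmodeled ops are listed in `skipped`.
function judgeOutput(task: BenchmarkTask, output: string): { pass: boolean; skipped: string[] } {
  const skipped: string[] = [];
  let pass = true;
  for (const check of task.judge.checks ?? []) {
    const result = runCheck(check, output);
    if (result === null) skipped.push(check.op);
    else if (!result) pass = false;
  }
  return { pass, skipped };
}
```

Because every check is a pure function of the output text, the same rollout always gets the same verdict, which is what makes rule judges deterministic and free.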
Fallback — author freehand. If the generated starter is poor (rare, but
possible for very behavior-shaped skills), discard it and write the JSONL
yourself: read the SKILL.md, write ~15 realistic tasks covering the boring middle,
attach >=2 rule checks each, save to skills/X/skillopt-benchmark.jsonl, run with
--split 1:1:1. The human walkthrough lives at
docs/tutorials/improving-skills-with-skillopt.md.
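The arithmetic behind the --split 1:1:1 requirement can be checked with a short sketch. It assumes the ratio is ordered train:sel:test, that the selection and test sizes are floored, and that any remainder goes to the training split. Only the floor on the selection set is stated in the text, so the rest is illustrative, and `splitSizes` and `selIsLargeEnough` are hypothetical names.

```typescript
interface SplitSizes {
  train: number;
  sel: number;
  test: number;
}

// Minimum size of the selection (validation) set, per the D_sel >= 5 rule.
const D_SEL_FLOOR = 5;

// Compute split sizes for a ratio string such as "4:1:5".
// Assumption: order is train:sel:test, and the remainder goes to train.
function splitSizes(total: number, ratio: string): SplitSizes {
  const parts = ratio.split(":").map(Number);
  if (parts.length !== 3 || parts.some((p) => !isFinite(p) || p <= 0)) {
    throw new Error("invalid split ratio: " + ratio);
  }
  const sum = parts[0] + parts[1] + parts[2];
  const sel = Math.floor((total * parts[1]) / sum);
  const test = Math.floor((total * parts[2]) / sum);
  return { train: total - sel - test, sel, test };
}

// True when the selection set meets the floor for this task count and ratio.
function selIsLargeEnough(total: number, ratio: string): boolean {
  return splitSizes(total, ratio).sel >= D_SEL_FLOOR;
}
```

With 15 tasks, 4:1:5 yields a selection set of 1, while 1:1:1 yields 5. At 50 tasks, 4:1:5 reaches 5 and clears the floor.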
| Situation | Action |
|---|---|
| Skill has no benchmark | gbrain skillopt foo --bootstrap-from-skill → review + strengthen the judges → delete sentinel → gbrain skillopt foo --bootstrap-reviewed --split 1:1:1 (see section above) |
| Skill has a routing-eval.jsonl and you want a head start | gbrain skillopt foo --bootstrap-from-routing → review the generated tasks → --bootstrap-reviewed (routing tasks test dispatch; tighten them into quality tasks before trusting) |
| Iterating on an existing skill | gbrain skillopt foo --benchmark skills/foo/skillopt-benchmark.jsonl |
| Costly run, want preview | Add --dry-run |
| Bundled skill (skills/ in gbrain repo) | Default writes proposed.md; to commit in place add --allow-mutate-bundled AND --held-out <path> (>=5 benchmark-disjoint tasks) — else it hard-refuses |
| Want to review changes before applying | Add --no-mutate (writes proposed.md, no held-out needed) |
| Guard against benchmark overfitting | Add --held-out <path> — a candidate that beats the benchmark but regresses on the held-out set is refused |
| Mid-run crash | gbrain skillopt foo --resume <run-id> |
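The mutation and held-out rows of the table can be read as one decision procedure, sketched below. This is a sketch under assumptions, not the tool's logic: `decideWriteMode` and `passesHeldOut` are hypothetical names, a non-bundled skill is assumed to be mutated in place unless --no-mutate is passed, and disjointness is assumed to be judged by `task_id`.

```typescript
type WriteDecision =
  | { mode: "mutate" } // write SKILL.md in place
  | { mode: "proposed" } // write proposed.md for review
  | { mode: "refuse"; exitCode: number; reason: string };

interface MutationOptions {
  bundled: boolean; // skill lives under skills/ in the gbrain repo
  allowMutateBundled: boolean; // --allow-mutate-bundled
  noMutate: boolean; // --no-mutate
  heldOutTaskIds: string[] | null; // ids loaded from --held-out <path>, or null if absent
  benchmarkTaskIds: string[];
}

const MIN_HELD_OUT = 5;

// Decide where the optimized skill text may be written.
function decideWriteMode(o: MutationOptions): WriteDecision {
  if (o.noMutate) return { mode: "proposed" };
  if (!o.bundled) return { mode: "mutate" }; // assumption: default for non-bundled skills
  if (!o.allowMutateBundled) return { mode: "proposed" };
  // Bundled skill with --allow-mutate-bundled: a held-out set is mandatory.
  const heldOut = o.heldOutTaskIds ?? [];
  const disjoint = heldOut.filter((id) => o.benchmarkTaskIds.indexOf(id) === -1);
  if (disjoint.length < MIN_HELD_OUT) {
    return {
      mode: "refuse",
      exitCode: 2,
      reason: "held-out set needs at least 5 benchmark-disjoint tasks",
    };
  }
  return { mode: "mutate" };
}

// Overfitting guard: a candidate that beats the benchmark but regresses
// on the held-out set is refused.
function passesHeldOut(
  beatsBenchmark: boolean,
  heldOutBefore: number,
  heldOutAfter: number
): boolean {
  return beatsBenchmark && heldOutAfter >= heldOutBefore;
}
```

Putting the --no-mutate check first mirrors the table: review-only output never needs a held-out set, whatever else is passed.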
When invoked, this skill produces:
- skills/<name>/SKILL.md (when mutation is allowed)
- skills/<name>/skillopt/best.md: pointer copy of current best
- skills/<name>/skillopt/versions/vNNNN_eN_sN.md: per-step snapshots
- skills/<name>/skillopt/history.json: append-only run record
- skills/<name>/skillopt/rejected.json: bounded LRU of rejected edits
- ~/.gbrain/audit/skillopt-YYYY-Www.jsonl: ISO-week-rotated audit trail

Bundled skills are gated behind --allow-mutate-bundled AND
--held-out. They ship with gbrain and are load-bearing for downstream
agents. In-place mutation requires both flags (held-out >=5 benchmark-disjoint
tasks); without the held-out set the run hard-refuses and points you at
proposed.md.

--bootstrap-from-skill and --bootstrap-from-routing have the optimizer
model invent success criteria, which are generic and weak by default. Review and
tighten the judges before SkillOpt optimizes against them, or it trains the
skill toward benchmark artifacts instead of real quality.

Use --split 1:1:1 on a ~15-task starter. The default 4:1:5
split drops the validation set below the D_sel >= 5 floor and the run
aborts with d_sel_too_small.

runSkillOpt(opts) returns:
{
outcome: 'accepted' | 'no_improvement' | 'aborted' | 'errored',
receipt: {
run_id, skill_sha8, benchmark_sha8, models, cost,
baseline_sel_score, best_sel_score, // real measured baseline (no longer hardcoded 0)
baseline_test_score, test_score, // final held-out test-split eval
},
finalText: string,
mutatedSkillFile: boolean,
proposedPath?: string
}
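A caller might consume that result along these lines. The interfaces below restate the shape above with assumed field types: the scores are taken to be numbers, the ids strings, and `models` and `cost` are left as `unknown` because their structure is not given. `summarize` and `formatDelta` are hypothetical helpers, not part of gbrain.

```typescript
type Outcome = "accepted" | "no_improvement" | "aborted" | "errored";

// Field types are assumptions; only the field names come from the shape above.
interface Receipt {
  run_id: string;
  skill_sha8: string;
  benchmark_sha8: string;
  models: unknown;
  cost: unknown;
  baseline_sel_score: number;
  best_sel_score: number;
  baseline_test_score: number;
  test_score: number;
}

interface SkillOptResult {
  outcome: Outcome;
  receipt: Receipt;
  finalText: string;
  mutatedSkillFile: boolean;
  proposedPath?: string;
}

// Format a score change with an explicit sign and two decimals.
function formatDelta(delta: number): string {
  return (delta >= 0 ? "+" : "") + delta.toFixed(2);
}

// One-line summary: outcome, score deltas on D_sel and D_test, and where output went.
function summarize(r: SkillOptResult): string {
  const selDelta = r.receipt.best_sel_score - r.receipt.baseline_sel_score;
  const testDelta = r.receipt.test_score - r.receipt.baseline_test_score;
  const where = r.mutatedSkillFile
    ? "SKILL.md updated in place"
    : r.proposedPath
      ? "proposal written to " + r.proposedPath
      : "no file written";
  return (
    r.outcome + ": sel " + formatDelta(selDelta) +
    ", test " + formatDelta(testDelta) + "; " + where
  );
}
```

Reporting the D_test delta next to the D_sel delta exposes a run that improved on the selection set without improving on the final test split.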
- skillify: scaffolds a new skill (use BEFORE skillopt)
- skillpack-check: audits skill conformance (item 13 surfaces skillopt status)
- conventions/quality.md: output quality standards skillopt enforces via judges