toolkit-evolution
// Closed-loop toolkit self-improvement: discover gaps, diagnose, propose, critique, build, test, evolve.
| name | toolkit-evolution |
| description | Closed-loop toolkit self-improvement: discover gaps, diagnose, propose, critique, build, test, evolve. |
| user-invocable | true |
| argument-hint | <optional: focus area like 'routing' or 'hooks'> |
| command | evolve |
| context | fork |
| allowed-tools | ["Read","Write","Edit","Bash","Glob","Grep","Agent","Skill"] |
| routing | {"triggers":["evolve toolkit","improve the system","self-improve","toolkit evolution","what should we improve","find improvement opportunities","discover skill gaps","what skills are missing","systematic improvement"],"pairs_with":["multi-persona-critique","skill-eval"],"complexity":"Complex","category":"meta-tooling"} |
Schedulable (nightly) or manually-invoked 7-phase pipeline for continuous toolkit self-improvement. Discovers gaps, diagnoses problems from evidence, proposes solutions, critiques via multi-persona review, builds winners on isolated branches, A/B tests, and promotes via PR.
Nightly sibling of auto-dream (auto-dream consolidates memories at 2:07 AM; this skill diagnoses and builds at 3:07 AM). They feed each other: dream's graduated learnings inform evolution's diagnosis; evolution's results become dream's next input.
Invoke: /evolve, /evolve routing, /evolve hooks, /evolve --discover. Cron setup in references/evolve-preferred-patterns.md § Scheduling.
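A minimal cron sketch for the nightly slot, assuming a headless `claude -p` entry point; the binary, flags, and paths are assumptions, and the canonical setup lives in references/evolve-preferred-patterns.md § Scheduling:

```bash
# Hypothetical crontab entry -- adjust binary, repo path, and log path to your install.
# 3:07 AM nightly, one hour after auto-dream's 2:07 AM consolidation run.
7 3 * * * cd /path/to/toolkit && claude -p "/evolve" >> logs/evolve-nightly.log 2>&1
```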
| Signal | Load These Files | Why |
|---|---|---|
| Phase 0-1 discovery and diagnosis commands needed | diagnose-scripts.md | Bash/Python commands for DISCOVER and DIAGNOSE. |
| writing or reading the evolution report | evolution-report-template.md | Template for the dated evolution report. |
| implementation patterns | evolve-preferred-patterns.md | Anti-patterns, error handling, critique fallback, cost, scheduling. |
| Phase 6 promotion and record-keeping | evolve-scripts.md | PR, merge, cleanup, and learning DB commands. |
Goal: Identify skills, agents, or capability categories the toolkit should have but doesn't. While later phases improve existing components, this phase finds entirely new capabilities the toolkit is missing.
Frequency: Monthly, not every run. The DISCOVER phase only executes if:
The --discover flag is passed explicitly, OR the monthly window has elapsed -- check the last discovery run date using the frequency check command from references/diagnose-scripts.md § Discovery Frequency Check (sketched below).
If neither condition is met, skip directly to Phase 1.
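A sketch of that frequency check, treating "monthly" as 30 days and assuming GNU date plus the discovery-{YYYY-MM-DD}.md naming from Step 5 below; the canonical command is in references/diagnose-scripts.md:

```bash
# Parse the date out of the newest discovery report filename (assumed layout).
last=$(ls evolution-reports/discovery-*.md 2>/dev/null | sort | tail -1)
if [ -z "$last" ]; then
  echo "RUN DISCOVER (no prior report)"
else
  d=${last##*discovery-}; d=${d%.md}
  days=$(( ($(date +%s) - $(date -d "$d" +%s)) / 86400 ))
  [ "$days" -ge 30 ] && echo "RUN DISCOVER" || echo "SKIP (last run ${days}d ago)"
fi
```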
Step 1: Gather briefing data
Collect current toolkit state using the briefing data commands from references/diagnose-scripts.md § DISCOVER Step 1. Brief all 5 perspective agents with the same baseline.
Step 2: Dispatch 5 perspective agents in parallel
See references/evolve-preferred-patterns.md § Phase 0 DISCOVER for the full agent table and proposal format. Dispatch all 5 simultaneously.
Step 3: Deduplicate and filter -- remove duplicates of existing skills (check skills/INDEX.json), drop proposals with no evidence (require at least one concrete data point), and group similar proposals, noting convergent evidence.
Step 4: Feed into DIAGNOSE -- append surviving proposals to the Phase 1 opportunity list with source tagged [DISCOVER].
Step 5: Save discovery report to evolution-reports/discovery-{YYYY-MM-DD}.md (run mkdir -p evolution-reports first). Include briefing data, all proposals, filtering rationale, forwarded proposals, and date stamp.
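For example (the date format is assumed from the filename pattern above):

```bash
mkdir -p evolution-reports
report="evolution-reports/discovery-$(date +%F).md"   # %F expands to YYYY-MM-DD
```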
Gate: Discovery report saved. Proposals forwarded to Phase 1. Proceed to DIAGNOSE.
Goal: Identify 5-10 evidence-backed improvement opportunities from multiple data sources.
Step 1: Query the learning database for recent failures and routing mismatches
Run the 4 search queries from references/diagnose-scripts.md § DIAGNOSE Step 1.
Look for: routing decision patterns, recurring routing failures and mismatches, skills that consistently underperform, error patterns without automated fixes.
Step 2: Scan recent git history for patterns
Run the git history commands from references/diagnose-scripts.md § DIAGNOSE Step 2.
Step 3: Check auto-dream reports for accumulated insights
Run the dream report check from references/diagnose-scripts.md § DIAGNOSE Step 3, then read the most recent dream-analysis file.
Step 3b: Cross-validate dream insights against current state
Before treating any dream insight as a proposal signal, verify it still reflects the current repo. Use the cross-validation commands from references/diagnose-scripts.md § DIAGNOSE Step 3b.
Mark an insight as STALE if: (a) it names a file that no longer exists, OR (b) it claims recent activity but git log shows nothing in the past 7 days.
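A minimal sketch of both staleness tests for a single insight; the file path is a placeholder, and the real commands are in references/diagnose-scripts.md § DIAGNOSE Step 3b:

```bash
f="skills/example/SKILL.md"   # file named by the dream insight (placeholder)
[ -e "$f" ] || echo "STALE: $f no longer exists"
# Only meaningful when the insight claims recent activity:
[ -n "$(git log --since='7 days ago' --oneline -- "$f")" ] \
  || echo "STALE: no commits touching $f in the past 7 days"
```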
Step 4: Check routing-table drift
Skills present in skills/INDEX.json but absent from the routing manifest represent a documentation gap. Run the routing-drift check from references/diagnose-scripts.md § DIAGNOSE Step 4.
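An illustrative drift check in the same pipe-into-python3 idiom used elsewhere in this skill; ROUTING_MANIFEST.md is a placeholder for the toolkit's actual manifest path, and the canonical command is in references/diagnose-scripts.md § DIAGNOSE Step 4:

```bash
cat skills/INDEX.json | python3 -c "
import sys, json
manifest = open('ROUTING_MANIFEST.md').read()  # placeholder path
for name in json.load(sys.stdin).get('skills', {}):
    if name not in manifest:
        print('DRIFT:', name, 'missing from routing manifest')
"
```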
Step 4b: Check for orphaned ADR session files
Run the orphaned session check from references/diagnose-scripts.md § DIAGNOSE Step 4b. Flag any found -- do not remove automatically.
Step 4c: Scan for registered stub hooks
Run the stub hook audit from references/diagnose-scripts.md § DIAGNOSE Step 4c. Flag any stub hook as a cleanup opportunity.
Step 5: Narrow by focus area (if provided)
If the user specified a focus area (e.g., "routing", "hooks", "agents"), filter all findings to that domain.
Step 6: Compile opportunity list
Output a numbered list of 5-10 improvement opportunities. Each entry must name the opportunity, cite at least one piece of concrete evidence, and tag the data source it came from (learning DB, git history, dream report, drift/stub checks, or [DISCOVER]).
Gate: At least 3 evidence-backed opportunities identified. If fewer than 3, expand the time window or broaden the data sources. Do not proceed with speculative opportunities that lack evidence.
Goal: Transform opportunities into actionable proposals with clear scope.
Step 1: Generate proposals
For each opportunity from Phase 1, propose 1-2 concrete solutions. Each proposal must be actionable: it names the specific files, skills, or scripts to change and states the expected outcome.
Step 2: Estimate effort
| Effort | Definition |
|---|---|
| Small | Single file edit, <30 lines changed |
| Medium | 2-5 files, new reference or script, <200 lines |
| Large | New skill or agent, multiple components, >200 lines |
Step 3: Check for duplicates
cat skills/INDEX.json | python3 -c "import sys,json; idx=json.load(sys.stdin); [print(k,'-',v.get('description','')) for k,v in idx.get('skills',{}).items()]" 2>/dev/null || echo "INDEX.json parse failed -- check manually"
Drop any proposal that duplicates an existing skill or capability.
Step 4: Rank proposals
Rank by: (Impact score) x (1 / Effort score), where High=3, Medium=2, Low=1 and Small=1, Medium=2, Large=3.
Output: ranked list of 5-10 proposals, each with proposal description, scope, effort, and expected outcome.
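A worked example of that score (illustrative only):

```bash
# score = impact / effort; higher builds first.
awk 'BEGIN {
  printf "High impact / Small effort = %.2f\n", 3/1   # 3.00 -- top of the list
  printf "High impact / Large effort = %.2f\n", 3/3   # 1.00
  printf "Low impact  / Large effort = %.2f\n", 1/3   # 0.33 -- likely cut
}'
```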
Gate: All proposals are concrete (specific files/skills named), non-duplicative (verified against INDEX.json), and ranked. Proceed with the top 5.
Goal: Evaluate proposals from multiple perspectives to surface blind spots.
Step 1: Check for multi-persona-critique skill
test -f skills/research/multi-persona-critique/SKILL.md && echo "AVAILABLE" || echo "NOT AVAILABLE"
Step 2a: If multi-persona-critique is available
Skill(skill="multi-persona-critique", args="Evaluate these toolkit improvement proposals: {proposals}")
Step 2b: If NOT available -- use inline fallback
See references/evolve-preferred-patterns.md § Phase 3 Inline Critique Fallback for the 3-agent dispatch prompts and scoring table.
Step 3: Synthesize consensus
For each proposal, average the persona scores (STRONG=3, MODERATE=2, WEAK=1) to produce a single consensus rating.
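For example, persona verdicts of STRONG, STRONG, MODERATE average to (3 + 3 + 2) / 3 ≈ 2.67; the threshold for mapping an averaged score back to a STRONG label is an assumption left to synthesis, not fixed by this skill.

```bash
# Illustrative consensus math for one proposal rated STRONG, STRONG, MODERATE.
awk 'BEGIN { printf "consensus score = %.2f\n", (3 + 3 + 2) / 3 }'   # 2.67
```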
Gate: All personas have reported. Synthesis complete. At least 1 proposal rated STRONG. If no STRONG proposals, revisit Phase 2 with the critique feedback, or report to user that no high-confidence improvements were found this cycle.
On early exit (no STRONG proposals): always record to the learning DB before stopping. See references/evolve-scripts.md § Early Exit Record for the learning-db command template.
Goal: Implement the top 1-3 STRONG-rated proposals on isolated feature branches.
Constraint: Maximum 3 implementations per cycle. Focus over breadth.
Step 1: Select winners
Take the top 1-3 proposals rated STRONG by consensus. Do not pad with MODERATE proposals.
Step 2: Dispatch implementation agents
For each winner, dispatch an implementation agent in an isolated context. See references/evolve-scripts.md § Build Dispatch for the proposal-type to implementation-approach table.
Each implementation must create a feature branch feat/evolve-{proposal-slug} and commit with a descriptive message.
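For a proposal slugged routing-drift-check (a hypothetical example), that looks like:

```bash
git checkout -b feat/evolve-routing-drift-check
# ...implementation agent edits files here...
git add -A
git commit -m "feat(evolve): add routing-drift check to DIAGNOSE Step 4"
```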
Step 3: Validate -- run python3 -m scripts.skill_eval.quick_validate skills/{skill-name}, python3 -m py_compile {script}, and bash -n {script} on each implementation.
Gate: All implementations committed on feature branches. Basic validation passed. Proceed to testing.
Goal: Empirically verify that each implementation improves outcomes vs baseline.
Step 1: Create test cases
For each implementation, create 3-5 realistic test prompts that exercise the changed behavior.
Step 2: Run comparisons
See references/evolve-scripts.md § Validate Run for the skill-eval command and manual fallback pattern.
Step 3: Evaluate results
Win condition for each implementation: it measurably improves outcomes over the baseline on the test prompts from Step 1.
Gate: All implementations tested. Win/loss determined for each. Evidence recorded.
Goal: Ship winners via PR, record all outcomes in the learning database.
Step 1: Handle winners (WIN status)
For each winning implementation, create a PR using the template from references/evolve-scripts.md § Step 1, run pr-review to validate, then merge.
The multi-persona critique + A/B testing gate is the review. Auto-merge is safe because the validation happened before this step.
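A hedged sketch of this step using the GitHub CLI, reusing the hypothetical branch from Phase 4; the actual template and merge commands are in references/evolve-scripts.md § Step 1:

```bash
gh pr create --base main --head feat/evolve-routing-drift-check \
  --title "evolve: routing-drift check in DIAGNOSE" \
  --body-file /tmp/evolve-pr-body.md   # body rendered from the § Step 1 template; path is illustrative
# run pr-review against the open PR, then merge; branch cleanup follows in Step 1b
gh pr merge --squash
```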
Step 1b: Clean up the feature branch after merge
Use the cleanup commands from references/evolve-scripts.md § Step 1b.
Step 2: Handle losers (LOSS status)
Record what was tried and why it failed using the failure template from references/evolve-scripts.md § Step 2.
Step 3: Record the full cycle
Record using the full cycle template from references/evolve-scripts.md § Step 3.
Step 4: Write evolution report
Write the dated report to evolution-reports/evolution-report-{YYYY-MM-DD}.md using the template in references/evolution-report-template.md. See setup command in references/evolve-scripts.md § Step 4.
Gate: Winners merged. Learnings recorded for all proposals (wins and losses). Evolution report written. Cycle complete.
| Signal | Load |
|---|---|
| Running Phase 0 DISCOVER (frequency check, briefing data commands needed) | references/diagnose-scripts.md |
| Running Phase 1 DIAGNOSE (Steps 1-4c commands needed) | references/diagnose-scripts.md |
| Phase 0 perspective agent table, proposal format | references/evolve-preferred-patterns.md |
| Phase 3 inline critique fallback (multi-persona not available) | references/evolve-preferred-patterns.md |
| Anti-patterns, error handling, cost estimate, cron scheduling | references/evolve-preferred-patterns.md |
| Running Phase 6 EVOLVE (PR template, merge, cleanup, learning DB commands) | references/evolve-scripts.md |
| Writing or reading the evolution report | references/evolution-report-template.md |
references/evolution-report-template.md -- Template for the evolution report
references/diagnose-scripts.md -- Phase 0 and Phase 1 bash/Python commands
references/evolve-scripts.md -- Phase 6 PR, merge, cleanup, and learning DB commands
references/evolve-preferred-patterns.md -- Anti-patterns, error handling, cost, critique fallback, scheduling
skills/meta/auto-dream/SKILL.md -- Nightly sibling: memory consolidation and learning graduation
skills/meta/skill-eval/SKILL.md -- Skill testing and benchmarking
skills/research/multi-persona-critique/SKILL.md -- Multi-persona evaluation (may not exist yet; inline fallback in references)
skills/meta/skill-creator/SKILL.md -- Skill creation methodology
skills/meta/agent-comparison/SKILL.md -- A/B testing methodology
skills/infrastructure/headless-cron-creator/SKILL.md -- Cron job creation patterns