| name | evaluate-improve |
| description | Suggest improvements to SKILL.md content, descriptions, or tool config from eval results. Use when raising pass rates, fixing triggering, or iterating on a skill after evaluation. |
| args | <plugin/skill-name> [--apply] [--description-only] [--best-of N] |
| allowed-tools | Task, Read, Write, Edit, Glob, Grep, Bash(bash *), Bash(python3 *), Bash(cat *), Bash(jq *), Bash(find *), Bash(diff *), AskUserQuestion, TodoWrite |
| argument-hint | git-plugin/git-commit [--apply] [--best-of 3] |
| created | "2026-03-04T00:00:00.000Z" |
| modified | "2026-06-23T00:00:00.000Z" |
| reviewed | "2026-03-04T00:00:00.000Z" |
/evaluate:improve
Analyze evaluation results and suggest concrete improvements to a skill.
When to Use This Skill
| Use this skill when... | Use alternative when... |
|---|
| Have eval results and want to improve the skill | Need to run evals first -> /evaluate:skill |
| Want to improve skill description for better triggering | Want to view raw results -> /evaluate:report |
| Iterating on a skill to increase pass rate | Want to file a bug -> /feedback:session |
| Optimizing skill instructions after benchmarking | Need structural fixes -> plugin-compliance-check.sh |
Parameters
Parse these from $ARGUMENTS:
| Parameter | Default | Description |
|---|
<plugin/skill-name> | required | Path as plugin-name/skill-name |
--apply | false | Apply approved changes to SKILL.md |
--description-only | false | Focus on description improvements only |
--best-of N | 1 | Generate N candidate revisions and apply the eval-ranked winner (requires --apply) |
--force-apply | false | Apply even when the delta-verify gate shows the edit does not shrink the source-failure set (override; requires --apply) |
Execution
Step 1: Load eval results
Read the most recent benchmark from:
<plugin-name>/skills/<skill-name>/eval-results/benchmark.json
If no results exist, suggest running /evaluate:skill first and stop.
Also read the current SKILL.md to understand the skill.
Capture the source-failure set. From the benchmark, record the set of
eval-case IDs that failed with the skill active — these are the cases the
forthcoming edit is meant to fix, and they are the input to the delta-verify
gate below:
cat <plugin>/skills/<skill>/eval-results/benchmark.json \
| jq -r '[.cases[] | select(.with_skill.passed == false) | .id]'
This set is distinct from the golden evals.json suite as a whole: the golden
set measures overall pass rate, the source-failure set measures whether the
edit fixed the specific failures that motivated it (AEGIS delta-verify). If
the set is empty (a clean benchmark, or no per-case data), there is nothing for
the gate to verify — skip it and proceed.
Step 2: Analyze results
Delegate analysis to the eval-analyzer agent via Task:
Task subagent_type: eval-analyzer
Prompt: Analyze these evaluation results and identify improvement opportunities.
Skill: <path to SKILL.md>
Benchmark: <benchmark.json contents>
Mode: comparison (if baseline data exists) or benchmark (otherwise)
The analyzer produces categorized suggestions:
- instructions: Execution flow improvements
- description: Better intent-matching text
- examples: Missing or insufficient examples
- error_handling: Missing edge cases
- tools: Better tool configurations
- structure: Organizational improvements
Step 3: Filter suggestions
If --description-only, filter to only description category suggestions.
Sort remaining suggestions by priority (high > medium > low).
Step 4: Present suggestions
Present the categorized suggestions to the user:
## Improvement Suggestions: <plugin/skill-name>
Current pass rate: 72%
### High Priority
1. **[instructions]** Add explicit error handling for missing git config
Evidence: eval-003 fails because the skill doesn't check for git user.name
2. **[description]** Add "conventional commit" as trigger phrase
Evidence: Skill not selected when user says "make a conventional commit"
### Medium Priority
3. **[examples]** Add breaking change example to execution steps
Evidence: eval-004 inconsistently handles breaking changes
### Low Priority
4. **[structure]** Move flag reference to Quick Reference table
Evidence: Flags scattered across multiple sections
If --apply is NOT set, stop here.
Delta-verify gate (AEGIS source-cases — required before any apply)
Before any edit is written to the live SKILL.md — both the plain --apply
path (Step 5) and the --best-of path (Step 5a) — confirm the edit actually
shrinks the source-failure set captured in Step 1, not merely that the
overall golden-set pass rate is higher. Ranking by aggregate pass rate can
reward a candidate that fixes unrelated cases while leaving the motivating
failures broken; this gate closes that gap (HarnessX/AEGIS: re-run on the
source cases, confirm the failure count shrinks before applying).
Run the gate against the drafted candidate (a candidate file under
eval-results/candidates/, or for plain --apply a draft written there
first), never the live SKILL.md:
- Re-run only the source-failure cases against the candidate — spawn one
Task subagent (
subagent_type: general-purpose) per case with the candidate
content as the skill context (the same rollout machinery as Step 5a; use
prepare_run.sh), and grade each transcript with
python3 evaluate-plugin/scripts/grade_deterministic.py.
- Compute
delta = (source failures before) − (source failures after).
- Gate: apply only when
delta > 0 (the candidate fixes at least one
motivating failure and regresses none of the others). When delta <= 0, do
not write the edit — report which source cases still fail and suggest
revising the suggestions. --force-apply overrides the gate (records the
override in history). When the source-failure set is empty, the gate is a
no-op and the apply proceeds.
Step 5: Apply changes (if --apply)
Use AskUserQuestion to let the user select which suggestions to apply:
Which improvements should I apply?
[x] Add error handling for missing git config
[x] Add trigger phrases to description
[ ] Add breaking change example
[ ] Restructure flag reference
If --best-of N with N > 1, follow Step 5a to pick the winning revision
first, then continue with the apply flow below using the winner's content.
Draft the approved edits into a candidate file and run them through the
Delta-verify gate above. Only proceed to write the live SKILL.md when the
gate passes (or --force-apply is set). For each approved suggestion:
- Read the current SKILL.md
- Apply the change using Edit
- Update the
modified date in frontmatter
Step 5a: Generate and rank candidates (if --best-of N > 1)
Instead of drafting the approved edits once, generate N alternative drafts and
let evaluation pick the winner.
-
Generate candidates. Write N complete candidate revisions of the
SKILL.md to <plugin>/skills/<skill>/eval-results/candidates/candidate-<i>.md
(the eval-results/ tree is gitignored). Each candidate implements the
approved suggestions with a genuinely different strategy — different
instruction placement, phrasing, or example choice — not paraphrases of one
draft.
-
Rank with real grading when evals exist. If the skill has evals.json:
- For each candidate, run one pass per eval case: spawn a Task subagent
(
subagent_type: general-purpose) that receives the candidate
content as the skill context and executes the eval prompt (mirrors
/evaluate:skill Step 4; use prepare_run.sh for the run directories).
- Grade each transcript with
python3 evaluate-plugin/scripts/grade_deterministic.py — typed checks
grade for zero judge tokens; defer fuzzy assertions to the eval-grader
agent.
- Rank candidates by source-failure delta first (how many of the Step 1
source-failure cases each candidate fixes — the Delta-verify gate signal),
then by mean golden-set pass rate, so a candidate that lifts the aggregate
while leaving the motivating failures broken never wins. Break remaining
ties with the
eval-comparator agent: blind pairwise comparison of the
tied candidates' transcripts. Discard any candidate with delta <= 0
unless --force-apply is set.
-
Fall back to blind self-preference when no evals exist. Without
evals.json there are no prompts to roll out. Rank via the
eval-comparator agent — pairwise, candidates presented as Output A/B,
the analyzer's weakness list passed as the assertions. Flag this in the
report as a weaker signal and suggest re-running /evaluate:skill with
--create-evals first.
-
Apply the winner through the Step 5 apply flow, and record the ranking
in the history entry (Step 5b below): a candidates array with each
candidate's id, pass rate (or comparison score), and a selected flag.
Token cost is bounded at N × eval cases × 1 run plus grading; treat --best-of
without a number as N=3. Prefer this mode for skills that have evals.json —
deterministic ranking of real rollouts is the point; text-only self-preference
is the fallback.
Step 5b: Record history
After applying changes, update (or create) the history file at:
<plugin-name>/skills/<skill-name>/eval-results/history.json
Add a new iteration entry recording:
- Version number (increment from previous)
- Timestamp
- Pass rate from current benchmark
- Summary of changes made
- Delta-verify result:
source_failures_before, source_failures_after, and the resulting source_failure_delta (and whether --force-apply overrode a non-positive delta)
- Candidate ranking when
--best-of was used: a candidates array of {id, pass_rate, source_failure_delta, selected} (use comparison_score instead of pass_rate for the no-evals fallback)
Step 6: Suggest re-evaluation
After applying changes, suggest:
Changes applied. Run `/evaluate:skill <plugin/skill-name>` to measure improvement.
Agentic Optimizations
| Context | Command |
|---|
| Read benchmark | cat <plugin>/skills/<skill>/eval-results/benchmark.json | jq .summary |
| Read skill | cat <plugin>/skills/<skill>/SKILL.md |
| Read history | cat <plugin>/skills/<skill>/eval-results/history.json | jq '.iterations[-1]' |
| Check pass rate | cat <plugin>/skills/<skill>/eval-results/benchmark.json | jq '.summary.with_skill.mean_pass_rate' |
| Source-failure set | cat <plugin>/skills/<skill>/eval-results/benchmark.json | jq -r '[.cases[] | select(.with_skill.passed == false) | .id]' |
Quick Reference
| Flag | Description |
|---|
--apply | Apply approved changes to SKILL.md |
--description-only | Focus on description improvements only |
--best-of N | Generate N candidate revisions, rank by source-failure delta then pass rate, apply winner |
--force-apply | Apply even when the delta-verify gate shows the edit does not shrink the source-failure set |