skill-eval
// Evaluate skills: trigger testing, A/B benchmarks, structure validation.
| name | skill-eval |
| description | Evaluate skills: trigger testing, A/B benchmarks, structure validation. |
| user-invocable | false |
| argument-hint | <skill-name> |
| allowed-tools | ["Read","Write","Bash","Grep","Glob","Agent"] |
| routing | {"triggers":["improve skill","test skill","eval skill","benchmark skill","skill triggers","skill quality","self-improve skill","skill self-improvement","improve skill with variants"],"pairs_with":["agent-evaluation","verification-before-completion"],"complexity":"Medium-Complex","category":"meta"} |
Measure and improve skill quality through empirical testing — because structure doesn't guarantee behavior, and measurement beats assumption.
| Signal | Load These Files | Why |
|---|---|---|
| Questions about evals.json, grading.json, or benchmark.json formats | schemas.md | JSON schemas for eval sets, grading, and benchmark output. |
| "Self-improve skill" / A/B variant requests | self-improve-loop.md | Full five-phase self-improvement protocol: variants, blind A/B testing, promotion. |
Step 1: Identify the skill
# Validate skill structure first
python3 -m scripts.skill_eval.quick_validate <path/to/skill>
This checks: SKILL.md exists, valid frontmatter, required fields (name, description), kebab-case naming, description under 1024 chars, no angle brackets.
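The checks above can be sketched in a few lines of Python. This is a minimal sketch, not the actual quick_validate.py: the frontmatter parsing is simplified to one `key: value` pair per line between `---` delimiters.

```python
import re
from pathlib import Path

def quick_validate(skill_dir: str) -> list:
    """Return a list of structural problems for a skill directory (empty = valid)."""
    errors = []
    skill_md = Path(skill_dir) / "SKILL.md"
    if not skill_md.exists():
        return ["SKILL.md not found"]

    text = skill_md.read_text(encoding="utf-8")
    match = re.match(r"^---\n(.*?)\n---", text, re.DOTALL)
    if not match:
        return ["missing frontmatter block"]

    # Simplified parse: one "key: value" pair per line.
    fields = {}
    for line in match.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()

    for required in ("name", "description"):
        if not fields.get(required):
            errors.append("missing required field: " + required)

    name = fields.get("name", "")
    if name and not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name):
        errors.append("name is not kebab-case")

    desc = fields.get("description", "")
    if len(desc) > 1024:
        errors.append("description exceeds 1024 chars")
    if "<" in desc or ">" in desc:
        errors.append("description contains angle brackets")

    return errors
```

The kebab-case regex accepts `skill-eval` but rejects `Skill_Eval`, matching the naming rule above.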
Step 2: Choose evaluation mode based on user intent
| Intent | Mode | Script |
|---|---|---|
| "Test if description triggers correctly" | Trigger eval | run_eval.py |
| "Optimize/improve the description through autoresearch" | Route to agent-comparison | optimize_loop.py |
| "Compare skill vs no-skill output" | Output benchmark | Manual + aggregate_benchmark.py |
| "Validate skill structure" | Quick validate | quick_validate.py |
| "Self-improve skill" / "optimize skill" / "improve skill with A/B" | Self-improvement loop | references/self-improve-loop.md |
GATE: Skill path confirmed, mode selected.
Test whether a skill's description causes Claude to invoke it for the right queries.
Step 1: Create eval set (or use existing)
Create a JSON file with 8-20 test queries. Eval set quality matters — use realistic prompts with detail (file paths, context, casual phrasing), not abstract one-liners. Focus on edge cases where the skill competes with adjacent skills.
Example of good eval queries:
[
{"query": "ok so my boss sent me this xlsx file (Q4 sales final FINAL v2.xlsx) and she wants profit margin as a percentage", "should_trigger": true},
{"query": "Format this data", "should_trigger": false}
]
Why: Real users write detailed, specific prompts. Abstract queries don't test real triggering behavior. Overfitting descriptions to abstract test cases bloats the description and fails on real usage.
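A small sanity check before running the eval catches malformed sets early. This sketch assumes the two-field record shape shown in the example above; the full schema lives in references/schemas.md.

```python
import json

def check_eval_set(path: str) -> None:
    """Fail fast if an eval set is malformed or too small to be meaningful."""
    with open(path, encoding="utf-8") as f:
        cases = json.load(f)
    assert isinstance(cases, list) and 8 <= len(cases) <= 20, "use 8-20 queries"
    for case in cases:
        assert isinstance(case.get("query"), str) and case["query"], \
            "query must be a non-empty string"
        assert isinstance(case.get("should_trigger"), bool), \
            "should_trigger must be a bool"
    # A useful set tests both directions: queries that should and should not trigger.
    labels = {c["should_trigger"] for c in cases}
    assert labels == {True, False}, "include both positive and negative cases"
```

Running this before run_eval.py avoids burning 8-20 claude invocations on a set that was never going to produce a meaningful signal.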
Step 2: Run evaluation
python3 -m scripts.skill_eval.run_eval \
--eval-set evals.json \
--skill-path <path/to/skill> \
--runs-per-query 3 \
--verbose
This spawns claude -p for each query, checking whether it invokes the skill. Runs each query 3 times for reliability. Output includes pass/fail per query with trigger rates. Default 30s timeout; increase with --timeout 60 if needed for complex queries.
Constraints applied: 3 runs per query for stable trigger rates, and a 30s per-invocation timeout by default.
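The per-query scoring can be sketched like this. `invoke_once` is a hypothetical hook, not part of the actual scripts; assume it spawns one `claude -p` run and reports whether the skill was invoked.

```python
def score_query(invoke_once, query: str, should_trigger: bool, runs: int = 3) -> dict:
    """Run one eval query `runs` times and compare its trigger rate to expectation.

    `invoke_once(query) -> bool` is a hypothetical callback that spawns
    `claude -p` and reports whether the skill was invoked.
    """
    hits = sum(1 for _ in range(runs) if invoke_once(query))
    rate = hits / runs
    # A query passes when the majority of runs match the expectation.
    passed = (rate >= 0.5) == should_trigger
    return {"query": query, "trigger_rate": rate, "passed": passed}
```

With 3 runs the possible trigger rates are 0, 1/3, 2/3, and 1, so the majority comparison is never ambiguous.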
GATE: Eval results available. Proceed to improvement if failures found.
Automated loop that tests, improves, and re-tests descriptions using Claude with extended thinking.
python3 -m scripts.skill_eval.run_loop \
--eval-set evals.json \
--skill-path <path/to/skill> \
--max-iterations 5 \
--verbose
This will:
- Split the eval set roughly 60/40 into training and holdout queries
- Run the trigger eval, then use claude -p to propose improved descriptions based on training failures
- Re-test each candidate against the training set and validate the best on the holdout set
- Repeat for up to --max-iterations rounds and write an HTML report
Why 60/40 split: improvements should help across many prompts, not just the test cases. Training on failures while validating on holdout ensures generalization.
Why report HTML: Visual reports enable quick review of which queries improved, which regressed, and what the new description looks like.
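The split-and-validate idea can be sketched as follows. This is a minimal sketch: the deterministic shuffle and the accept rule are assumptions about the loop's behavior, not the exact run_loop.py logic.

```python
import random

def split_eval_set(cases: list, train_frac: float = 0.6, seed: int = 0):
    """Shuffle deterministically, then split into training and holdout sets."""
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def accept_candidate(train_score: float, train_baseline: float,
                     holdout_score: float, holdout_baseline: float) -> bool:
    """Accept a new description only if it improves on training
    AND does not regress on the holdout set."""
    return train_score > train_baseline and holdout_score >= holdout_baseline
```

The holdout check is what separates a genuinely better description from one overfit to the training failures.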
GATE: Loop complete. Best description identified.
Compare skill quality by running prompts with and without the skill.
Step 1: Create test prompts — 2-3 realistic user prompts
Step 2: Run with-skill and without-skill in parallel subagents
For each test prompt, spawn two agents: one with the skill loaded and one without, each answering the same prompt.
Why baseline matters: Can't prove the skill adds value without a baseline. Maybe Claude handles it fine without the skill. The delta is what matters.
Step 3: Grade outputs
Spawn a grader subagent using agents/grader.md. It evaluates assertions against the outputs.
Step 4: Aggregate
python3 -m scripts.skill_eval.aggregate_benchmark <workspace>/iteration-1 --skill-name <name>
Produces benchmark.json and benchmark.md with pass rates, timing, and token usage.
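The aggregation step boils down to computing assertion pass rates per condition. A sketch, assuming a simplified record shape of `{"condition", "passed", "total"}`; the real grading.json schema is documented in references/schemas.md.

```python
import json
from collections import defaultdict

def aggregate(grading_path: str) -> dict:
    """Compute assertion pass rates for with-skill vs. without-skill runs.

    Assumes records like {"condition": "with_skill", "passed": 3, "total": 4};
    the actual grading.json schema lives in references/schemas.md.
    """
    with open(grading_path, encoding="utf-8") as f:
        records = json.load(f)
    totals = defaultdict(lambda: [0, 0])  # condition -> [passed, total]
    for r in records:
        totals[r["condition"]][0] += r["passed"]
        totals[r["condition"]][1] += r["total"]
    return {cond: round(p / t, 3) for cond, (p, t) in totals.items() if t}
```

The delta between the two conditions' pass rates is the number that matters: it is the skill's measured contribution over baseline.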
Step 5: Analyze (optional)
For blind comparison, use agents/comparator.md to judge outputs without knowing which skill produced them. Then use agents/analyzer.md to understand why the winner won.
GATE: Benchmark results available.
python3 -m scripts.skill_eval.quick_validate <path/to/skill>
Checks: SKILL.md exists, valid frontmatter, required fields (name, description), kebab-case naming, description under 1024 chars, no angle brackets.
Automatically generate variants of a skill, A/B test them against the original, and promote winners. This is a closed-loop pipeline — baseline, hypothesize, generate, test, promote.
Read the full protocol: ${CLAUDE_SKILL_DIR}/references/self-improve-loop.md
The loop runs 5 phases: BASELINE (establish metrics with 3+ test cases), HYPOTHESIZE (2-3 single-variable changes), GENERATE VARIANTS (minimal diffs), BLIND A/B TEST (paired comparison via agents/comparator.md), PROMOTE OR KEEP (60%+ win rate required, no regressions). All outcomes — wins and losses — are recorded to the learning DB to prevent re-testing failed hypotheses.
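The promotion rule in the final phase can be sketched as below. The `"win"/"loss"/"tie"` encoding is an assumption for illustration; here "no regressions" is interpreted as no outright losses in the paired comparisons.

```python
def should_promote(ab_results: list, win_threshold: float = 0.6) -> bool:
    """Promote a variant only if it wins >= 60% of paired comparisons
    and never regresses (no outright losses to the original).

    Each entry is "win", "loss", or "tie" from the variant's perspective.
    """
    if not ab_results:
        return False
    wins = ab_results.count("win")
    losses = ab_results.count("loss")
    return wins / len(ab_results) >= win_threshold and losses == 0
```

Either way the decision goes, record the hypothesis and outcome to the learning DB so the same failed change is not re-tested later.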
GATE: Self-improvement protocol loaded from reference. Proceed through the 5 phases.
Step 1: Review results
For trigger eval / description optimization: review per-query pass/fail and trigger rates, plus the HTML report if the loop ran.
For output benchmark: review pass rates, timing, and token usage in benchmark.md.
Step 2: Apply changes (with user confirmation)
If description optimization found a better description: update the description field in SKILL.md's frontmatter, then re-run quick_validate to confirm the structure is still valid.
Constraint: Always show results before/after with metrics. This enables informed decisions.
GATE: Changes applied and validated, or user chose to keep original.
Cause: Skill path doesn't point to a valid skill directory
Solution: Verify path contains a SKILL.md file. Skills must follow the skill-name/SKILL.md structure.
Cause: Claude CLI not available for trigger evaluation
Solution: Install Claude Code CLI. Trigger eval requires claude -p to test skill invocation.
Cause: Outdated instructions or an old checkout still expects a direct SDK client
Solution: Update to the current scripts. Description optimization now runs through claude -p.
Cause: Running eval from inside a Claude Code session blocks nested instances
Solution: The scripts automatically strip the CLAUDECODE env var. If issues persist, run from a separate terminal.
Cause: Default 30s timeout too short for complex queries
Solution: Increase with --timeout 60. Simple trigger queries should complete in <15s.
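The timeout and env-var handling described above can be sketched as a small wrapper. A sketch under stated assumptions: `run_with_timeout` is a hypothetical helper, not one of the shipped scripts, though the CLAUDECODE stripping mirrors what they do.

```python
import os
import subprocess

def run_with_timeout(cmd: list, timeout_s: int = 30):
    """Run a command with a hard timeout; return stdout, or None on timeout.

    Strips the CLAUDECODE env var so a nested `claude -p` instance is not
    blocked when the eval is run from inside a Claude Code session.
    """
    env = {k: v for k, v in os.environ.items() if k != "CLAUDECODE"}
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s, env=env)
        return result.stdout
    except subprocess.TimeoutExpired:
        return None
```

For a complex query, `run_with_timeout(["claude", "-p", query], 60)` mirrors passing --timeout 60 to the eval scripts.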
Scripts (scripts/skill_eval/):
- run_eval.py — Trigger evaluation: tests the description against a query set
- run_loop.py — Eval+improve loop: automated description optimization
- improve_description.py — Single-shot description improvement via claude -p
- generate_report.py — HTML report from loop output
- aggregate_benchmark.py — Benchmark aggregation from grading results
- quick_validate.py — Structural validation of SKILL.md

Agents (skills/skill-eval/agents/):
- grader.md — Evaluates assertions against execution outputs
- comparator.md — Blind A/B comparison of two outputs
- analyzer.md — Post-hoc analysis of why one version beat another

References:
- ${CLAUDE_SKILL_DIR}/references/schemas.md — JSON schemas for evals.json, grading.json, benchmark.json
- ${CLAUDE_SKILL_DIR}/references/self-improve-loop.md — Self-improvement loop protocol: variant generation, blind A/B testing, promotion criteria