| name | skill-tester |
| description | Test skills end-to-end: design test cases, run with/without skill, grade results,
test description triggering accuracy, produce improvement report.
Use when: "протестируй скилл", "запусти тесты для скилла", "проверь скилл",
"run skill tests", "test this skill", "skill eval", "оцени скилл",
"придумай тесты для скилла", "создай сценарии тестирования"
|
Skill Tester
Design tests, run them, grade results, evaluate description triggering, produce
actionable report. All in one workflow — no separate test-design step needed.
You are team lead, test designer, user actor, and analyst — all in one.
Phase 1: Understand & Design
1a. Read the target skill
- User provides skill name or path
- Read the target skill's SKILL.md + ALL referenced files completely
- Map out:
- Skill type: procedural / informational
- Input: what does the skill expect? (user message, task file, structured data)
- Output: what should the skill produce? (files, messages, actions, decisions)
- Phases: list all phases/steps with their checkpoints
- References: list all files the skill tells agents to read
- Decision points: where does the skill branch based on input?
- Dialogue points: where does the skill ask the user questions?
1b. Design test prompts
Design test prompts applying criteria from test-design-guide.md
(realistic prompts, assertion design, persona setup):
- Propose 2-3 test prompts:
- 1 happy-path: the most common, standard use case
- 1-2 edge cases: where the skill might break or behave unexpectedly
- For each prompt, propose assertions — binary, observable checks.
Categories:
- [Process]: Did the agent follow the skill's workflow?
- [Outcome]: Is the result correct and complete?
- [Compliance]: Did the agent obey all skill instructions?
1c. Design trigger eval queries
Design trigger eval queries applying patterns from trigger-eval-guide.md:
- Generate 15-20 trigger eval queries:
- 8-10 should-trigger: varied phrasings of the skill's intended use
- 8-10 should-not-trigger: near-misses that share keywords but need
something different
- These will be used in Phase 4 to test description accuracy
1d. Confirm with user
Present the full test plan:
- Test prompts with assertions
- Trigger eval queries
- Proposed model for runners
- Persona (use default, modify only if user requests)
"Here are the test cases and trigger queries I've prepared. Do these look
right, or do you want to adjust anything?"
Wait for confirmation before proceeding.
Checkpoint: User confirmed test plan. All prompts have assertions.
Trigger eval queries prepared.
Phase 2: Execute Tests
2a. Setup
- Create workspace:
~/.claude/skill-tests/{skill-name}/iteration-{N}/
(N = 1 for first run, increment for re-runs)
- Save test plan to workspace:
evals.json with all prompts and assertions
trigger-evals.json with all trigger queries
- TeamCreate(team_name="skill-test-{skill-name}")
- Plan runners: per scenario = 2 with-skill + 1 baseline without skill
Show plan to user: "I'll run {N} scenarios, {M} runners total.
Model: {model}. Proceed?"
2b. Spawn runners
For each scenario, spawn all runners in parallel:
With-skill runners (2 per scenario):
- Prompt = scenario's task prompt (natural, as user would write)
- Each runner loads the tested skill:
Skill(skill="{tested-skill-name}")
- Model: as confirmed with user
- Use
run_in_background: true
Baseline runner (1 per scenario):
- Same task prompt, same model
- Receives no skill to load
- Use
run_in_background: true
Save each runner's task_id — needed for grader agents to retrieve transcripts.
Scenarios run sequentially. Runners within a scenario run in parallel.
2c. Interact as user persona
If runners send questions, answer in character per the scenario's persona.
Rules:
- Stay in character: answer as the user would
- Be consistent: same question from different runners → same answer
- Answer naturally — without guidance toward any specific behavior
- Keep conversation purely about the task itself
- Baseline runner may ask different questions (no skill to guide it) —
this is expected, answer them too
2d. Capture timing data
When each runner completes, immediately save timing data:
{
"total_tokens": 84852,
"duration_ms": 23332,
"total_duration_seconds": 23.3
}
This is the only opportunity to capture this data — it comes through the
task notification and isn't persisted elsewhere. Process each notification
as it arrives rather than trying to batch them.
Save to timing.json in each runner's result directory.
Checkpoint: All runners completed. Timing data captured.
Phase 3: Grade & Analyze
3a. Grade via grader agents
When all runners for a scenario finish, spawn grader agents — one per runner.
Delegate transcript analysis to grader agents — transcripts are large and
reading them directly would exhaust the lead agent's context, leaving no room
for report compilation.
Each grader receives instructions from
grading-guide.md and:
- The runner's task_id (grader calls
TaskOutput(task_id) for transcript)
- The scenario's assertions (copy the criteria list into the prompt)
- The skill's SKILL.md path (grader reads it for compliance check)
- Whether this is a skill-runner or baseline
Spawn all graders in parallel. Wait for all to return.
3b. Compile results per scenario
Using only grader outputs (not transcripts):
- Build results table (assertions × runners)
- Cross-runner consistency: where did skill-runners diverge?
- Divergence on a criterion = ambiguous instruction in the skill
- Baseline comparison:
- Passed by skill-runners ONLY → skill adds value
- Passed by ALL → criterion too easy or skill doesn't help here
- Failed by ALL → criterion may be unrealistic
- Passed by baseline ONLY → skill might be harmful for this case
Clean up runners for this scenario before moving to the next one.
3c. Benchmark aggregation
Across all scenarios, compute:
- Pass rate per assertion per config (with-skill / baseline)
- Timing comparison: tokens and duration per config
- Overall skill value: how many assertions improve vs baseline
3d. Analyst pass
Surface patterns the aggregate stats might hide:
- Non-discriminating assertions: pass regardless of whether skill is
used. These don't prove the skill helps — consider removing or replacing
with harder assertions.
- High-variance assertions: one skill-runner passes, other fails on
same criterion. Usually means the skill's instruction is ambiguous —
identify the specific instruction and quote it.
- Time/token tradeoffs: skill adds value but costs 2x tokens? Flag it.
The user should know the cost of improvement.
- Repeated code in transcripts: if multiple runners independently wrote
similar helper scripts, flag this as a candidate for bundling in
the skill's
scripts/ directory.
Checkpoint: All scenarios graded. Benchmark computed. Analyst observations
recorded.
Phase 4: Test Description Triggering
4a. Evaluate trigger accuracy
For each trigger eval query from trigger-evals.json:
- Assess whether the skill's current description would cause Claude to
invoke the skill for this query
- Consider: does the query's intent match the description's keywords and
contexts? Would Claude see this as the skill's domain?
Categorize each query:
- True positive: should-trigger → would trigger
- True negative: should-not-trigger → would not trigger
- False negative: should-trigger → would NOT trigger (undertriggering)
- False positive: should-not-trigger → would trigger (overtriggering)
4b. Calculate trigger accuracy
Trigger accuracy = (true positives + true negatives) / total queries
False negative rate = false negatives / should-trigger queries
False positive rate = false positives / should-not-trigger queries
False negatives (undertriggering) are the most costly — users won't discover
the skill exists. False positives waste time but are less harmful.
4c. Suggest improved description
If trigger accuracy < 85% or false negative rate > 20%:
- Analyze which queries fail and why
- Draft an improved description that would trigger correctly
- Show before/after comparison in the report
Checkpoint: Trigger accuracy calculated. Description improvement
suggested if needed.
Phase 5: Report
Structure the report according to
report-template.md.
The report includes:
- Results per scenario — assertions × runners table with evidence
- Skill compliance — phase-by-phase execution check
- Benchmark summary — pass_rate, tokens, time per config
- Analyst observations — non-discriminating, high-variance, cost analysis
- Description trigger accuracy — accuracy metrics + suggested improvement
- Scripts to bundle — if repeated code found across transcripts
- Recommendations — priority-ordered specific fixes for skill-master
Save to: ~/.claude/skill-tests/{skill-name}/reports/{timestamp}-report.md
Show report to user: "Here's the test report. Key findings: [summary].
The report is at [path] — you can share it with skill-master to apply fixes."
TeamDelete after report delivery.
Improving the Skill (Iteration)
If the user wants to iterate after receiving the report:
- User (or skill-master) applies fixes to the skill
- Run skill-tester again → results go to
iteration-{N+1}/
- Previous iteration results are available for comparison
- Report shows delta: what improved, what regressed
When iterating, keep these principles in mind:
- Generalize from feedback: resist fiddly changes targeted at specific
test cases. If a skill works only for its test cases, it's useless at scale.
- Keep the prompt lean: read transcripts. If the skill makes the model
waste time doing unproductive things, remove those parts.
- Explain the why: rather than adding rigid ALWAYS/NEVER rules, explain
reasoning so the model understands the intent.
Self-Verification