ワンクリックで
eval-skills
// Run structured evaluations on skills to measure quality and track improvements.
// Run structured evaluations on skills to measure quality and track improvements.
| name | eval-skills |
| description | Run structured evaluations on skills to measure quality and track improvements. |
| argument-hint | [skill-name ...] (e.g. local-code-review review-architecture) |
If names provided via $ARGUMENTS, evaluate those. Otherwise list skills
with evals/evals.json files and ask user to pick (accept "all").
mkdir -p .claude/evals-workspace/iteration-<N>
Use next sequential number.
For each test case in evals.json, run twice:
iteration-<N>/<skill>-<id>/with_skill/outputs/iteration-<N>/<skill>-<id>/without_skill/outputs/Each run starts with clean context.
Evaluate assertions against output. Save grading.json:
{
"assertion_results": [{"text": "...", "passed": true, "evidence": "..."}],
"summary": {"passed": 3, "failed": 1, "total": 4, "pass_rate": 0.75}
}
Require concrete evidence for every PASS.
Save iteration-<N>/benchmark.json with mean pass rates (with/without skill) and delta.
Show per-eval pass rates, overall delta, always-pass candidates (remove?),
always-fail candidates (revise?). Save feedback to feedback.json.
Update SKILL.md based on findings, run new iteration, compare benchmarks, stop when pass rates plateau.
Run the full development pipeline autonomously without pausing between phases. Stops only on quality-gate failures.
Run test coverage measurement, analyze results, and fix gaps when coverage falls below the 80% threshold.
Fetch unresolved PR review threads, triage them, implement fixes, validate, reply in-thread, and resolve.
Stage, commit, push, and open a GitHub PR following project conventions. Use when code is ready to ship.
Review local code changes for correctness, regressions, missing tests, and Databao-specific risks.
Ensure a YouTrack issue exists before starting work. Validates existing tickets or creates new ones.