eval-skills
Run structured evaluations on skills to measure quality and track improvements.
| name | eval-skills |
| description | Run structured evaluations on skills to measure quality and track improvements. |
| argument-hint | [skill-name ...] (e.g. local-code-review review-architecture) |
If skill names are provided via $ARGUMENTS, evaluate those. Otherwise, list the skills
that have an evals/evals.json file and ask the user to pick (accept "all").
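A minimal sketch of the discovery step, assuming each skill lives in its own directory under a common root and keeps its test cases at `<skill>/evals/evals.json`. The root path and the function names `skills_with_evals` and `resolve_selection` are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def skills_with_evals(skills_root: Path) -> list[str]:
    """Return the names of skill directories that contain evals/evals.json."""
    if not skills_root.is_dir():
        return []
    return sorted(
        entry.name
        for entry in skills_root.iterdir()
        if (entry / "evals" / "evals.json").is_file()
    )


def resolve_selection(choice: str, available: list[str]) -> list[str]:
    """Turn the user's answer into a list of skill names; "all" selects everything."""
    names = choice.split()
    if names == ["all"]:
        return list(available)
    return names
```

An empty result from `resolve_selection` can be treated as "nothing chosen yet", so the caller asks again.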
mkdir -p .claude/evals-workspace/iteration-<N>
Use the next sequential iteration number for <N>.
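One way to pick the next number is to scan the workspace for existing `iteration-<N>` directories. The sketch below assumes numbering starts at 1 when no iteration exists yet; `next_iteration` is a hypothetical helper name.

```python
import re
from pathlib import Path


def next_iteration(workspace: Path) -> int:
    """Return the next sequential iteration number for the workspace."""
    pattern = re.compile(r"iteration-(\d+)$")
    numbers = []
    if workspace.is_dir():
        for child in workspace.iterdir():
            match = pattern.match(child.name)
            if child.is_dir() and match:
                numbers.append(int(match.group(1)))
    # Assumption: the first iteration is numbered 1.
    return max(numbers, default=0) + 1
```

Gaps are not reused: with `iteration-1` and `iteration-3` present, the next number is 4.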
For each test case in evals.json, run it twice, once with the skill and once without:
iteration-<N>/<skill>-<id>/with_skill/outputs/
iteration-<N>/<skill>-<id>/without_skill/outputs/
Each run starts with clean context.
Evaluate assertions against output. Save grading.json:
{
"assertion_results": [{"text": "...", "passed": true, "evidence": "..."}],
"summary": {"passed": 3, "failed": 1, "total": 4, "pass_rate": 0.75}
}
Require concrete evidence for every PASS.
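The grading record can be assembled mechanically once each assertion has been judged. The sketch below follows the grading.json shape shown above; treating a PASS with empty evidence as a failure is an assumption about how to enforce the evidence rule, and `build_grading` is a hypothetical name.

```python
import json


def build_grading(assertion_results: list[dict]) -> dict:
    """Build a grading.json-shaped dict from judged assertions."""
    checked = []
    for result in assertion_results:
        evidence = (result.get("evidence") or "").strip()
        # Assumption: a PASS without concrete evidence does not count.
        passed = bool(result.get("passed")) and bool(evidence)
        checked.append(
            {"text": result["text"], "passed": passed, "evidence": evidence}
        )
    total = len(checked)
    passed_count = sum(1 for r in checked if r["passed"])
    return {
        "assertion_results": checked,
        "summary": {
            "passed": passed_count,
            "failed": total - passed_count,
            "total": total,
            "pass_rate": passed_count / total if total else 0.0,
        },
    }


def save_grading(grading: dict, path: str) -> None:
    """Write the grading dict to disk as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(grading, handle, indent=2)
```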
Save iteration-<N>/benchmark.json with mean pass rates (with/without skill) and delta.
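A sketch of the benchmark computation, assuming the per-eval pass rates from each grading.json have already been collected into two lists. The key names `with_skill`, `without_skill`, and `delta` are illustrative; the source does not fix a schema for benchmark.json.

```python
from statistics import mean


def build_benchmark(with_skill: list[float], without_skill: list[float]) -> dict:
    """Compute mean pass rates for both configurations and their difference."""
    with_mean = mean(with_skill)
    without_mean = mean(without_skill)
    return {
        "with_skill": with_mean,
        "without_skill": without_mean,
        # Positive delta means the skill improved the pass rate.
        "delta": with_mean - without_mean,
    }
```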
Show per-eval pass rates, overall delta, always-pass candidates (remove?),
always-fail candidates (revise?). Save feedback to feedback.json.
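The candidate lists can be derived from the recorded outcomes. This sketch reads "always-pass" as an assertion that passed in every recorded run (with and without the skill) and "always-fail" as one that failed in every run; that reading, and the name `find_candidates`, are assumptions.

```python
def find_candidates(outcomes: dict[str, list[bool]]) -> tuple[list[str], list[str]]:
    """Split assertions into always-pass and always-fail candidates.

    `outcomes` maps assertion text to its pass/fail result in each run.
    Assertions with no recorded runs are ignored.
    """
    always_pass = sorted(text for text, runs in outcomes.items() if runs and all(runs))
    always_fail = sorted(text for text, runs in outcomes.items() if runs and not any(runs))
    return always_pass, always_fail
```

An always-pass assertion does not distinguish the skill from the baseline, which is why it is a removal candidate; an always-fail assertion may be unreachable or badly worded, hence the suggestion to revise it.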
Update SKILL.md based on the findings, run a new iteration, and compare benchmarks. Stop when pass rates plateau.
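"Plateau" is not defined in the source. One possible stopping rule, sketched below, is to stop when the with-skill mean pass rate has gained less than a small margin in each of the last few iterations. The thresholds `min_gain` and `window` and the name `has_plateaued` are assumptions to tune.

```python
def has_plateaued(pass_rates: list[float], min_gain: float = 0.02, window: int = 2) -> bool:
    """Return True when the last `window` iteration-to-iteration gains are all below `min_gain`.

    `pass_rates` holds one mean pass rate per iteration, oldest first.
    """
    if len(pass_rates) <= window:
        return False  # Not enough history to judge.
    recent = pass_rates[-(window + 1):]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(gain < min_gain for gain in gains)
```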