| name | eval |
| description | Eval-Driven Development (EDD) for AI workflows. pass@k metrics, capability evals, regression evals. Triggers: eval, edd, pass@k, capability, regression, benchmark. |
| allowed-tools | Read, Bash, Write, Edit, Grep, Glob |
AgentDB read-start has run. Check for existing eval definitions in _meta/research/.
Understand what behavior you're evaluating before writing evals.
Skill-specific: skills/eval/reference/eval-research.md
<core_principles>
- DEFINE BEFORE CODE: Evals written first force clear thinking about success criteria.
- CODE GRADERS > MODEL GRADERS: Deterministic checks beat probabilistic judgments.
- STRUCTURAL SEPARATION FOR HIGH-STAKES: When stakes are real (security, payments, eval-of-evals, agent quality scoring), use the blind-evaluator agent — never self-score. Self-scoring inflates results ~36% structurally; procedural separation ("I won't peek") does not fix it.
- TRACK PASS@K: pass@1 (first attempt), pass@3 (within 3 attempts). Target pass@3 > 90%.
- REGRESSION BEFORE SHIP: Every change must pass existing evals before merge.
- FAST EVALS GET RUN: Slow evals get skipped. Keep evaluation fast.
</core_principles>
1. DEFINE: Write eval criteria before implementation. (gate: criteria exist in writing before any code)
2. IMPLEMENT: Code to pass defined evals.
3. EVALUATE: Run evals, record pass@k. (gate: pass@3 > 90% for capability; pass^3 = 100% for regression)
4. REPORT: Document results in eval report format. See reference for template.
<blind_evaluation_protocol>
Use when implementing agent would otherwise score its own output (high-stakes: security, payments, agent quality):
- Spawn
agents/blind-evaluator.md as a fresh agent.
- Pass ONLY: problem statement, rubric (3-7 criteria with PASS conditions + weights), artifact path.
- Do NOT pass: implementer's checkpoint, summary, commit message, prompt, or expected solution.
- (gate: blind evaluator runs contamination check — if forbidden inputs detected, returns INVALID; clean inputs and retry)
- (gate: confidence < 0.7 from blind evaluator → escalate to human grader)
Two-phase eval protocol:
- Run 1: implementing agent solves cold, no eval feedback. Blind evaluator scores. This is the externally-reportable number.
- Run 2: implementing agent gets Run 1 score + rubric breakdown, then optimizes. For iteration only.
</blind_evaluation_protocol>
pass@k: "At least one success in k attempts"
- pass@1: First attempt success rate
- pass@3: Success within 3 attempts (typical target: > 90%)
pass^k: "All k trials succeed"
- pass^3: 3 consecutive successes
- Use for critical paths (auth, payments)
See reference for calculation formula and worked examples.
<grader_selection>
- Code-based (preferred): grep, test suite, build, type-check — deterministic, fast.
- Model-based: for open-ended outputs that can't be checked deterministically. Run multiple times, take majority.
- Human: required for security-sensitive changes, UX evaluation, legal/compliance.
See reference for full grader templates and examples.
</grader_selection>
<anti_patterns>
Writing evals after implementation tests existing bugs, not requirements.
Model-based grading is slow and probabilistic. Prefer code graders.
Every change must pass regression evals. No exceptions.
Evals that take > 30s get skipped. Keep them fast.
Track pass@k over time. Declining reliability is a signal.
For any user-facing or high-stakes eval, the implementing agent scoring its own work inflates results ~36%. Spawn blind-evaluator instead.
Evaluating against a codebase that already contains the canonical solution = answer key in the eval set. Use pre-merge snapshots or a separate fixture.
Greenfield tickets in the golden eval set collapse to self=10, blind=3. Greenfields are not evaluable as solved tasks — exclude them from the dataset.
Optimizing how much context the evaluator gets before establishing a baseline score = can't distinguish signal from noise. Run minimal-context baseline first, then test additions one at a time.
</anti_patterns>
<on_complete>
agentdb write-end '{"skill":"eval","eval_type":"capability|regression","pass_at_1":"<X%>","pass_at_3":"<Y%>","failures":[""]}'
Record eval type, pass rates, and any failures for future reference.
</on_complete>