com um clique
eval-harness-skill
A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
Menu
A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
Universal coding standards, best practices, and patterns. Use when developing in any language — triggers on TypeScript, JavaScript, React, Node.js, Python, Nix, ruff, pyright, pytest, uv, flake.nix, justfile, just, recipes, and general code quality topics.
Use when writing git commit messages, reviewing commits, or setting up commit conventions. Triggers on commit, git commit, commit message, changelog, semantic versioning.
Audit NixOS impermanence configuration — find files on root filesystem not covered by persistence declarations. Use when the user wants to check for untracked files, audit impermanence, or runs /impermanence-audit.
Docker-in-Docker with network_mode host for multi-node simulation
Use when implementing LangGraph workflows that need to pause for user input or external confirmation before continuing execution
Workaround for @nuxt/eslint not auto-detecting TypeScript, causing vue-eslint-parser to fail on <script lang="ts"> blocks
| name | Eval Harness Skill |
| description | A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles. |
| license | Complete terms in LICENSE.txt |
A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
Eval-Driven Development treats evals as the "unit tests of AI development":
Test if Claude can do something it couldn't before:
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result
Ensure changes don't break existing functionality:
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
Deterministic checks using code:
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
Use Claude to evaluate open-ended outputs:
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?
Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
Flag for manual review:
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
"At least one success in k attempts"
"All k trials succeed"
## EVAL DEFINITION: feature-xyz
### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely
### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact
### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
Write code to pass the defined evals.
# Run capability evals
[Run each capability eval, record PASS/FAIL]
# Run regression evals
npm test -- --testPathPattern="existing"
# Generate report
EVAL REPORT: feature-xyz
========================
Capability Evals:
create-user: PASS (pass@1)
validate-email: PASS (pass@2)
hash-password: PASS (pass@1)
Overall: 3/3 passed
Regression Evals:
login-flow: PASS
session-mgmt: PASS
logout-flow: PASS
Overall: 3/3 passed
Metrics:
pass@1: 67% (2/3)
pass@3: 100% (3/3)
Status: READY FOR REVIEW
/eval define feature-name
Creates eval definition file at .claude/evals/feature-name.md
/eval check feature-name
Runs current evals and reports status
/eval report feature-name
Generates full eval report
Store evals in project:
.claude/
evals/
feature-xyz.md # Eval definition
feature-xyz.log # Eval run history
baseline.json # Regression baselines
## EVAL: add-authentication
### Phase 1: Define (10 min)
Capability Evals:
- [ ] User can register with email/password
- [ ] User can login with valid credentials
- [ ] Invalid credentials rejected with proper error
- [ ] Sessions persist across page reloads
- [ ] Logout clears session
Regression Evals:
- [ ] Public routes still accessible
- [ ] API responses unchanged
- [ ] Database schema compatible
### Phase 2: Implement (varies)
[Write code]
### Phase 3: Evaluate
Run: /eval check add-authentication
### Phase 4: Report
EVAL REPORT: add-authentication
==============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT