تشغيل أي مهارة في Manus بنقرة واحدة

eval

النجوم١١

التفرعات٠

آخر تحديث٢٨ مايو ٢٠٢٦ في ٢٠:٣٦

Eval-Driven Development (EDD) for AI workflows. pass@k metrics, capability evals, regression evals. Triggers: eval, edd, pass@k, capability, regression, benchmark.

التثبيت

التثبيت باستخدام Codex أو Claude انسخ هذا Prompt والصقه في Codex أو Claude أو مساعد آخر ليراجع صفحة Skill ويثبّتها لك.

تشغيل في Manus

المصدر

ariaxhan

ariaxhan/kernel-claude

فتح مستودع GitHub عرض مستودعات المنشئ

تنزيل

تشغيل في Manus

المهن ذات الصلةSOC

استنادا إلى تصنيف SOC المهني

مطوّرو البرمجياتمهن الحاسوب والرياضيات·SOC 15-1252

مستكشف الملفات

2 ملفات

SKILL.md

readonly

name	eval
description	Eval-Driven Development (EDD) for AI workflows. pass@k metrics, capability evals, regression evals. Triggers: eval, edd, pass@k, capability, regression, benchmark.
allowed-tools	Read, Bash, Write, Edit, Grep, Glob

AgentDB read-start has run. Check for existing eval definitions in _meta/research/. Understand what behavior you're evaluating before writing evals. Skill-specific: skills/eval/reference/eval-research.md

<core_principles>

DEFINE BEFORE CODE: Evals written first force clear thinking about success criteria.
CODE GRADERS > MODEL GRADERS: Deterministic checks beat probabilistic judgments.
STRUCTURAL SEPARATION FOR HIGH-STAKES: When stakes are real (security, payments, eval-of-evals, agent quality scoring), use the blind-evaluator agent — never self-score. Self-scoring inflates results ~36% structurally; procedural separation ("I won't peek") does not fix it.
TRACK PASS@K: pass@1 (first attempt), pass@3 (within 3 attempts). Target pass@3 > 90%.
REGRESSION BEFORE SHIP: Every change must pass existing evals before merge.
FAST EVALS GET RUN: Slow evals get skipped. Keep evaluation fast. </core_principles>

1. DEFINE: Write eval criteria before implementation. (gate: criteria exist in writing before any code) 2. IMPLEMENT: Code to pass defined evals. 3. EVALUATE: Run evals, record pass@k. (gate: pass@3 > 90% for capability; pass^3 = 100% for regression) 4. REPORT: Document results in eval report format. See reference for template.

<blind_evaluation_protocol> Use when implementing agent would otherwise score its own output (high-stakes: security, payments, agent quality):

Spawn agents/blind-evaluator.md as a fresh agent.
Pass ONLY: problem statement, rubric (3-7 criteria with PASS conditions + weights), artifact path.
Do NOT pass: implementer's checkpoint, summary, commit message, prompt, or expected solution.
(gate: blind evaluator runs contamination check — if forbidden inputs detected, returns INVALID; clean inputs and retry)
(gate: confidence < 0.7 from blind evaluator → escalate to human grader)

Two-phase eval protocol:

Run 1: implementing agent solves cold, no eval feedback. Blind evaluator scores. This is the externally-reportable number.
Run 2: implementing agent gets Run 1 score + rubric breakdown, then optimizes. For iteration only. </blind_evaluation_protocol>

pass@k: "At least one success in k attempts" - pass@1: First attempt success rate - pass@3: Success within 3 attempts (typical target: > 90%)

pass^k: "All k trials succeed"

pass^3: 3 consecutive successes
Use for critical paths (auth, payments)

See reference for calculation formula and worked examples.

<grader_selection>

Code-based (preferred): grep, test suite, build, type-check — deterministic, fast.
Model-based: for open-ended outputs that can't be checked deterministically. Run multiple times, take majority.
Human: required for security-sensitive changes, UX evaluation, legal/compliance.

See reference for full grader templates and examples. </grader_selection>

<anti_patterns> Writing evals after implementation tests existing bugs, not requirements. Model-based grading is slow and probabilistic. Prefer code graders. Every change must pass regression evals. No exceptions. Evals that take > 30s get skipped. Keep them fast. Track pass@k over time. Declining reliability is a signal. For any user-facing or high-stakes eval, the implementing agent scoring its own work inflates results ~36%. Spawn blind-evaluator instead. Evaluating against a codebase that already contains the canonical solution = answer key in the eval set. Use pre-merge snapshots or a separate fixture. Greenfield tickets in the golden eval set collapse to self=10, blind=3. Greenfields are not evaluable as solved tasks — exclude them from the dataset. Optimizing how much context the evaluator gets before establishing a baseline score = can't distinguish signal from noise. Run minimal-context baseline first, then test additions one at a time. </anti_patterns>

<on_complete> agentdb write-end '{"skill":"eval","eval_type":"capability|regression","pass_at_1":"<X%>","pass_at_3":"<Y%>","failures":[""]}'

Record eval type, pass rates, and any failures for future reference. </on_complete>

المزيد من هذا المستودع

نفس المستودع

build

ariaxhan/kernel-claude

Solution exploration and implementation. Generate 2-3 approaches, pick simplest. Never implement first idea. Triggers: build, implement, create, feature, add.

2026-06-2611

debug

ariaxhan/kernel-claude

Systematic debugging methodology based on Zeller's scientific method. Reproduce, hypothesize, predict, test, isolate via binary search, fix root cause. Every fix gets a regression test. Triggers: bug, error, fix, broken, not working, fails, crashed, unexpected, stack trace, regression, exception.

2026-06-2611

testing

ariaxhan/kernel-claude

Testing methodology and strategy. Tests prove behavior, not implementation. Triggers: test, tests, coverage, assertion, mock, fixture, spec, verify.

2026-06-2511

git

ariaxhan/kernel-claude

Git workflow and version control best practices. Atomic commits, conventional messages, branch strategies, merge protocols. Triggers: commit, branch, merge, rebase, git, push, pull, PR, version control.

2026-06-2411

security

ariaxhan/kernel-claude

Security best practices and vulnerability prevention. Input validation, authentication, secrets management, OWASP top 10. Triggers: security, auth, authentication, secrets, credentials, vulnerability, injection, XSS, CSRF, rate-limit.

2026-06-2211

api

ariaxhan/kernel-claude

REST API design patterns. Resource naming, status codes, pagination, error responses, versioning. Triggers: api, rest, endpoint, route, http, status-code, pagination.

2026-06-1811

name	eval
description	Eval-Driven Development (EDD) for AI workflows. pass@k metrics, capability evals, regression evals. Triggers: eval, edd, pass@k, capability, regression, benchmark.
allowed-tools	Read, Bash, Write, Edit, Grep, Glob

<core_principles>

DEFINE BEFORE CODE: Evals written first force clear thinking about success criteria.
CODE GRADERS > MODEL GRADERS: Deterministic checks beat probabilistic judgments.
STRUCTURAL SEPARATION FOR HIGH-STAKES: When stakes are real (security, payments, eval-of-evals, agent quality scoring), use the blind-evaluator agent — never self-score. Self-scoring inflates results ~36% structurally; procedural separation ("I won't peek") does not fix it.
TRACK PASS@K: pass@1 (first attempt), pass@3 (within 3 attempts). Target pass@3 > 90%.
REGRESSION BEFORE SHIP: Every change must pass existing evals before merge.
FAST EVALS GET RUN: Slow evals get skipped. Keep evaluation fast. </core_principles>

<blind_evaluation_protocol> Use when implementing agent would otherwise score its own output (high-stakes: security, payments, agent quality):

Spawn agents/blind-evaluator.md as a fresh agent.
Pass ONLY: problem statement, rubric (3-7 criteria with PASS conditions + weights), artifact path.
Do NOT pass: implementer's checkpoint, summary, commit message, prompt, or expected solution.
(gate: blind evaluator runs contamination check — if forbidden inputs detected, returns INVALID; clean inputs and retry)
(gate: confidence < 0.7 from blind evaluator → escalate to human grader)

Two-phase eval protocol:

Run 1: implementing agent solves cold, no eval feedback. Blind evaluator scores. This is the externally-reportable number.
Run 2: implementing agent gets Run 1 score + rubric breakdown, then optimizes. For iteration only. </blind_evaluation_protocol>

pass@k: "At least one success in k attempts" - pass@1: First attempt success rate - pass@3: Success within 3 attempts (typical target: > 90%)

pass^k: "All k trials succeed"

pass^3: 3 consecutive successes
Use for critical paths (auth, payments)

See reference for calculation formula and worked examples.

<grader_selection>

Code-based (preferred): grep, test suite, build, type-check — deterministic, fast.
Model-based: for open-ended outputs that can't be checked deterministically. Run multiple times, take majority.
Human: required for security-sensitive changes, UX evaluation, legal/compliance.

See reference for full grader templates and examples. </grader_selection>

<on_complete> agentdb write-end '{"skill":"eval","eval_type":"capability|regression","pass_at_1":"<X%>","pass_at_3":"<Y%>","failures":[""]}'

Record eval type, pass rates, and any failures for future reference. </on_complete>