con un clic
eval-harness
// Assessment-driven development — Quantify code generation quality with pass@k / pass^k metrics, automatically scored by Grader.
// Assessment-driven development — Quantify code generation quality with pass@k / pass^k metrics, automatically scored by Grader.
Audit AGENTS.md / CLAUDE.md against the five high-leverage patterns (progressive disclosure, procedural workflows, decision tables, production code examples, domain rules with concrete alternatives). Reports per-pattern coverage, anti-patterns, and a prioritized fix list.
Shell 脚本中 awk 的 POSIX 兼容性指南。 Use when: 编写或审查包含 awk 的 shell 脚本, 尤其是需要 macOS + Linux 跨平台运行的场景。 触发词: awk, BSD awk, POSIX regex, [[:space:]], guard 脚本, 跨平台 shell
Iterative retrieval — 4-stage loop (DISPATCH→EVALUATE→REFINE→LOOP) to pinpoint relevant information in the code base. Up to 3 rounds.
Strategic compression — Manual compression of contexts at logical boundaries rather than arbitrary automatic compression. Key decisions and constraints are preserved and intermediate exploration is discarded.
Post-hoc diagnosis of a failed agent trajectory. Classifies the first unrecoverable step into one of nine failure categories (plan adherence, hallucinated information, invalid tool call, misread tool output, intent–plan mismatch, under-specified intent, unsupported intent, guardrail trigger, system failure) and produces an evidence-backed root-cause report.
AI-assisted development of anti-hallucination specifications. Check out the seven-layer defense architecture, quantitative indicators, execution templates and practical cases. Used for code review, task startup inspection, and weekly review.
| name | eval-harness |
| description | Assessment-driven development — Quantify code generation quality with pass@k / pass^k metrics, automatically scored by Grader. |
Evaluation-driven development: Not just "can the code run", but quantify "how good is the code".
| Grader | Check content | Pass conditions |
|---|---|---|
| Compilation check | Whether the code can be compiled / type check passed | Zero errors |
| Test check | Whether all tests passed | Full green |
| Lint check | Whether the code style conforms to the specification | Zero warnings (or only allowed warnings) |
| Coverage check | Check whether the test coverage reaches the standard | ≥ 80% |
| Grader | Check content | How to grade |
|---|---|---|
| Code review | Code quality, readability, security | 0-10 points |
| Requirements matching | Whether the implementation meets the requirements | 0-1 matching degree |
| Architecture evaluation | Is the design reasonable | 0-10 points |
Define Evaluation Criteria
Run the evaluation
Analysis results
Improvements
guards/<lang>/check_*.sh)vibeguard/rules/security.mdvibeguard/rules/universal.md