eval-harness

Name: Eval Harness
Author: majiayu000

Evaluation-driven development: quantify code generation quality with pass@k / pass^k metrics, scored automatically by graders.

stars:20

forks:2

updated: May 22, 2026, 12:10

SKILL.md

name	eval-harness
description	Evaluation-driven development: quantify code generation quality with pass@k / pass^k metrics, scored automatically by graders.

Eval Harness

Overview

Evaluation-driven development asks not only "does the code run?" but also quantifies "how good is the code?".

When to Activate

A guard, hook, workflow, or skill needs measurable repair/regression evidence before adoption.
A user asks whether an agent workflow is actually improving quality rather than only passing one example.
A change affects scoring, grading, or benchmark thresholds.

Red Flags

Only one hand-picked success case exists and there is no held-out or unrelated task coverage.
A probabilistic grader is used where a deterministic build, test, lint, or coverage check is available.
pass@k improves while pass^k or unrelated-task behavior regresses.

Checklist

Define deterministic checks before adding model-judged grading.
Record target and unrelated scenarios with before/after outcomes.
Report pass@k, pass^k, and regression counts separately.
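
The last two checklist items can be sketched as a small before/after comparison. This is a minimal illustration rather than part of the skill: the `Outcome` record, the scenario names, and the "target"/"unrelated" labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One scenario's result before and after a change (hypothetical record)."""
    scenario: str
    kind: str            # "target" or "unrelated"
    passed_before: bool
    passed_after: bool


def summarize(outcomes):
    """Count repairs and regressions separately for each scenario kind."""
    summary = {}
    for o in outcomes:
        counts = summary.setdefault(
            o.kind, {"repaired": 0, "regressed": 0, "unchanged": 0}
        )
        if not o.passed_before and o.passed_after:
            counts["repaired"] += 1
        elif o.passed_before and not o.passed_after:
            counts["regressed"] += 1
        else:
            counts["unchanged"] += 1
    return summary


outcomes = [
    Outcome("fix-null-check", "target", False, True),
    Outcome("fix-retry-loop", "target", False, True),
    Outcome("format-readme", "unrelated", True, True),
    Outcome("rename-module", "unrelated", True, False),
]
report = summarize(outcomes)
for kind, counts in report.items():
    print(kind, counts)
```

In this made-up data both target scenarios are repaired, yet one unrelated scenario regresses. Reporting the counts separately keeps that regression visible instead of letting an aggregate score hide it.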

Core metrics

pass@k (single-task success rate)

The probability that at least one of k generated candidate solutions passes
Used to evaluate the completion quality of a single task
Target: pass@1 > 80%
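
One common way to estimate pass@k from data is the unbiased estimator introduced with the HumanEval benchmark: sample n ≥ k candidates, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). The skill does not say which estimator it uses, so the following is only a sketch, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled candidates of which c passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random subset of k candidates contains no passing candidate.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 candidates sampled for one task, 8 of them passed the graders.
print(pass_at_k(n=10, c=8, k=1))  # 0.8
print(pass_at_k(n=10, c=8, k=3))  # 1.0
```

With k = 1 the estimate reduces to the plain pass fraction c / n, which is the quantity the pass@1 target refers to.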

pass^k (consecutive success rate)

The probability that all k consecutive tasks pass in a single run
Used to evaluate overall workflow reliability
Target: pass^5 > 50% (all 5 consecutive tasks pass in one go)
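
Following the definition above, pass^k can be estimated empirically as the fraction of runs in which all k consecutive tasks pass. The sketch below is illustrative and its function names are made up here. It also shows why pass^5 > 50% is a demanding target, under the added assumption that tasks pass independently with the same probability p: the target then needs p above roughly 0.87.

```python
def pass_hat_k(runs):
    """Fraction of runs in which every one of the k consecutive tasks passed.

    `runs` is a list of runs; each run is a list of k booleans, one per task.
    """
    if not runs:
        raise ValueError("need at least one run")
    return sum(all(run) for run in runs) / len(runs)


def pass_hat_k_independent(p: float, k: int) -> float:
    """pass^k if each task passes independently with probability p (an assumption)."""
    return p ** k


runs = [
    [True, True, True, True, True],
    [True, True, False, True, True],
    [True, True, True, True, True],
    [True, True, True, True, False],
]
print(pass_hat_k(runs))                            # 0.5
print(round(pass_hat_k_independent(0.80, 5), 3))   # 0.328
print(round(pass_hat_k_independent(0.87, 5), 3))   # 0.498
```

Under that independence assumption, a workflow that just meets an 80% per-task pass rate would finish five tasks in a row only about a third of the time, which is why the two metrics are worth tracking separately.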

Grader types

Code-based graders (deterministic)

Grader	What it checks	Pass condition
Compilation check	Whether the code compiles and type checks pass	Zero errors
Test check	Whether all tests pass	All green
Lint check	Whether the code style conforms to the project's rules	Zero warnings (or only allowed warnings)
Coverage check	Whether test coverage meets the threshold	≥ 80%
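
A deterministic grader of this kind can be as simple as running a command and checking its exit status. The sketch below is an assumption-laden illustration: `CheckResult`, `run_check`, and `coverage_ok` are hypothetical names, and the commands are placeholders that call the Python interpreter so the example runs anywhere.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_check(name, command):
    """Run one command; the check passes only if it exits with status 0."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return CheckResult(name, proc.returncode == 0, proc.stdout + proc.stderr)


def coverage_ok(percent, threshold=80.0):
    """Coverage passes when it reaches the threshold from the table (>= 80%)."""
    return percent >= threshold


# Placeholder commands; a real project would substitute its own compiler,
# test runner, and linter invocations.
checks = [
    ("compile", [sys.executable, "-c", "compile('x = 1', '<demo>', 'exec')"]),
    ("test", [sys.executable, "-c", "assert 1 + 1 == 2"]),
    ("lint", [sys.executable, "-c", "raise SystemExit(1)"]),  # simulated failure
]
results = [run_check(name, command) for name, command in checks]
for r in results:
    print(f"{r.name}: {'pass' if r.passed else 'fail'}")
print("coverage:", "pass" if coverage_ok(83.5) else "fail")
```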

Model-based graders (probabilistic)

Grader	What it checks	Scoring
Code review	Code quality, readability, security	0-10 points
Requirements matching	Whether the implementation meets the requirements	Match score from 0 to 1
Architecture evaluation	Whether the design is reasonable	0-10 points
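
Because the three model-based scores use different scales, it can help to normalize them before comparing them. The sketch below makes no model call and only handles scores that have already been produced. `ModelGrades`, `failing_dimensions`, and the 0.7 cut-off are hypothetical, since the skill defines no pass threshold for these graders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelGrades:
    code_review: float         # 0-10, per the table above
    requirements_match: float  # 0-1, per the table above
    architecture: float        # 0-10, per the table above

    def normalized(self):
        """Put all three scores on a common 0-1 scale."""
        return {
            "code_review": self.code_review / 10,
            "requirements_match": self.requirements_match,
            "architecture": self.architecture / 10,
        }


def failing_dimensions(grades, minimum=0.7):
    """Return the dimensions whose normalized score is below `minimum`.

    The 0.7 default is a hypothetical threshold chosen for this example.
    """
    return sorted(
        name for name, score in grades.normalized().items() if score < minimum
    )


grades = ModelGrades(code_review=8, requirements_match=0.9, architecture=6)
print(grades.normalized())
print(failing_dimensions(grades))  # ['architecture']
```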

Usage process

1. Define evaluation criteria
- Extract verifiable pass conditions from the requirements
- Choose the right combination of graders
2. Run the evaluation
- Code-based graders first (fast, deterministic)
- Model-based graders as a supplement (deeper, probabilistic)
3. Analyze the results
- pass@1 < 80% → Unclear requirements or a problematic implementation strategy
- pass^5 < 50% → A systemic problem with the workflow
4. Improve
- Adjust strategies based on failure patterns
- Update grader rules
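
The ordering and the two thresholds in this process can be sketched as follows. The grader callables are stand-ins rather than real build tools or model calls, and skipping the model-based graders after a deterministic failure is a design choice of this example, not something the skill prescribes.

```python
def evaluate(candidate, deterministic_graders, model_graders):
    """Run deterministic graders first; consult model graders only if all pass."""
    for name, check in deterministic_graders:
        if not check(candidate):
            return {"passed": False, "failed_at": name, "model_scores": {}}
    scores = {name: grade(candidate) for name, grade in model_graders}
    return {"passed": True, "failed_at": None, "model_scores": scores}


def diagnose(pass_at_1, pass_hat_5):
    """Map the two metrics to the interpretations listed in the process above."""
    findings = []
    if pass_at_1 < 0.80:
        findings.append("unclear requirements or a problematic implementation strategy")
    if pass_hat_5 < 0.50:
        findings.append("systemic problem with the workflow")
    return findings


# Stand-in graders: plain callables that inspect the candidate's source text.
deterministic = [
    ("compile", lambda code: "syntax error" not in code),
    ("test", lambda code: "return" in code),
]
model = [("code_review", lambda code: 8.0)]

print(evaluate("def f(): return 1", deterministic, model))
print(evaluate("def f(): pass", deterministic, model))
print(diagnose(pass_at_1=0.85, pass_hat_5=0.40))
```

Running the cheap deterministic checks first means the slower, probabilistic graders are spent only on candidates that already build and pass their tests.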

VibeGuard Integration

Code-based graders can reuse guard script output (for example, guards/<lang>/check_*.sh)
The security grader refers to vibeguard/rules/security.md
The quality grader refers to vibeguard/rules/universal.md
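
As a rough sketch of the first point, a code-based grader could discover the guard scripts by the path pattern above and run each one. The helper names are hypothetical, and treating exit status 0 as a pass is an assumption of this example. The guard scripts' actual output format and exit conventions are not described here, so check them in the repository before relying on this.

```python
import subprocess
from pathlib import Path


def find_guard_scripts(repo_root, lang):
    """List scripts matching guards/<lang>/check_*.sh under repo_root, sorted."""
    return sorted(Path(repo_root, "guards", lang).glob("check_*.sh"))


def run_guard(script):
    """Run one guard script with sh and return (passed, combined output).

    Treating exit status 0 as "pass" is an assumption of this sketch.
    """
    proc = subprocess.run(["sh", str(script)], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


if __name__ == "__main__":
    # "." and "python" are placeholders for a checkout path and a language directory.
    for script in find_guard_scripts(".", "python"):
        passed, _output = run_guard(script)
        print(f"{script.name}: {'pass' if passed else 'fail'}")
```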

related-skills.json

same repository

agentsmd-audit.md

from "majiayu000/vibeguard"

Audit AGENTS.md / CLAUDE.md against the five high-leverage patterns (progressive disclosure, procedural workflows, decision tables, production code examples, domain rules with concrete alternatives). Reports per-pattern coverage, anti-patterns, and a prioritized fix list.

2026-05-22, stars: 20

awk-posix-compat.md

from "majiayu000/vibeguard"

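Guide to POSIX compatibility for awk in shell scripts. Use when: writing or reviewing shell scripts that contain awk, especially when they must run on both macOS and Linux. Trigger words: awk, BSD awk, POSIX regex, [[:space:]], guard scripts, cross-platform shell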

2026-05-22, stars: 20

iterative-retrieval.md

from "majiayu000/vibeguard"

Iterative retrieval — 4-stage loop (DISPATCH→EVALUATE→REFINE→LOOP) to pinpoint relevant information in the code base. Up to 3 rounds.

2026-05-22, stars: 20

strategic-compact.md

from "majiayu000/vibeguard"

Strategic compression — Manual compression of contexts at logical boundaries rather than arbitrary automatic compression. Key decisions and constraints are preserved and intermediate exploration is discarded.

2026-05-22, stars: 20

trajectory-review.md

from "majiayu000/vibeguard"

Post-hoc diagnosis of a failed agent trajectory. Classifies the first unrecoverable step into one of nine failure categories (plan adherence, hallucinated information, invalid tool call, misread tool output, intent–plan mismatch, under-specified intent, unsupported intent, guardrail trigger, system failure) and produces an evidence-backed root-cause report.

2026-05-22, stars: 20

vibeguard.md

from "majiayu000/vibeguard"

Anti-hallucination specification for AI-assisted development. Covers the seven-layer defense architecture, quantitative metrics, execution templates, and practical cases. Used for code review, task start-up checks, and weekly reviews.

2026-05-22, stars: 20

package.json

"author": "majiayu000"

"repository": "majiayu000/vibeguard"
