evaluator

Stars: 5

Forks: 3

Updated: March 29, 2026, 02:48

Grade implementation work against bead acceptance criteria using a separate judge agent. Use after subagent work passes mechanical gates, as a pre-merge check, or on-demand to evaluate existing features. The evaluator is NOT the orchestrator and NOT the implementer — it only judges. Integrates with browser-qa for runtime verification when CDT MCP is available.

Source

rbergman

rbergman/dark-matter-marketplace

SKILL.md

name	evaluator
description	Grade implementation work against bead acceptance criteria using a separate judge agent. Use after subagent work passes mechanical gates, as a pre-merge check, or on-demand to evaluate existing features. The evaluator is NOT the orchestrator and NOT the implementer — it only judges. Integrates with browser-qa for runtime verification when CDT MCP is available.

Evaluator Protocol

Separate the agent doing work from the agent judging it. This is more tractable than making one agent self-critical.

When to Invoke

The orchestrator calls the evaluator in these situations:

- Post-subagent, pre-merge: after implementation passes mechanical gates AND either:
  - Browser-qa is available (CDT MCP connected + app running), which enables runtime evaluation
  - Acceptance criteria require runtime testing ("user can...", "page shows...", "form validates...")
- On-demand: evaluate <bead-id> to test an existing feature against its criteria
- Post-merge: /dm-work:post-merge runs the evaluator against closed beads

Skip evaluator when:

XS/S tasks without acceptance criteria
Intent review returned COVERAGE: full, DRIFT: none, GAPS: none AND no browser-qa available
Bead has no acceptance criteria (report to orchestrator, don't run empty evaluation)
Task has no bead (ad-hoc work)

The intent review and evaluator have complementary scope:

Intent review checks CODE COVERAGE — does the diff contain the right changes?
Evaluator checks BEHAVIORAL CORRECTNESS — does the running app satisfy each criterion?

If no runtime testing is possible, the evaluator's value over intent review is minimal. Skip it.
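The invoke and skip conditions above can be sketched as a single decision function. This is a minimal illustration, not the orchestrator's actual code: `IntentReview`, `should_run_evaluator`, the keyword heuristic in `needs_runtime`, and the bead id used in the usage lines are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Phrases the protocol uses as examples of runtime criteria.
# Treating them as a keyword heuristic is an assumption of this sketch.
RUNTIME_HINTS = ("user can", "page shows", "form validates")


@dataclass
class IntentReview:
    """Hypothetical container for the intent review verdict."""
    coverage: str
    drift: str
    gaps: str

    def is_clean(self) -> bool:
        return (self.coverage, self.drift, self.gaps) == ("full", "none", "none")


def needs_runtime(criteria: Sequence[str]) -> bool:
    """True if any criterion reads like a runtime check."""
    return any(hint in c.lower() for c in criteria for hint in RUNTIME_HINTS)


def should_run_evaluator(
    bead_id: Optional[str],
    criteria: Sequence[str],
    gates_passed: bool,
    browser_qa_available: bool,
    intent_review: Optional[IntentReview] = None,
) -> Tuple[bool, str]:
    """Return (run?, reason) following the invoke/skip rules listed above."""
    if bead_id is None:
        return False, "task has no bead (ad-hoc work)"
    if not criteria:
        return False, "bead has no acceptance criteria; report to orchestrator"
    if not gates_passed:
        return False, "mechanical gates have not passed yet"
    if intent_review and intent_review.is_clean() and not browser_qa_available:
        return False, "intent review is clean and no browser-qa is available"
    if browser_qa_available:
        return True, "browser-qa available: runtime evaluation"
    if needs_runtime(criteria):
        return True, "criteria require runtime testing"
    return False, "no runtime testing possible; intent review covers the diff"
```

For example, `should_run_evaluator("bd-1", ["User can log in"], True, False)` returns a positive decision, while the same call with a clean `IntentReview("full", "none", "none")` is skipped.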

Evaluator Agent Template

# Use model="haiku" for code-only evaluation with simple criteria (no browser-qa)
Task(subagent_type="general-purpose", model="opus", description="Evaluate against acceptance criteria", prompt="
ROLE: Evaluator. You judge work against acceptance criteria. You do NOT implement or fix.

BEAD: <id>
ACCEPTANCE CRITERIA (from bead --design field):
<numbered list of criteria>

CODE DIFF:
<git diff output or summary of changes>

EVALUATION PROCESS:

1. Classify each criterion:
   - RUNTIME: requires browser interaction to verify (\"user can...\", \"page shows...\", \"form validates...\")
   - CODE: verifiable from code inspection (\"function exists\", \"type is correct\", \"test passes\")

2. If browser-qa available (CDT MCP connected, app running at <url>):
   - Activate dm-work:browser-qa
   - For each RUNTIME criterion: navigate, interact, assert
   - For each CODE criterion: inspect the diff

3. If browser-qa NOT available:
   - For each CODE criterion: inspect the diff
   - For each RUNTIME criterion: mark UNTESTABLE with reason
   - If ALL criteria are UNTESTABLE: return early with overall: SKIP

4. Grade each criterion: PASS / FAIL / UNTESTABLE
   - PASS: criterion is satisfied (code or runtime evidence)
   - FAIL: criterion is not satisfied (describe what's wrong)
   - UNTESTABLE: cannot verify without runtime / missing prerequisite

SKILLS: dm-work:browser-qa (if CDT MCP available)

OUTPUT FORMAT (JSON to stdout):
{
  \"bead_id\": \"<id>\",
  \"criteria_results\": [
    {
      \"criterion\": 1,
      \"text\": \"User can navigate to /settings\",
      \"type\": \"RUNTIME\",
      \"result\": \"PASS\",
      \"detail\": \"Navigated to /settings, page loads with profile form visible\"
    },
    {
      \"criterion\": 2,
      \"text\": \"Email validates client-side\",
      \"type\": \"RUNTIME\",
      \"result\": \"FAIL\",
      \"detail\": \"Entered invalid email 'notanemail', no validation error shown\"
    }
  ],
  \"overall\": \"FAIL\",
  \"pass_count\": 1,
  \"fail_count\": 1,
  \"untestable_count\": 0,
  \"summary\": \"1/2 criteria pass. Email validation missing on client side.\"
}

RULES:
- Judge ONLY against the listed acceptance criteria. Do not invent requirements.
- PASS means the criterion is satisfied, not that the code is perfect.
- Report what you observed, not what you assumed.
- If a criterion is ambiguous, grade it and note the ambiguity in detail.
- Do NOT modify code, commit, or close beads.
")
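On the orchestrator side, the evaluator's JSON can be checked for internal consistency before acting on it. The sketch below recomputes the counts and `overall` from `criteria_results` and reports any mismatch. The function names are hypothetical, and two rules are assumptions of this sketch rather than part of the protocol: any FAIL makes `overall` FAIL, and a mix of PASS and UNTESTABLE counts as PASS. Only the all-UNTESTABLE case mapping to SKIP comes from the template.

```python
import json
from collections import Counter
from typing import Dict, List

VALID_RESULTS = {"PASS", "FAIL", "UNTESTABLE"}


def summarize(criteria_results: List[dict]) -> Dict[str, object]:
    """Recompute counts and overall verdict from per-criterion grades."""
    counts = Counter(r["result"] for r in criteria_results)
    unknown = set(counts) - VALID_RESULTS
    if unknown:
        raise ValueError(f"unknown result values: {sorted(unknown)}")
    if counts["FAIL"] > 0:
        overall = "FAIL"  # assumption: a single FAIL fails the bead
    elif counts["UNTESTABLE"] == len(criteria_results):
        overall = "SKIP"  # all criteria untestable (also covers an empty list)
    else:
        overall = "PASS"  # assumption: PASS mixed with UNTESTABLE still passes
    return {
        "overall": overall,
        "pass_count": counts["PASS"],
        "fail_count": counts["FAIL"],
        "untestable_count": counts["UNTESTABLE"],
    }


def check_report(raw: str) -> List[str]:
    """Return a list of inconsistencies between the report and its own results."""
    report = json.loads(raw)
    expected = summarize(report["criteria_results"])
    return [
        f"{key}: reported {report.get(key)!r}, expected {value!r}"
        for key, value in expected.items()
        if report.get(key) != value
    ]
```

An empty list from `check_report` means the summary fields agree with the per-criterion grades; anything else is worth surfacing before the orchestrator routes the result.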

Handling Evaluator Results

The orchestrator processes evaluator output:

- overall: PASS → proceed to merge
- overall: SKIP → all criteria untestable, proceed (evaluator adds no value here)
- overall: FAIL →
  - Check fail count vs total:
    - 1-2 failures: send FAIL details back to the original subagent for a targeted fix
    - 50% or more failures: likely a spec problem; escalate to user, don't iterate
  - Check if failures are criteria bugs (criterion is impossible/ambiguous):
    - If so, update the bead criteria, don't blame the implementation

Create beads for persistent failures:

bd create --title="Eval: <failed criterion>" --type=bug --priority=2
bd dep add <new-bead> discovered-from:<parent-bead>
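If the orchestrator scripts this step, the two commands can be assembled from the failed criterion and the parent bead. The helper below is a hypothetical sketch that only builds the command strings using the flags shown above; it does not run `bd`, and it leaves the new bead id as a placeholder because how that id is obtained is not specified here.

```python
import shlex
from typing import List


def bd_commands_for_failure(
    criterion_text: str,
    parent_bead: str,
    new_bead: str = "<new-bead>",  # placeholder until the created bead's id is known
) -> List[str]:
    """Build shell-safe command strings for filing a failure bead."""
    create = [
        "bd", "create",
        f"--title=Eval: {criterion_text}",
        "--type=bug",
        "--priority=2",
    ]
    link = ["bd", "dep", "add", new_bead, f"discovered-from:{parent_bead}"]
    # shlex.join quotes arguments that contain spaces or shell metacharacters.
    return [shlex.join(create), shlex.join(link)]
```

Quoting through `shlex.join` matters because criterion text usually contains spaces and may contain quotes.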

Circuit breaker: If the same criterion fails evaluation twice after rework, escalate to user. Don't loop.
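The routing rules in this section, including the circuit breaker, can be sketched as one function. `Action` and `next_action` are hypothetical names. The protocol does not say which rule wins when 1-2 failures also amount to half the criteria, nor what to do with three or more failures below half; this sketch assumes a targeted fix in both cases and says so in the comments.

```python
from enum import Enum


class Action(Enum):
    MERGE = "proceed to merge"
    PROCEED = "proceed; evaluator added no value"
    TARGETED_FIX = "send FAIL details back to the original subagent"
    UPDATE_CRITERIA = "update the bead criteria"
    ESCALATE = "escalate to user"


def next_action(
    overall: str,
    fail_count: int,
    total: int,
    criteria_bug: bool = False,
    repeat_failure: bool = False,
) -> Action:
    """Map an evaluator verdict to the orchestrator's next step.

    repeat_failure: the same criterion already failed once and failed
    again after rework (the circuit-breaker condition).
    """
    if overall == "PASS":
        return Action.MERGE
    if overall == "SKIP":
        return Action.PROCEED
    if overall != "FAIL":
        raise ValueError(f"unexpected overall value: {overall!r}")
    if repeat_failure:
        return Action.ESCALATE  # circuit breaker: don't loop
    if criteria_bug:
        return Action.UPDATE_CRITERIA  # impossible or ambiguous criterion
    if fail_count <= 2:
        return Action.TARGETED_FIX  # assumption: small counts win over the ratio
    if total > 0 and fail_count / total >= 0.5:
        return Action.ESCALATE  # likely a spec problem
    return Action.TARGETED_FIX  # assumption: case not covered by the protocol
```

With the example report from the template (1 of 2 criteria failing), `next_action("FAIL", 1, 2)` yields `Action.TARGETED_FIX`.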

Cost and Timing

Evaluator adds ~1-2 minutes per invocation (code-only) or ~2-4 minutes (with browser-qa)
Skip aggressively when not needed (see skip conditions above)
Use haiku model for code-only evaluation if criteria are simple
Use opus for browser-qa evaluation (needs to drive CDT tools effectively)
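The model choice described above reduces to a small rule. `pick_model` is a hypothetical helper; falling back to opus for code-only evaluation with non-simple criteria is an assumption, based on opus being the default in the agent template.

```python
def pick_model(uses_browser_qa: bool, criteria_are_simple: bool) -> str:
    """Choose the evaluator model for one invocation."""
    if uses_browser_qa:
        return "opus"  # needs to drive CDT tools effectively
    if criteria_are_simple:
        return "haiku"  # code-only evaluation with simple criteria
    return "opus"  # assumption: keep the template default otherwise
```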

Platform-Specific Verification

Not all projects use browser-qa. The evaluator should adapt:

| Project type | Verification method | Evaluator behavior |
|---|---|---|
| Standard web app | browser-qa (CDT MCP) | Full runtime evaluation |
| WebGL / Canvas game | Manual screenshots + human verification | Mark runtime criteria UNTESTABLE; take screenshots if CDT available for visual reference, but can't assert on canvas content |
| Native iOS/Android | Maestro or platform-specific tools | Mark runtime criteria UNTESTABLE unless project has automated UI test tooling wired |
| CLI tool | Bash execution + output assertion | Code-only evaluation; test commands via bash, not browser |
| API / backend | curl / httpie + response assertion | Code-only for endpoints; evaluate_script or direct API calls |

When runtime verification isn't possible, the evaluator should:

Grade all code-verifiable criteria normally
Mark platform-specific criteria as UNTESTABLE with the reason and recommended verification method
Include a note: "Manual verification recommended for: [list criteria]"
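A minimal sketch of that fallback, for the two project types in the table whose runtime criteria are marked UNTESTABLE. The dictionary keys, the function name, and the choice to list criteria by number in the note are assumptions of this example; the recommended methods are taken from the table.

```python
from typing import Dict, List, Tuple

# Recommended manual verification per project type, from the table above.
# The key names are hypothetical identifiers.
RECOMMENDED_METHOD: Dict[str, str] = {
    "webgl_canvas": "manual screenshots + human verification",
    "native_mobile": "Maestro or platform-specific tools",
}


def mark_untestable(
    criteria: List[Tuple[int, str, str]],  # (number, text, "RUNTIME" or "CODE")
    project_type: str,
) -> Tuple[List[dict], str]:
    """Return UNTESTABLE entries for runtime criteria plus the manual-check note.

    CODE criteria are not touched here; they are graded normally elsewhere.
    """
    method = RECOMMENDED_METHOD[project_type]
    untestable = [
        {
            "criterion": number,
            "text": text,
            "type": kind,
            "result": "UNTESTABLE",
            "detail": f"No runtime verification available; recommended: {method}",
        }
        for number, text, kind in criteria
        if kind == "RUNTIME"
    ]
    if not untestable:
        return [], ""
    numbers = ", ".join(str(entry["criterion"]) for entry in untestable)
    return untestable, f"Manual verification recommended for: {numbers}"
```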

Integration Points

| Component | How evaluator connects |
|---|---|
| Orchestrator | Calls evaluator as Step 1.5 in post-subagent verification |
| Browser-qa | Evaluator activates browser-qa skill for standard web apps |
| Beads | Reads acceptance criteria from bead; files new beads for failures |
| Sprint contracts | Acceptance criteria in bead ARE the sprint contract |
| Post-merge review | Post-merge command uses evaluator for closed beads |
| Intent review | Complementary: intent checks code coverage, evaluator checks behavior |
