eval-builder

Name: Eval Builder
Author: brainqub3

// Build or repair task evaluators and evaluator tests with deterministic-first strategy.

Run Skill in Manus

$ git log --oneline --stat

stars:16

forks:6

updated:February 9, 2026 at 14:19

File Explorer

7 files

SKILL.md

readonly

name	eval-builder
description	Build or repair task evaluators and evaluator tests with deterministic-first strategy.
disable-model-invocation	true
allowed-tools	["Read","Edit","Bash","Glob","Grep"]

eval-builder

Use this skill when evaluator is missing, incomplete, or failing tests.

Goal

Produce:

brainqub3/tasks/<task>/evaluator.py
brainqub3/tasks/<task>/tests/test_evaluator.py
Minimal fixtures/instances for deterministic assertions
Updated task.md output contract

Strategy Priority

Deterministic programmatic checks
JSON schema + explicit constraints
Simulator/replay checks
Fuzzy but programmatic heuristics
LLM judge as last resort

Workflow

Read task.md and instances.jsonl
Identify output contract and success criteria
Implement evaluator with explicit failure taxonomy
Add tests for:
- clear pass
- clear fail
- malformed output
Run pytest brainqub3/tasks/<task>/tests -q

related-skills.json

same repository

scenario.md

from "brainqub3/agent-labs"

Create and execute scenario YAML files for architecture what-if predictions.

2026-02-1216

arena.md

from "brainqub3/agent-labs"

Prepare and run evaluator-gated SAS and MAS experiments with explicit batch control and mandatory elasticity calibration for scenario scaling.

2026-02-1116

cleanup-reset.md

from "brainqub3/agent-labs"

Reset local experiment state to a fresh slate by clearing runs, prediction outputs, scaling snapshots, and database index files under data/.

2026-02-1116

task-author.md

from "brainqub3/agent-labs"

Create or repair Brainqub3 task packages that must pass evaluator tests before runs, including both fabricated instances and user-provided data workflows.

2026-02-1016

report.md

from "brainqub3/agent-labs"

Generate markdown reports with observed metrics, predicted metrics, and architecture recommendations.

2026-02-0916

package.json