eval-runner

Updated: March 10, 2026, 10:04

Run evaluation suites against blueprint-dev skills. Benchmarks skill performance, tracks pass rates, and validates skill quality after changes.

Source

dlabs

dlabs/claude-marketplace

Eval Runner

Run evaluation suites to benchmark and validate blueprint-dev skills. Evals test that skills produce the expected behavior: correct tool usage, proper classification, and adherence to constraints.

How It Works

1. Read eval definitions from ${CLAUDE_SKILL_DIR}/../evals/
2. For each eval suite, load the eval.yaml config
3. For each prompt in the suite, spawn a test agent with the skill loaded
4. Collect responses and check against criteria
5. Produce a benchmark report
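
The spawn-and-check part of this flow can be sketched as follows. This is a minimal illustration, not the runner's actual code: `PromptCase`, `run_suite`, and `fake_agent` are hypothetical names, and the agent is a stub that only echoes its input.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PromptCase:
    name: str
    text: str
    # Each check is a predicate over the response; real criteria come from eval.yaml.
    checks: list[Callable[[str], bool]] = field(default_factory=list)


def run_suite(skill: str, cases: list[PromptCase],
              run_agent: Callable[[str, str], str]) -> dict[str, bool]:
    """Run every prompt against the skill and record whether all checks passed."""
    results = {}
    for case in cases:
        response = run_agent(skill, case.text)
        results[case.name] = all(check(response) for check in case.checks)
    return results


def fake_agent(skill: str, prompt: str) -> str:
    # Stand-in for spawning a real test agent with the skill loaded.
    return f"[{skill}] handled: {prompt}"


cases = [PromptCase("echo", "open the page", [lambda r: "open the page" in r])]
print(run_suite("agent-browser", cases, fake_agent))
```

Passing the agent in as a callable keeps the loop testable without a live model.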

Eval Structure

evals/
  config.yaml              # Global config (priority, classification)
  <skill-name>/
    eval.yaml              # Eval definition with prompts + criteria
    prompts/
      <test-name>.md       # Individual test prompts
    fixtures/              # Optional test data
      <file>
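
As a rough sketch of how a runner might walk this layout, the snippet below maps each skill directory containing an `eval.yaml` to its prompt files. The function name `discover_suites` is hypothetical, and the temporary tree exists only so the example runs anywhere.

```python
from pathlib import Path
import tempfile


def discover_suites(evals_dir: Path) -> dict[str, list[str]]:
    """Map each skill directory that has an eval.yaml to its sorted prompt file names."""
    suites = {}
    for eval_file in sorted(evals_dir.glob("*/eval.yaml")):
        skill_dir = eval_file.parent
        prompts = sorted(p.name for p in (skill_dir / "prompts").glob("*.md"))
        suites[skill_dir.name] = prompts
    return suites


# Build a throwaway tree that mirrors the layout above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "evals"
    (root / "agent-browser" / "prompts").mkdir(parents=True)
    (root / "config.yaml").write_text("# global config\n")
    (root / "agent-browser" / "eval.yaml").write_text("name: agent-browser\n")
    (root / "agent-browser" / "prompts" / "fill-form.md").write_text("Fill the form.\n")
    found = discover_suites(root)
    print(found)
```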

Running Evals

Single Skill

/bp:eval agent-browser

All Skills

/bp:eval all

By Priority

/bp:eval --priority high
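
The three invocation modes amount to a selection over the known suites. The sketch below assumes priorities are available as a simple skill-to-priority mapping (the source only says `config.yaml` holds priority settings). `select_suites` and the `ab-testing: medium` entry are illustrative, not taken from the plugin.

```python
from typing import Optional


def select_suites(priorities: dict[str, str],
                  target: Optional[str] = None,
                  priority: Optional[str] = None) -> list[str]:
    """Pick suites by priority, by a single skill name, or all of them."""
    if priority is not None:
        return sorted(s for s, p in priorities.items() if p == priority)
    if target is None or target == "all":
        return sorted(priorities)
    if target not in priorities:
        raise KeyError(f"no eval suite for skill: {target}")
    return [target]


# Hypothetical priority table; real values would come from config.yaml.
priorities = {"agent-browser": "high", "git-worktree": "high", "ab-testing": "medium"}
print(select_suites(priorities, target="agent-browser"))
print(select_suites(priorities, target="all"))
print(select_suites(priorities, priority="high"))
```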

Eval Definition Format

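Each eval.yaml names the skill under test, its classification, and a list of prompts with their pass criteria. In the template below, `|` separates the allowed values; a real file uses exactly one of them: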

name: skill-name
skill: skill-name
classification: capability_uplift | encoded_preference

prompts:
  - name: test-name
    file: prompts/test-name.md
    fixtures:              # Optional
      - fixtures/file.json
    criteria:
      - type: contains | not_contains | matches_regex
        value: "expected string or pattern"
        reason: "Why this matters"
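
To make the template concrete, here is a hedged sketch that validates a definition after it has been parsed into a Python dict. Which keys are mandatory is an assumption, `validate_eval` is a hypothetical helper, and the `reason` text in the example is invented for illustration.

```python
ALLOWED_CLASSIFICATIONS = {"capability_uplift", "encoded_preference"}
ALLOWED_CRITERION_TYPES = {"contains", "not_contains", "matches_regex"}


def validate_eval(definition: dict) -> list[str]:
    """Return human-readable problems; an empty list means the definition looks valid."""
    problems = []
    # Assumption: all four top-level keys are required.
    for key in ("name", "skill", "classification", "prompts"):
        if key not in definition:
            problems.append(f"missing top-level key: {key}")
    classification = definition.get("classification")
    if classification is not None and classification not in ALLOWED_CLASSIFICATIONS:
        problems.append(f"unknown classification: {classification!r}")
    for prompt in definition.get("prompts", []):
        label = prompt.get("name", "<unnamed>")
        if "file" not in prompt:
            problems.append(f"{label}: missing prompt file")
        for criterion in prompt.get("criteria", []):
            if criterion.get("type") not in ALLOWED_CRITERION_TYPES:
                problems.append(f"{label}: unknown criterion type {criterion.get('type')!r}")
            if "value" not in criterion:
                problems.append(f"{label}: criterion has no value")
    return problems


example = {
    "name": "git-worktree",
    "skill": "git-worktree",
    "classification": "capability_uplift",
    "prompts": [{
        "name": "anti-raw-git",
        "file": "prompts/anti-raw-git.md",
        "criteria": [{
            "type": "not_contains",
            "value": "git worktree add",
            "reason": "Illustrative reason text",
        }],
    }],
}
print(validate_eval(example))
```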

Criteria Types

Type	Description
`contains`	Response must contain the value
`not_contains`	Response must NOT contain the value
`matches_regex`	Response must match the regex pattern
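
The three criterion types map naturally onto substring and regex checks. The sketch below is one plausible reading, not the runner's confirmed behavior. In particular, it assumes `matches_regex` succeeds on a match anywhere in the response, and `check_criterion` is a hypothetical name.

```python
import re


def check_criterion(response: str, criterion: dict) -> bool:
    """Evaluate one criterion dict (with 'type' and 'value') against a response."""
    kind, value = criterion["type"], criterion["value"]
    if kind == "contains":
        return value in response
    if kind == "not_contains":
        return value not in response
    if kind == "matches_regex":
        # Assumption: a match anywhere counts (re.search rather than re.fullmatch).
        return re.search(value, response) is not None
    raise ValueError(f"unknown criterion type: {kind}")


response = "Use the worktree manager script to create a new worktree."
print(check_criterion(response, {"type": "contains", "value": "worktree manager"}))
print(check_criterion(response, {"type": "not_contains", "value": "git worktree add"}))
print(check_criterion(response, {"type": "matches_regex", "value": r"create a \w+ worktree"}))
```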

Execution Strategy

Parallel by default: Each prompt runs as an independent agent
Skill loaded: The target skill is loaded into the test agent's context
No side effects: Evals should not modify files or run destructive commands
Timeout: 30 seconds per prompt (configurable in config.yaml)
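
Parallel execution with a per-prompt timeout could look like the following asyncio sketch. The agent is simulated with a sleep, and the timeouts are shortened so the example finishes quickly; the documented default is 30 seconds. `run_prompt` and `run_all` are hypothetical names.

```python
import asyncio


async def run_prompt(name: str, delay: float) -> str:
    # Stand-in for a test agent; `delay` simulates how long it takes to respond.
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_all(prompts: dict[str, float], timeout: float) -> dict[str, str]:
    """Run every prompt concurrently, cutting off any that exceed the timeout."""
    async def guarded(name: str, delay: float) -> str:
        try:
            return await asyncio.wait_for(run_prompt(name, delay), timeout)
        except asyncio.TimeoutError:
            return f"{name}: timed out"

    results = await asyncio.gather(*(guarded(n, d) for n, d in prompts.items()))
    return dict(zip(prompts, results))


results = asyncio.run(run_all({"fill-form": 0.01, "slow-case": 0.5}, timeout=0.1))
print(results)
```

Because each prompt is guarded individually, one slow agent does not delay or fail the others.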

Report Format

=== Blueprint-Dev Eval Report ===

agent-browser (Capability Uplift - HIGH)
  [PASS] navigate-and-click (3/3 criteria)
  [PASS] fill-form (3/3 criteria)
  [PASS] anti-playwright (4/4 criteria)

git-worktree (Capability Uplift - HIGH)
  [PASS] create-worktree (3/3 criteria)
  [FAIL] anti-raw-git (2/3 criteria)
    - FAILED: not_contains "git worktree add"

Summary: 4/5 passed (80%)
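
A report in this shape can be produced from per-test criteria counts. The sketch below assumes a test passes only when every criterion is met, which matches the sample above where 2/3 is a FAIL. `render_report` is a hypothetical helper, and it omits the per-criterion failure detail lines.

```python
def render_report(suites: dict[str, dict[str, tuple[int, int]]]) -> str:
    """suites maps a suite heading to {test name: (criteria met, criteria total)}."""
    lines = ["=== Blueprint-Dev Eval Report ===", ""]
    passed = total = 0
    for heading, tests in suites.items():
        lines.append(heading)
        for test, (met, count) in tests.items():
            ok = met == count  # Assumption: all criteria must be met to pass.
            passed += ok
            total += 1
            lines.append(f"  [{'PASS' if ok else 'FAIL'}] {test} ({met}/{count} criteria)")
        lines.append("")
    percent = round(100 * passed / total) if total else 0
    lines.append(f"Summary: {passed}/{total} passed ({percent}%)")
    return "\n".join(lines)


report = render_report({
    "agent-browser (Capability Uplift - HIGH)": {
        "navigate-and-click": (3, 3),
        "fill-form": (3, 3),
        "anti-playwright": (4, 4),
    },
    "git-worktree (Capability Uplift - HIGH)": {
        "create-worktree": (3, 3),
        "anti-raw-git": (2, 3),
    },
})
print(report)
```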

Skill Classifications

Capability Uplift: Teaches techniques Claude doesn't know natively. Evals focus on correct tool/pattern usage. Strict pass criteria.
Encoded Preference: Encodes team/project preferences. Evals focus on classification accuracy and compliance. Standard pass criteria.

When to Run Evals

After modifying a skill's SKILL.md
After updating skill descriptions
Before version bumps
As part of CI for the plugin repository

name	eval-runner
description	Run evaluation suites against blueprint-dev skills. Benchmarks skill performance, tracks pass rates, and validates skill quality after changes.
disable-model-invocation	true
argument-hint	[skill-name|all]
