eval-runner

Stars: 1

Forks: 1

Last updated: March 10, 2026 at 10:04

Run evaluation suites against blueprint-dev skills. Benchmarks skill performance, tracks pass rates, and validates skill quality after changes.

Source: dlabs/claude-marketplace (creator: dlabs)

SKILL.md

name	eval-runner
description	Run evaluation suites against blueprint-dev skills. Benchmarks skill performance, tracks pass rates, and validates skill quality after changes.
disable-model-invocation	true
argument-hint	[skill-name|all]

Eval Runner

Run evaluation suites to benchmark and validate blueprint-dev skills. Evals test that skills produce the expected behavior: correct tool usage, proper classification, and adherence to constraints.

How It Works

1. Read eval definitions from ${CLAUDE_SKILL_DIR}/../evals/
2. For each eval suite, load the eval.yaml config
3. For each prompt in the suite, spawn a test agent with the skill loaded
4. Collect responses and check against criteria
5. Produce a benchmark report
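
The steps above can be sketched as a small loop. This is a minimal illustration, not the skill's actual implementation: `run_suite`, `run_agent`, and `check` are hypothetical names, the suite is assumed to be already parsed from eval.yaml into a dict, and each prompt's text is assumed to have been read from its file into a `text` key. A prompt is treated as passing only when every criterion is met, which matches the sample report (2/3 criteria is a FAIL).

```python
from typing import Callable

# Hypothetical signatures (not part of the skill):
#   run_agent(skill, prompt_text) -> response text
#   check(criterion, response)    -> True when the criterion is satisfied
Agent = Callable[[str, str], str]
Check = Callable[[dict, str], bool]


def run_suite(suite: dict, run_agent: Agent, check: Check) -> list[dict]:
    """Run each prompt of one parsed suite and count satisfied criteria."""
    results = []
    for prompt in suite["prompts"]:
        # One test agent per prompt, with the target skill loaded.
        response = run_agent(suite["skill"], prompt["text"])
        # Check the collected response against every criterion.
        met = sum(1 for criterion in prompt["criteria"] if check(criterion, response))
        total = len(prompt["criteria"])
        results.append(
            {"name": prompt["name"], "met": met, "total": total, "passed": met == total}
        )
    return results
```

Passing the agent and the checker in as callables keeps the loop testable with stand-ins, for example a lambda that returns a canned response.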

Eval Structure

evals/
  config.yaml              # Global config (priority, classification)
  <skill-name>/
    eval.yaml              # Eval definition with prompts + criteria
    prompts/
      <test-name>.md       # Individual test prompts
    fixtures/              # Optional test data
      <file>
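
Given the layout above, suite discovery can be sketched with the standard library alone. `discover_suites` is a hypothetical helper, and treating every subdirectory that contains an eval.yaml as one suite is an assumption drawn from the tree, not documented behavior.

```python
from pathlib import Path


def discover_suites(evals_dir: Path) -> dict[str, list[Path]]:
    """Map each skill directory that has an eval.yaml to its prompt files."""
    suites = {}
    for child in sorted(evals_dir.iterdir()):
        # config.yaml and any directory without an eval.yaml are skipped.
        if child.is_dir() and (child / "eval.yaml").is_file():
            suites[child.name] = sorted((child / "prompts").glob("*.md"))
    return suites
```

The caller passes the already resolved evals directory; how `${CLAUDE_SKILL_DIR}` is expanded is left to the host environment.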

Running Evals

Single Skill

/bp:eval agent-browser

All Skills

/bp:eval all

By Priority

/bp:eval --priority high

Eval Definition Format

Each eval.yaml defines:

name: skill-name
skill: skill-name
classification: capability_uplift | encoded_preference

prompts:
  - name: test-name
    file: prompts/test-name.md
    fixtures:              # Optional
      - fixtures/file.json
    criteria:
      - type: contains | not_contains | matches_regex
        value: "expected string or pattern"
        reason: "Why this matters"
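
A definition in this shape can be sanity-checked before any agent is spawned. The sketch below assumes the YAML has already been parsed into a dict (the standard library has no YAML parser) and that every key shown above except `fixtures` is required. Both are assumptions, and `validate_eval` is a hypothetical name.

```python
ALLOWED_CLASSIFICATIONS = {"capability_uplift", "encoded_preference"}
ALLOWED_CRITERIA = {"contains", "not_contains", "matches_regex"}


def validate_eval(definition: dict) -> list[str]:
    """Return the problems found in one parsed eval.yaml (empty if none)."""
    problems = []
    for key in ("name", "skill", "classification", "prompts"):
        if key not in definition:
            problems.append(f"missing top-level key: {key}")
    classification = definition.get("classification")
    if classification is not None and classification not in ALLOWED_CLASSIFICATIONS:
        problems.append(f"unknown classification: {classification}")
    for i, prompt in enumerate(definition.get("prompts", [])):
        # "fixtures" is optional, so it is not checked here.
        for key in ("name", "file", "criteria"):
            if key not in prompt:
                problems.append(f"prompts[{i}] missing key: {key}")
        for j, criterion in enumerate(prompt.get("criteria", [])):
            kind = criterion.get("type")
            if kind not in ALLOWED_CRITERIA:
                problems.append(f"prompts[{i}].criteria[{j}] has unknown type: {kind}")
    return problems
```

Returning a list of messages rather than raising on the first error lets a runner report every problem in a suite at once.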

Criteria Types

| Type | Description |
| --- | --- |
| `contains` | Response must contain the value |
| `not_contains` | Response must NOT contain the value |
| `matches_regex` | Response must match the regex pattern |
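
The three criteria types map onto very little code. In this sketch, `check_criterion` is a hypothetical name, matching is assumed to be case-sensitive, and "match" for `matches_regex` is assumed to mean the pattern occurs anywhere in the response (`re.search`) rather than spanning the whole response. The source does not specify either detail.

```python
import re


def check_criterion(criterion: dict, response: str) -> bool:
    """Evaluate one criterion dict (type, value) against a response string."""
    kind, value = criterion["type"], criterion["value"]
    if kind == "contains":
        return value in response
    if kind == "not_contains":
        return value not in response
    if kind == "matches_regex":
        # Assumption: the pattern may match anywhere in the response.
        return re.search(value, response) is not None
    raise ValueError(f"unknown criterion type: {kind}")
```

For example, a `not_contains` criterion with the value `git worktree add` returns False for any response that includes that command, which is the failure shown in the sample report.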

Execution Strategy

Parallel by default: Each prompt runs as an independent agent
Skill loaded: The target skill is loaded into the test agent's context
No side effects: Evals should not modify files or run destructive commands
Timeout: 30 seconds per prompt (configurable in config.yaml)
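
One way to get parallel prompts with a per-prompt timeout is a thread pool with a shared deadline. This is a sketch under assumptions: `run_prompts_parallel` and `run_agent` are hypothetical names, threads stand in for whatever mechanism actually spawns test agents, and the 30-second default mirrors the value above without reading config.yaml.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Optional


def run_prompts_parallel(
    prompts: dict[str, str],
    run_agent: Callable[[str], str],
    timeout_s: float = 30.0,
) -> dict[str, Optional[str]]:
    """Run all prompts (name -> text) concurrently; None marks a timeout."""
    if not prompts:
        return {}
    # One worker per prompt, so every prompt starts immediately.
    pool = ThreadPoolExecutor(max_workers=len(prompts))
    deadline = time.monotonic() + timeout_s
    futures = {name: pool.submit(run_agent, text) for name, text in prompts.items()}
    responses: dict[str, Optional[str]] = {}
    for name, future in futures.items():
        # All prompts started together, so they share one deadline.
        remaining = max(0.0, deadline - time.monotonic())
        try:
            responses[name] = future.result(timeout=remaining)
        except FutureTimeout:
            # The worker thread is not killed; we only stop waiting for it.
            responses[name] = None
    pool.shutdown(wait=False, cancel_futures=True)
    return responses
```

Because Python threads cannot be interrupted, a timed-out agent keeps running in the background here; a real runner would more likely use subprocesses it can terminate.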

Report Format

=== Blueprint-Dev Eval Report ===

agent-browser (Capability Uplift - HIGH)
  [PASS] navigate-and-click (3/3 criteria)
  [PASS] fill-form (3/3 criteria)
  [PASS] anti-playwright (4/4 criteria)

git-worktree (Capability Uplift - HIGH)
  [PASS] create-worktree (3/3 criteria)
  [FAIL] anti-raw-git (2/3 criteria)
    - FAILED: not_contains "git worktree add"

Summary: 4/5 passed (80%)
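
The layout of the report above can be reproduced from per-test results. This is a sketch only: `format_report` and the input dict shape (`skill`, `classification`, `priority`, `tests`, with each test carrying a criteria count and a list of failed `(type, value)` pairs) are assumptions chosen for illustration.

```python
def format_report(suites: list[dict]) -> str:
    """Render suites in the report layout and append the overall pass rate."""
    lines = ["=== Blueprint-Dev Eval Report ===", ""]
    passed = total = 0
    for suite in suites:
        lines.append(f"{suite['skill']} ({suite['classification']} - {suite['priority']})")
        for test in suite["tests"]:
            failures = test["failed"]  # list of (criterion type, value) pairs
            count = test["criteria"]   # number of criteria for this test
            status = "PASS" if not failures else "FAIL"
            lines.append(
                f"  [{status}] {test['name']} ({count - len(failures)}/{count} criteria)"
            )
            for kind, value in failures:
                lines.append(f'    - FAILED: {kind} "{value}"')
            passed += 0 if failures else 1
            total += 1
        lines.append("")
    percent = round(100 * passed / total) if total else 0
    lines.append(f"Summary: {passed}/{total} passed ({percent}%)")
    return "\n".join(lines)
```

The summary counts tests, not individual criteria, which is consistent with the sample: five tests, one failure, 4/5 passed.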

Skill Classifications

Capability Uplift: Teaches techniques Claude doesn't know natively. Evals focus on correct tool/pattern usage. Strict pass criteria.
Encoded Preference: Encodes team/project preferences. Evals focus on classification accuracy and compliance. Standard pass criteria.

When to Run Evals

After modifying a skill's SKILL.md
After updating skill descriptions
Before version bumps
As part of CI for the plugin repository
