eval-runner
Run evaluation suites against blueprint-dev skills. Benchmarks skill performance, tracks pass rates, and validates skill quality after changes.
| Field | Value |
|---|---|
| name | eval-runner |
| description | Run evaluation suites against blueprint-dev skills. Benchmarks skill performance, tracks pass rates, and validates skill quality after changes. |
| disable-model-invocation | true |
| argument-hint | `[skill-name\|all]` |
Run evaluation suites to benchmark and validate blueprint-dev skills. Evals test that skills produce the expected behavior — correct tool usage, proper classification, adherence to constraints.
Evals live under `${CLAUDE_SKILL_DIR}/../evals/`, and each skill has its own `eval.yaml` config:

    evals/
      config.yaml              # Global config (priority, classification)
      <skill-name>/
        eval.yaml              # Eval definition with prompts + criteria
        prompts/
          <test-name>.md       # Individual test prompts
        fixtures/              # Optional test data
          <file>
Pass a skill name, `all`, or a priority filter:

    /bp:eval agent-browser
    /bp:eval all
    /bp:eval --priority high
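As a rough illustration of how a `[skill-name|all]` argument could map onto the `evals/` layout, here is a minimal Python sketch. The function name `select_suites` is hypothetical and is not part of blueprint-dev. It assumes a suite is any sub-directory that contains an `eval.yaml`. The `--priority` filter is left out because the format of `config.yaml` is not shown here.

```python
from pathlib import Path


def select_suites(evals_dir: Path, target: str) -> list[Path]:
    """Return eval.yaml paths for one skill, or for every skill when target is "all".

    Hypothetical helper: assumes the evals/<skill-name>/eval.yaml layout.
    """
    if target == "all":
        # Only sub-directories that hold an eval.yaml count as suites,
        # so the top-level config.yaml is never picked up.
        return sorted(p for p in evals_dir.glob("*/eval.yaml") if p.is_file())

    candidate = evals_dir / target / "eval.yaml"
    if not candidate.is_file():
        raise FileNotFoundError(f"no eval suite found for skill {target!r}")
    return [candidate]
```

Raising on an unknown skill name, rather than returning an empty list, keeps a typo from looking like a run with nothing to do.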
Each `eval.yaml` defines:

    name: skill-name
    skill: skill-name
    classification: capability_uplift | encoded_preference
    prompts:
      - name: test-name
        file: prompts/test-name.md
        fixtures:  # Optional
          - fixtures/file.json
        criteria:
          - type: contains | not_contains | matches_regex
            value: "expected string or pattern"
            reason: "Why this matters"
| Type | Description |
|---|---|
| `contains` | Response must contain the value |
| `not_contains` | Response must NOT contain the value |
| `matches_regex` | Response must match the regex pattern |
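The three criterion types can be sketched as a single check function. This is an illustrative Python sketch, not the runner's actual implementation: the name `check_criterion` is hypothetical, matching is assumed to be case-sensitive, and `matches_regex` is assumed to mean a match anywhere in the response (`re.search`) rather than a full-string match.

```python
import re


def check_criterion(response: str, criterion: dict) -> bool:
    """Evaluate one criterion (a dict with "type" and "value") against a response."""
    kind = criterion["type"]
    value = criterion["value"]

    if kind == "contains":
        return value in response
    if kind == "not_contains":
        return value not in response
    if kind == "matches_regex":
        # Assumption: the pattern may match anywhere in the response.
        return re.search(value, response) is not None

    raise ValueError(f"unknown criterion type: {kind!r}")
```

For example, `check_criterion("Use the helper script.", {"type": "not_contains", "value": "git worktree add"})` returns `True`, because the forbidden string is absent from the response.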
Results are reported in this format:

    === Blueprint-Dev Eval Report ===

    agent-browser (Capability Uplift - HIGH)
      [PASS] navigate-and-click (3/3 criteria)
      [PASS] fill-form (3/3 criteria)
      [PASS] anti-playwright (4/4 criteria)

    git-worktree (Capability Uplift - HIGH)
      [PASS] create-worktree (3/3 criteria)
      [FAIL] anti-raw-git (2/3 criteria)
        - FAILED: not_contains "git worktree add"

    Summary: 4/5 passed (80%)
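The summary line can be derived from per-test criterion counts. The Python sketch below is illustrative only: `summarize` and its input shape are hypothetical, and it assumes a test passes only when every criterion passes, which matches the `2/3` test being marked `[FAIL]` in the sample report.

```python
def summarize(results: dict[str, list[tuple[str, int, int]]]) -> str:
    """Render a report from {suite header: [(test name, criteria passed, criteria total)]}."""
    lines = ["=== Blueprint-Dev Eval Report ==="]
    passed_tests = 0
    total_tests = 0

    for suite, tests in results.items():
        lines.append(suite)
        for name, ok, total in tests:
            # Assumption: a test passes only if all of its criteria pass.
            status = "PASS" if ok == total else "FAIL"
            if status == "PASS":
                passed_tests += 1
            total_tests += 1
            lines.append(f"  [{status}] {name} ({ok}/{total} criteria)")

    percent = round(100 * passed_tests / total_tests) if total_tests else 0
    lines.append(f"Summary: {passed_tests}/{total_tests} passed ({percent}%)")
    return "\n".join(lines)
```

The pass rate is counted per test, not per criterion, so one failed criterion fails the whole test.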