eval-runner

Name: Eval Runner
Author: bdfinst

// Run eval fixtures against review agents and grade results. Use this after adding or modifying a review agent, to validate detection accuracy, or when the user says "run the evals", "test the agents", "check for regressions", or "how accurate is the agent".

$ git log --oneline --stat

stars:11

forks:2

updated: 2026-03-05 19:10

SKILL.md

readonly

related-skills.json

Same repository

add-agent.md

from "bdfinst/cab-killer"

Scaffold a new review agent from a description or URL. Use this whenever the user wants to add a new review agent, detect a new category of code issue, or says things like "add an agent for X", "create a reviewer for Y", "I want to check for Z in code reviews". Also use when given a URL to a coding standard or best-practices guide that should become a review agent.

2026-03-05, stars: 11

add-plugin.md

from "bdfinst/cab-killer"

Install a Claude Code plugin and register it in plugins.json so the full team can replicate the install. Use this whenever adding a new plugin to the project — it keeps plugins.json in sync with what is actually installed.

2026-03-05, stars: 11

apply-fixes.md

from "bdfinst/cab-killer"

Apply correction prompts generated by /code-review. Use this whenever the user wants to apply, fix, or action the results of a code review — phrases like "apply the fixes", "fix the issues", "apply corrections", or after /code-review has run and produced a corrections/ directory.

2026-03-05, stars: 11

code-review.md

from "bdfinst/cab-killer"

Run all enabled review agents against target files. Use this whenever the user asks for a code review, wants feedback on their code, says "review my code", "check this before I PR", "what's wrong with this", "run the agents", or has just finished implementing a feature. Use proactively before commits and pull requests.

2026-03-05, stars: 11

eval-audit.md

from "bdfinst/cab-killer"

Audit code-review agents, skills, and hooks for structural compliance. Use this when adding or modifying any agent, skill, or hook file, or for a periodic health check of the toolkit. Trigger phrases: "audit the agents", "check compliance", "validate the skills", "are the agents correct", or any time agent/skill files change.

2026-03-05, stars: 11

review-agent.md

from "bdfinst/cab-killer"

Run a single named review agent against target files. Use this when the user names a specific agent (e.g. "run security-review", "check for test issues", "run js-fp-review on this file") rather than wanting the full suite. Prefer this over /code-review when only one concern is relevant or speed matters.

2026-03-05, stars: 11

package.json

"author": "bdfinst"

"repository": "bdfinst/cab-killer"


$ useful --forSOC

Software Quality Assurance Analysts and Testers, Computer and Mathematical Occupations, 15-1253, L4

name	eval-runner
description	Run eval fixtures against review agents and grade results. Use this after adding or modifying a review agent, to validate detection accuracy, or when the user says "run the evals", "test the agents", "check for regressions", or "how accurate is the agent".
argument-hint	[--agent <name>] [--fixture <name>] [--trials <n>] [--verbose]
user-invocable	true
allowed-tools	Read, Grep, Glob, Bash(readlink *, ls *, date *, mkdir *), Skill(review-agent *)

Eval Runner

Role: orchestrator. This skill dispatches fixtures to agents and grades results — it does not review code itself.

You have been invoked with the /eval-runner skill. Run review agents against eval fixtures and grade the results.

Orchestrator constraints

Do not review code yourself. Delegate all reviews to /review-agent. Your job is dispatching and grading.
Grade deterministically. Compare agent JSON output against expected JSON using exact criteria (status match, count ranges, keyword checks). Do not apply judgment.
Minimize context per agent. Pass only the fixture file to the agent — not the expected results, not other fixtures, not prior transcripts.
Track results. Save transcripts for saturation detection. Do not modify fixtures or expected files.
Be concise. Output the report table and failure details. No narration of each fixture run — just the grades.

Parse Arguments

Arguments: $ARGUMENTS

--agent <name>: Run only the named agent (e.g., js-fp-review)
--fixture <name>: Run only the named fixture (e.g., fp-array-mutations.ts)
--trials <n>: Run each fixture N times (default: 1). Enables pass@k scoring.
--verbose: Show full agent output for each fixture
No arguments: run all agents against all applicable fixtures
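The flags above can be sketched as a small parser. This is only an illustration of the argument contract, assuming the arguments arrive as a list of tokens; the function name `parse_eval_args` is hypothetical, and the skill itself receives its input through $ARGUMENTS rather than through Python.

```python
import argparse


def parse_eval_args(argv):
    """Parse the /eval-runner flags from a list of argument tokens."""
    parser = argparse.ArgumentParser(prog="eval-runner")
    parser.add_argument("--agent", default=None, help="run only the named agent")
    parser.add_argument("--fixture", default=None, help="run only the named fixture")
    parser.add_argument("--trials", type=int, default=1, help="runs per fixture")
    parser.add_argument("--verbose", action="store_true", help="show full agent output")
    args = parser.parse_args(argv)
    # Assumption: fewer than one trial is treated as a usage error.
    if args.trials < 1:
        parser.error("--trials must be at least 1")
    return args


args = parse_eval_args(["--agent", "js-fp-review", "--trials", "3"])
print(args.agent, args.fixture, args.trials, args.verbose)
```

With no tokens at all, every filter stays unset and `trials` defaults to 1, which corresponds to running all agents against all applicable fixtures.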

Steps

1. Resolve eval corpus

Verify evals/fixtures/ exists in the plugin root. If not, error: "Cannot find eval fixtures. Are you in the cab-killer plugin directory?"

2. Load fixtures and expected results

Read all files from evals/fixtures/ and corresponding JSON from evals/expected/.

For each fixture:

Match the fixture stem (filename without extension) to its expected JSON
For directory fixtures (cs-*), the directory name is the stem
Parse applicableAgents to know which agents to run

If --agent is specified, filter to fixtures where that agent is in applicableAgents. If --fixture is specified, filter to that fixture only.
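A minimal sketch of the stem matching and filtering described above, working on in-memory data rather than the real directories. The helper names, the `(name, is_dir)` tuple shape, and the choice to skip fixtures without an expected JSON are assumptions made for illustration; `cs-order-service` is a made-up fixture name.

```python
from pathlib import PurePosixPath


def fixture_stem(name, is_dir=False):
    """Return the key used to look up a fixture's expected JSON."""
    # Directory fixtures (cs-*) use the directory name as the stem;
    # file fixtures use the filename without its extension.
    return name if is_dir else PurePosixPath(name).stem


def select_pairs(fixtures, expected_by_stem, agent=None, fixture=None):
    """Build the (fixture, agent) pairs to run, honoring both filters."""
    pairs = []
    for name, is_dir in fixtures:
        if fixture is not None and name != fixture:
            continue
        expected = expected_by_stem.get(fixture_stem(name, is_dir))
        if expected is None:
            continue  # Assumption: a fixture without expected JSON is skipped.
        for agent_name in expected.get("applicableAgents", []):
            if agent is None or agent_name == agent:
                pairs.append((name, agent_name))
    return pairs


expected_by_stem = {
    "fp-array-mutations": {"applicableAgents": ["js-fp-review"]},
    "cs-order-service": {"applicableAgents": ["security-review", "js-fp-review"]},
}
fixtures = [("fp-array-mutations.ts", False), ("cs-order-service", True)]
print(select_pairs(fixtures, expected_by_stem, agent="js-fp-review"))
```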

3. Run agents against fixtures

For each fixture/agent pair:

Invoke /review-agent <agent-name> with the fixture file/directory as the target
Parse the agent's JSON output to extract: status, issues[], summary
If running multiple trials (--trials), repeat and collect all results
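The dispatch loop can be pictured roughly as follows. The real invocation goes through /review-agent, which cannot be called from Python, so `invoke` here is a hypothetical stand-in callable that returns the agent's raw JSON text; the sketch also assumes the output is a single JSON object.

```python
import json


def parse_agent_output(raw):
    """Keep only the three fields the grader needs from the agent's JSON."""
    data = json.loads(raw)
    return {
        "status": data.get("status"),
        "issues": data.get("issues", []),
        "summary": data.get("summary", ""),
    }


def run_trials(invoke, agent, target, trials=1):
    """Call the agent `trials` times and collect the parsed outputs in order."""
    return [parse_agent_output(invoke(agent, target)) for _ in range(trials)]


def fake_invoke(agent, target):
    # Hypothetical stand-in for /review-agent, used only for this example.
    return json.dumps({"status": "fail", "issues": [], "summary": "demo", "extra": 1})


outputs = run_trials(fake_invoke, "js-fp-review", "fp-array-mutations.ts", trials=2)
print(len(outputs), outputs[0]["status"])
```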

4. Grade each result

Compare agent output against expected JSON:

Status match:

Agent status matches expectedStatus → PASS
Agent status is "skip" and the agent is not listed in the fixture's applicableAgents → PASS (correct skip)
Mismatch → FAIL

Issue count:

issues.length within issueCount.min to issueCount.max → PASS
Outside range → FAIL

Severity counts:

For each severity in expected severities, count matching issues
Count within min to max → PASS
Outside range → FAIL

Keyword checks:

For each keyword in mustMention: at least one issue message contains keyword (case-insensitive) → PASS
For each keyword in mustNotMention: no issue message contains keyword → PASS
Violation → FAIL

Each check produces PASS/FAIL. Overall fixture grade: PASS only if all checks pass.
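The checks above can be sketched as one deterministic function. The field names `expectedStatus`, `issueCount`, `mustMention`, `mustNotMention`, and `applicableAgents` come from the text; the `severities` key, the per-issue `severity` and `message` fields, case-insensitive matching for mustNotMention, and grading a correct skip on status alone are assumptions.

```python
def in_range(count, bounds):
    """True when count lies within the inclusive min..max range."""
    return bounds["min"] <= count <= bounds["max"]


def grade_result(agent_name, output, expected):
    """Grade one agent output against one expected JSON; no judgment involved."""
    issues = output.get("issues", [])
    if output.get("status") == "skip" and agent_name not in expected.get("applicableAgents", []):
        # Assumption: a correct skip is graded on status alone.
        return {"grade": "pass", "checks": {"status": "pass"}}

    messages = [str(issue.get("message", "")).lower() for issue in issues]
    checks = {"status": output.get("status") == expected.get("expectedStatus")}
    if "issueCount" in expected:
        checks["issueCount"] = in_range(len(issues), expected["issueCount"])
    if "severities" in expected:
        checks["severities"] = all(
            in_range(sum(1 for i in issues if i.get("severity") == sev), bounds)
            for sev, bounds in expected["severities"].items()
        )
    # Every required keyword must appear in at least one issue message.
    checks["mustMention"] = all(
        any(kw.lower() in msg for msg in messages) for kw in expected.get("mustMention", [])
    )
    # No forbidden keyword may appear in any issue message.
    checks["mustNotMention"] = not any(
        kw.lower() in msg for kw in expected.get("mustNotMention", []) for msg in messages
    )
    return {
        "grade": "pass" if all(checks.values()) else "fail",
        "checks": {name: "pass" if ok else "fail" for name, ok in checks.items()},
    }


expected = {
    "applicableAgents": ["js-fp-review"],
    "expectedStatus": "fail",
    "issueCount": {"min": 4, "max": 8},
    "severities": {"error": {"min": 1, "max": 8}},
    "mustMention": ["mutat"],
    "mustNotMention": ["sql"],
}
output = {
    "status": "fail",
    "issues": [
        {"severity": "error", "message": "Array mutation via push"},
        {"severity": "warning", "message": "Mutates input argument"},
    ],
    "summary": "2 issues",
}
result = grade_result("js-fp-review", output, expected)
print(result["grade"], result["checks"]["issueCount"])
```

In this example the agent reports 2 issues against an expected range of 4 to 8, so the issueCount check fails and the fixture grade is fail even though every other check passes.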

5. Compute pass@k (multi-trial)

If --trials > 1:

pass@1: fraction of fixtures that passed on the first trial
pass@k: fraction of fixtures that passed on at least one of k trials
Consistency: fraction of fixtures with identical results across all trials
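As a worked example of the three metrics, here is a minimal sketch that assumes each fixture's trials have already been reduced to a list of pass/fail booleans in trial order. It reads "identical results" as identical grades, which is an interpretation rather than something the text specifies.

```python
def pass_at_k_summary(trial_grades):
    """trial_grades maps fixture name -> list of booleans, one per trial."""
    total = len(trial_grades)
    if total == 0:
        return {"pass@1": 0.0, "pass@k": 0.0, "consistency": 0.0}
    grades = list(trial_grades.values())
    return {
        # Passed on the first trial.
        "pass@1": sum(1 for g in grades if g[0]) / total,
        # Passed on at least one of the k trials.
        "pass@k": sum(1 for g in grades if any(g)) / total,
        # Same grade on every trial (all pass or all fail).
        "consistency": sum(1 for g in grades if len(set(g)) == 1) / total,
    }


metrics = pass_at_k_summary({
    "a": [True, True, True],
    "b": [False, True, False],
    "c": [False, False, False],
})
print(metrics)
```

Fixture `b` fails its first trial but passes the second, so it counts toward pass@k but not pass@1, and it is the only fixture that lowers consistency.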

6. Detect eval saturation

Track the last 3 runs for each agent in evals/transcripts/. If those 3 consecutive runs produce identical grades, flag the agent as "saturated"; the expected ranges may need tightening.
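The saturation rule can be sketched as a comparison over the most recent runs. This assumes each run has been reduced to a mapping from fixture name to grade, ordered oldest first; that shape and the function name `is_saturated` are illustrative, not taken from the transcripts themselves.

```python
def is_saturated(runs, window=3):
    """True when the last `window` runs produced identical grade maps."""
    if len(runs) < window:
        return False  # Not enough history to decide.
    recent = runs[-window:]
    return all(run == recent[0] for run in recent[1:])


history = [
    {"fp-array-mutations.ts": "fail"},
    {"fp-array-mutations.ts": "pass"},
    {"fp-array-mutations.ts": "pass"},
    {"fp-array-mutations.ts": "pass"},
]
print(is_saturated(history), is_saturated(history[:3]))
```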

7. Save transcript

Create evals/transcripts/<timestamp>-<agent>.json:

{
  "timestamp": "2026-03-01T12:00:00Z",
  "agent": "<name>",
  "trials": 1,
  "results": [
    {
      "fixture": "<name>",
      "grade": "pass|fail",
      "checks": {
        "status": "pass|fail",
        "issueCount": "pass|fail",
        "severities": "pass|fail",
        "mustMention": "pass|fail"
      },
      "agentOutput": { "status": "...", "issues": [], "summary": "..." }
    }
  ],
  "summary": {
    "total": 0,
    "passed": 0,
    "failed": 0,
    "passRate": "N%",
    "passAtK": "N%",
    "saturated": ["agent-name"]
  }
}
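A minimal sketch of writing a transcript in the shape shown above. The compact timestamp used in the filename, the rounding of the pass rate, and the `save_transcript` signature are assumptions; the example writes into a temporary directory rather than a real plugin root.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_transcript(root, agent, results, now, trials=1, pass_at_k=None, saturated=()):
    """Write evals/transcripts/<timestamp>-<agent>.json under root and return its path."""
    total = len(results)
    passed = sum(1 for r in results if r["grade"] == "pass")
    pass_rate = f"{round(100 * passed / total)}%" if total else "0%"
    transcript = {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),  # `now` is expected in UTC
        "agent": agent,
        "trials": trials,
        "results": results,
        "summary": {
            "total": total,
            "passed": passed,
            "failed": total - passed,
            "passRate": pass_rate,
            # Assumption: with a single trial, pass@k equals the pass rate.
            "passAtK": pass_at_k if pass_at_k is not None else pass_rate,
            "saturated": list(saturated),
        },
    }
    out_dir = Path(root) / "evals" / "transcripts"
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: a colon-free timestamp keeps the filename portable.
    path = out_dir / f"{now.strftime('%Y%m%dT%H%M%SZ')}-{agent}.json"
    path.write_text(json.dumps(transcript, indent=2), encoding="utf-8")
    return path


results = [
    {"fixture": "fp-array-mutations.ts", "grade": "pass", "checks": {"status": "pass"}},
    {"fixture": "cs-order-service", "grade": "fail", "checks": {"status": "fail"}},
]
with tempfile.TemporaryDirectory() as root:
    when = datetime(2026, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
    path = save_transcript(root, "js-fp-review", results, when)
    loaded = json.loads(path.read_text(encoding="utf-8"))
print(path.name, loaded["summary"]["passRate"])
```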

8. Generate report

Save to evals/reports/<timestamp>-report.md and display:

# Eval Report — <timestamp>

## Summary
| Metric | Value |
| --- | --- |
| Fixtures | N |
| Passed | N |
| Failed | N |
| Pass rate | N% |
| Pass@k | N% (k=N) |
| Saturated | N agents |

## Results by Agent
| Agent | Fixtures | Passed | Failed | Rate |
| --- | --- | --- | --- | --- |
| js-fp-review | 6 | 5 | 1 | 83% |
| ... | | | | |

## Failures
| Fixture | Agent | Check | Expected | Got |
| --- | --- | --- | --- | --- |
| fp-array-mutations.ts | js-fp-review | issueCount | 4-8 | 2 |
| ... | | | | |

## Saturation Warnings
- js-fp-review: 3 identical runs — consider tightening ranges
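One section of the report template can be rendered as follows. This is a sketch that covers only the "Results by Agent" table, assuming per-agent pass and fail counts are already available; the function name and the whole-percent rounding are illustrative choices.

```python
def render_agent_table(counts):
    """counts maps agent name -> (passed, failed); returns the table as Markdown."""
    lines = [
        "## Results by Agent",
        "| Agent | Fixtures | Passed | Failed | Rate |",
        "| --- | --- | --- | --- | --- |",
    ]
    for agent, (passed, failed) in sorted(counts.items()):
        total = passed + failed
        rate = round(100 * passed / total) if total else 0
        lines.append(f"| {agent} | {total} | {passed} | {failed} | {rate}% |")
    return "\n".join(lines)


table = render_agent_table({"js-fp-review": (5, 1)})
print(table)
```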

9. Progress tracking

Copy and update this checklist:

- [ ] Eval corpus resolved
- [ ] Fixtures loaded
- [ ] Expected results loaded
- [ ] Agents executed
- [ ] Results graded
- [ ] Transcript saved
- [ ] Report generated
