一键在 Manus 中运行任何 Skill

$pwd:

agent-evaluator

Name: Agent Evaluator
Author: fjrevoredo

// Evaluate a subagent by running its scenario suite as parallel subagents, scoring results against observable criteria, and producing exact FIND/REPLACE edits for any failing behaviors. Use this skill whenever the user wants to evaluate, test, score, or quality-check a subagent — even if they don't use those exact words. Triggers on: "evaluate the agent", "run evals on [agent]", "assess agent performance", "score the agent", "check if the agent works", "run the eval suite", "test the agent", "quality-check [agent]", "is the agent working correctly", or after applying fixes "did that improve things?". Also triggers proactively after an agent file is edited and the user asks "does it work now?" or similar. Triggers: evaluate agent, run evals, assess agent, score agent, test agent, quality-check, eval suite, agent performance, agent working, improve agent, rerun evals, benchmark agent.

在 Manus 中运行

$ git log --oneline --stat

stars:0

forks:0

updated:2026年5月22日 11:39

文件资源管理器

5 个文件

SKILL.md

readonly

related-skills.json

同仓库

skill-extractor.md

from "fjrevoredo/agent-skills"

Port a repo-specific skill into a generic, reusable skill for the agent-skills library. Use this skill whenever the user asks to extract a skill from a project, generalize a skill, port a skill to another repo, convert a project-specific skill into a reusable one, or remove repo-specific context from a skill. Also use when the user mentions skill extraction, skill generalization, or creating a library skill from a project skill — even if they don't explicitly say "extract" or "generalize." If a skill exists in a project's `.agents/skills/` directory and the user wants to make it reusable across repositories, this skill should be used. Triggers: extract skill, generalize skill, port skill, convert skill, remove repo-specific, make skill reusable, skill extraction, skill generalization, library skill, repo-agnostic skill, skill from project.

2026-05-170

todo-manager.md

from "fjrevoredo/agent-skills"

End-to-end management of TODO items in a Markdown-based backlog. Covers creation with auto-assigned IDs, status tracking, archival of completed items, and format validation. Use this skill whenever the user asks to add a new TODO, check TODO status, archive completed items, validate the TODO system, or set up a new TODO backlog from scratch. Also use when the user mentions a task list, backlog, roadmap, or task tracking in a Markdown file — even if they don't explicitly say "TODO". If the repo has a `TODO.md` file or similar Markdown-based task tracking, this skill should be used to interact with it. Triggers: TODO, todo item, add todo, create todo, todo status, archive todo, validate TODO, TODO ID, TODO-0001, check TODO, todos, list todos, todo list, backlog, task tracking, roadmap, task list.

2026-05-170

skill-improver.md

from "fjrevoredo/agent-skills"

Captures lessons from completed tasks and turns them into concrete improvements. Triggers proactively after high-ceremony tasks (build failures, multi-hour sessions, production incidents) and on-demand when the user asks for "post-mortem", "reflect", "improve the skill", "lessons learned", "what went wrong". Covers skills, repo artifacts (scripts, CI, configs), and process changes. For small/obvious fixes, edits directly. For structural changes, produces a plan. Use this whenever a task surfaced gaps in tooling, documentation, or automation.

2026-05-100

manual-planning.md

from "fjrevoredo/agent-skills"

Create, update, review, and execute manual Markdown implementation plans when harness planning mode is not being used. Use when the user asks for a plan file, manual plan, implementation plan, execution plan, roadmap, task checklist, planning document, or agent-maintained plan with statuses, validations, milestones, approval gates, and cleanup steps.

2026-05-070

package.json

"author": "fjrevoredo"

"repository": "fjrevoredo/agent-skills"

打开 GitHub 仓库查看创作者相关仓库

$ install --global

$ download --local

在 Manus 中运行

$ useful --forSOC

软件质量保证分析师与测试员计算机与数学类职业15-1253L4

name	agent-evaluator
version	1.0.0
description	Evaluate a subagent by running its scenario suite as parallel subagents, scoring results against observable criteria, and producing exact FIND/REPLACE edits for any failing behaviors. Use this skill whenever the user wants to evaluate, test, score, or quality-check a subagent — even if they don't use those exact words. Triggers on: "evaluate the agent", "run evals on [agent]", "assess agent performance", "score the agent", "check if the agent works", "run the eval suite", "test the agent", "quality-check [agent]", "is the agent working correctly", or after applying fixes "did that improve things?". Also triggers proactively after an agent file is edited and the user asks "does it work now?" or similar. Triggers: evaluate agent, run evals, assess agent, score agent, test agent, quality-check, eval suite, agent performance, agent working, improve agent, rerun evals, benchmark agent.

Agent Evaluator

Evaluate subagents by running their scenario suites, scoring each scenario against observable criteria, and producing exact edits for anything that fails. The user should be able to say "evaluate my-agent" and get a scored report plus actionable fixes — no manual steps.

Configurable Paths

These paths are defaults. Override them if the consuming repo uses a different layout.

Field	Default	Role
`agents_dir`	`.claude/agents/`	Directory containing agent `.md` files
`evals_dir`	`<agents_dir>/evals/`	Directory containing scenario files and results

Step 0 — Parse Input and Locate Agent

Accept: an agent name (e.g. my-agent) or a path to an agent .md file.

Resolve the agent file:

If given a name, look in the configured agents_dir:
```
<agents_dir>/<name>.md
```
If given a path, use it directly.

If the file does not exist, stop immediately:

"Agent not found: <resolved-path> does not exist. Check the agent name and make sure it is registered before running an eval."

Locate the evals directory — it lives next to the agent file:

<agents_dir>/<name>.md          ← agent file
<agents_dir>/evals/             ← scenario files, results

If no evals/ directory exists → generate scenarios (see references/scenario-generation.md).

Quick Mode

Check for --quick flag in the user's prompt. Quick mode runs a representative subset of scenarios (roughly 5) that covers the most important behavior categories. Select the subset by scanning scenario filenames for coverage across categories — prioritize scenarios that test output quality, scope discipline, error handling, and safety.

Use quick mode when the user mentions "quick check" or "just the critical ones" after a small prompt edit.

Step 1 — Smoke Test

Before running the full suite, confirm the agent can be invoked correctly:

Invoke the target agent with a simple identity prompt:

"What is your purpose? Reply in one sentence."

If the response is coherent and from the right agent → proceed. If it fails, loads the wrong agent, or returns an error → abort immediately:

"Smoke test failed — the agent could not be invoked. Check that <agent-name> is registered and try again. Full eval aborted."

Do not attempt to run scenarios against a broken invocation path.

Step 2 — Load Scenarios

Read every scenario file in the evals/ directory. For each, extract:

Scenario ID (from filename or content header)
Whether it mutates files (look for setup/cleanup blocks or a mutates: true marker)
Full file content

Apply quick mode filter if active.

Scenario Categories

Scenarios should be organized with a category prefix in their filename:

Prefix	Category	Example
`E-`	Output quality / correctness	`E-01-correct-output-format.md`
`C-`	Code mutation / edit scenarios	`C-01-fix-broken-test.md`
`A-`	Scope and boundary decisions	`A-01-stays-in-scope.md`
`B-`	Build/test integration	`B-01-clean-test-output.md`
`D-`	Dependency/impact analysis	`D-01-affected-scope.md`
`F-`	Safety and constraint respect	`F-01-no-unauthorized-edits.md`
`GEN-`	Auto-generated (see scenario generation)	`GEN-01-happy-path.md`

This convention is recommended but not required — the evaluator works with any filenames matching *.md in the evals directory.

Step 3 — Pre-compute Discovery Data

Several scenarios may need repo-specific facts before they can run. Compute these once and inject them into the relevant scenario prompts. This prevents each scorer from re-running the same shell commands (wastes tokens) or doing it incorrectly.

How to identify what to discover:

Scan scenario files for DISCOVERY: placeholders or instructions that say "find a project with X".
Run the necessary shell commands to resolve those placeholders.
Inject results as DISCOVERY: lines at the top of the relevant scenario prompts before spawning scorers.

Example format:

DISCOVERY: project_without_tests = libs/my-lib
DISCOVERY: first_spec_file = my-component.spec.ts

If a discovery returns nothing (e.g. no matching projects), inject:

DISCOVERY: test_framework_project = none — mark relevant scenario as SCORE: na

The discovery commands are repo-specific — read the scenario files to determine what facts are needed. Common patterns include: finding projects with specific configurations, locating test files, identifying build targets.

Step 4a — Non-mutating Scenarios

The orchestrator — not the scorer — invokes the target agent. This prevents scorer subagents from loading the wrong skill, which can produce false scores.

For each non-mutating scenario:

Extract the exact prompt from the scenario file (typically the line after > ").
Inject discovery data if applicable.
Invoke the target agent directly with the exact scenario prompt.
Capture the full response.
Pass it to a scorer subagent with the non-mutating template from references/scorer-prompt-template.md.

Spawn all scorer subagents in a single message so they run in parallel.

Step 4b — Mutating Scenarios

These temporarily edit source files. Run one at a time using isolated worktrees (if available) or careful setup/cleanup so mutations cannot conflict.

For each mutating scenario, spawn a scorer that handles setup → invoke → score → cleanup. Use the mutating variant from references/scorer-prompt-template.md.

Run mutating scenarios sequentially. Wait for each to complete before starting the next.

Step 5 — Validate and Collect Result Blocks

As each subagent returns, immediately validate that the response ends with ---END---. Collect the result blocks in memory as they arrive — do not write to disk yet.

If a block is missing or malformed, synthesize a fallback immediately:

---RESULT---
SCENARIO: [ID]
SCORE: infra-fail
CRITERION_runner_completed: no
AGENT_OUTPUT: scorer did not return a result block
NOTES: re-run this scenario manually
---END---

infra-fail is excluded from the score denominator (same as na). It signals an infrastructure problem, not an agent behavior problem — do not let it inflate failure counts.

Step 6 — Consistency Check

Before writing results, scan for semantically equivalent scenario pairs. These are scenarios that test the same underlying behavior from different angles.

Look for pairs in scenario metadata or identify them by comparing what_being_tested fields. If two equivalent scenarios differ by more than 1 point, flag them:

⚠️ CONSISTENCY: [ID-A] scored 0, [ID-B] scored 3 — same behavior, high divergence.
   This likely indicates an invocation problem on [ID-A], not an agent regression.
   Treat [ID-A] as infra-fail and re-run manually to confirm.

Step 7 — Write COLLECT-RESULTS.md

Once all result blocks are collected (from scorers or synthesized as fallbacks), overwrite <evals-dir>/COLLECT-RESULTS.md in a single write:

# Eval Results — <agent-name>
_Collected by agent-evaluator skill — <date>_

[all result blocks, one per scenario, separated by blank lines]

Step 8 — Interpret Results

Read references/interpreter-prompt.md. Substitute:

<AGENT_FILE_CONTENT> → full content of the agent file
<RESULTS> → full content of COLLECT-RESULTS.md
<EVALS_DIR> → path to the evals directory
<AGENT_NAME> → the agent's name

Spawn an interpreter subagent with the substituted prompt.

The interpreter writes its output to <evals-dir>/LAST-INTERPRET-OUTPUT.md.

If the interpreter subagent fails, write a minimal fallback to that file:

# Interpreter failed
Run the interpretation manually with the collected results below.

[paste COLLECT-RESULTS.md content]

Step 9 — Present Results and Offer Edits

Display to the user:

The score summary table (copied from LAST-INTERPRET-OUTPUT.md).
Count of scenarios scoring 0 or 1 (excluding na and infra-fail).
Count of proposed edits.

Then ask:

"Found [N] scenarios scoring 0–1. [M] edits proposed. Apply them now?"

If yes:

Apply each FIND / REPLACE WITH block to the agent file using the Edit tool.
Re-run only the failing scenarios (repeat steps 3–8 scoped to those IDs).
Report the delta: "X improved, Y still failing."

If no:

Confirm: "Fixes are in <evals-dir>/LAST-INTERPRET-OUTPUT.md."
Done.

Reference Files

File	When to read
`references/scorer-prompt-template.md`	When building scorer subagent prompts (steps 4a, 4b)
`references/interpreter-prompt.md`	When building the interpreter subagent prompt (step 8)
`references/scenario-format.md`	Wire format, scoring rubric, result block spec — reference when validating or synthesizing blocks
`references/scenario-generation.md`	When no evals/ dir exists and scenarios must be generated

Gotchas

Scorer invokes the wrong agent. If the scorer subagent uses a Skill tool instead of the Agent tool, it may load a skill instead of the target agent. That's why the orchestrator (you) must invoke the target agent for non-mutating scenarios and pass the captured response to the scorer. The scorer only scores — it never invokes.
Missing ---END--- delimiter. If a scorer's response doesn't end with ---END---, the result block is considered malformed. Synthesize a fallback infra-fail block rather than trying to parse a partial result.
Mutating scenarios conflict. Never run multiple mutating scenarios in parallel — they edit source files and can corrupt each other's state. Always run them sequentially, and always verify cleanup completed before starting the next one.
Discovery data stales. Pre-computed discovery data reflects the repo at eval time. If someone pushes changes mid-eval, discovery may be stale. For long-running evals, re-run discovery before each mutating scenario.
infra-fail vs. genuine failure. When a scenario fails to invoke the agent at all (wrong name, tool unavailable), mark it infra-fail, not 0. infra-fail is excluded from the score denominator so it doesn't unfairly drag down the agent's score. Reserve 0 for cases where the agent ran but produced wrong behavior.
Quick mode selection. When using quick mode, don't just pick the first 5 scenarios. Select for coverage across behavior categories (output quality, scope discipline, error handling, safety) to get a meaningful signal from fewer runs.

agent-evaluator

同仓库更多 Skills

同仓库更多 Skills

Agent Evaluator

Configurable Paths

Step 0 — Parse Input and Locate Agent

Quick Mode

Step 1 — Smoke Test

Step 2 — Load Scenarios

Scenario Categories

Step 3 — Pre-compute Discovery Data

Step 4a — Non-mutating Scenarios

Step 4b — Mutating Scenarios

Step 5 — Validate and Collect Result Blocks

Step 6 — Consistency Check

Step 7 — Write COLLECT-RESULTS.md

Step 8 — Interpret Results

Step 9 — Present Results and Offer Edits

Reference Files

Gotchas

Agent Evaluator

Configurable Paths

Step 0 — Parse Input and Locate Agent

Quick Mode

Step 1 — Smoke Test

Step 2 — Load Scenarios

Scenario Categories

Step 3 — Pre-compute Discovery Data

Step 4a — Non-mutating Scenarios

Step 4b — Mutating Scenarios

Step 5 — Validate and Collect Result Blocks

Step 6 — Consistency Check

Step 7 — Write COLLECT-RESULTS.md

Step 8 — Interpret Results

Step 9 — Present Results and Offer Edits

Reference Files

Gotchas