| name | eval-agents |
| description | Audit Claude Code agents defined in .claude/agents/ for description specificity, model tier appropriateness, tools scoping, and system prompt quality. Detects dispatch ambiguity between agents, flags over-permissive tool grants, and checks for human-in-the-loop patterns that break programmatic orchestration. Use when onboarding to a project with existing agents, after adding new agents to a fleet, or when an orchestrator consistently selects the wrong agent. |
| allowed-tools | Read Glob Bash Edit |
| effort | medium |
| argument-hint | [path to agents dir, default: .claude/agents/] |
Agent Evaluator
Discover all agents in scope, score each one across five criteria, then run an interactive session to confirm or improve them one by one.
Agents are not just scripts: they are callable units selected by orchestrators based on their description. A vague description silently breaks multi-agent workflows. The goal here is not just scoring; it is leaving every agent correctly scoped, correctly modeled, and safe to call from an orchestrator.
When to Use
- First time auditing an agent fleet before wiring it into an orchestration pipeline
- An orchestrator keeps selecting the wrong agent for a task
- After copying agents from another project or importing a plugin
- A new agent was added; checking whether it conflicts with existing ones
- Periodic hygiene: "do all these agents still do something distinct?"
Key Concepts
Agent file locations
| Location | Scope | Committed? |
|---|
.claude/agents/<name>.md | Project (flat file) | Yes |
.claude/agents/<name>/AGENT.md | Project (directory-based) | Yes |
~/.claude/agents/<name>.md | User-level | No |
Plugin agents/*.md | Per plugin | Yes (in plugin) |
Both flat files and directory-based agents are valid. The name: field is the identifier used by orchestrators and hooks (passed as agent_type). The filename does not have to match, though keeping them aligned is recommended.
Frontmatter fields
| Field | Required | Notes |
|---|
name | Yes | Unique identifier (kebab-case). Hooks receive this as agent_type. Filename does not have to match. |
description | Yes | Used by orchestrators to select agents. Vague = wrong agent selected. |
model | No | Defaults to session model if absent. Explicit is always better. |
tools | No | Explicit tool allow-list. Defaults to session tools if absent (risky). Note: allowed-tools is the equivalent field for skill files (not valid in agent files). |
disallowedTools | No | Blocks specific tools even if granted by the session. |
permissionMode | No | default, acceptEdits, auto, dontAsk, bypassPermissions, plan. Flag bypassPermissions as a security risk equivalent to unconstrained tools. |
effort | No | low, medium, high, xhigh, max. Overrides session effort. |
isolation | No | worktree runs the agent in a temporary git worktree (isolated copy of the repo). |
maxTurns | No | Maximum agentic turns before the agent stops. Prevents runaway loops in pipelines. |
memory | No | user, project, or local. Persistent cross-session memory directory. |
background | No | true always runs this agent as a background task. |
Why description quality matters more for agents than for skills
Skills are invoked by humans typing a slash command. Agents are selected programmatically by orchestrators comparing all candidate descriptions against a task. Two agents with similar descriptions produce non-deterministic selection; the orchestrator will flip between them across runs, giving inconsistent results. Description overlap is a correctness bug, not a style issue.
Good agent description pattern:
"Use when X and NOT Y. Handles Z. Returns [format]."
Weak pattern:
"Helps with code quality and review."
Model tier selection
| Task type | Model | Signals |
|---|
| Mechanical: lookup, pattern match, binary pass/fail | haiku | grep results, count files, extract value, apply template |
| Judgment: review, analyze, write, debug (bounded scope) | sonnet | code review, test writing, documentation, diagnosis |
| Deep synthesis: architecture, adversarial, cross-system | opus | security audit, ADR generation, system design review |
Using opus for a task that only needs haiku burns 20-50x more tokens per call. In a pipeline that calls an agent 100 times, that becomes expensive. Flag model-task mismatches both directions: under-powered (produces shallow output) and over-powered (wastes budget).
Human-in-the-loop anti-patterns
Agents called from orchestrators run headlessly. Any system prompt instruction that assumes a human is watching will silently break or stall the pipeline:
| Anti-pattern | What breaks |
|---|
| "Ask the user for clarification" | Agent blocks waiting for input that never comes |
| "Confirm before deleting" | Same: no confirmation possible |
| "Show results and wait for approval" | Orchestrator receives no output, times out |
| "If unsure, ask" | Produces empty output instead of best-effort result |
Safe alternative: "If context is insufficient, return { "status": "needs_context", "missing": [...] } and stop."
Scoring Criteria (15 pts per agent)
| # | Criterion | Max | What is checked |
|---|
| 1 | name | 1 | name field present and uses kebab-case; recommend matching the filename for clarity |
| 2 | description | 4 | Present (1pt), contains trigger phrasing specifying WHEN to use (1pt), specific enough to distinguish from other agents in the same fleet (1pt), concise and front-loaded with the most important triggers first (1pt) |
| 3 | model | 2 | Present (1pt), appropriate tier for task complexity using matrix above (1pt) |
| 4 | tools | 3 | tools field present and explicit (1pt), no * wildcard without written justification in the system prompt (1pt), write-capable tools (Bash, Edit, Write) only present when the agent's task requires file mutation (1pt) |
| 5 | system prompt | 5 | Has role/scope statement in first paragraph (1pt), defines task boundaries (what it does AND does not do) (1pt), specifies output format or return contract (1pt), no placeholder text or TODO comments (1pt), no human-in-the-loop patterns from the table above (1pt) |
| Bonus | hardening | +1 | effort field present and inferred-appropriate for task complexity |
Thresholds:
- Good: >=12/15 (>=80%)
- Needs work: 9-11/15 (60-79%)
- Fix: <9/15 (<60%)
No-tools agents (no tools: field): they inherit the session tool set, which is usually broader than needed. Treat this as 0/3 on criterion 4 and flag explicitly: unconstrained tool access in a fleet agent is a security surface.
Execution Instructions
Step 1: Discovery
Find all agent files in scope:
find .claude/agents -maxdepth 1 -name "*.md" 2>/dev/null
find .claude/agents -name "AGENT.md" 2>/dev/null
find ~/.claude/agents -name "*.md" 2>/dev/null
If an argument was passed (e.g., /eval-agents examples/agents/), use that path instead.
Build a flat list of agent records:
source_file: path to the .md file
agent_name: directory name or filename stem
model: declared model or (inherits session)
tools: list of declared tools or (inherits session)
description_length: character count of description field
system_prompt_length: line count of body after frontmatter
If no agents are found, report it and stop.
Step 2: Parse and classify
For each agent:
- Read the full file
- Extract YAML frontmatter
- Collect all frontmatter fields (flag any unsupported/unknown fields)
- Read the body: line count, presence of role statement, output format declaration
- Scan for human-in-the-loop anti-patterns (keyword scan: "ask the user", "confirm", "wait for approval", "if unsure ask")
Step 3: Check description overlap
Compare descriptions pairwise across all agents in the same directory:
For each pair (A, B):
- If their
description fields share more than 3 key nouns or verbs, flag as potential dispatch conflict
- If both use identical trigger phrasing ("Use for code review"), flag as duplicate
Report conflicts as:
⚠️ Dispatch conflict: code-reviewer.md ↔ integration-reviewer.md
Both describe "code review" and "quality checks"; orchestrator may conflate them.
Suggestion: differentiate by scope (PR diff vs full module vs API surface).
Step 4: Model appropriateness check
For each agent, infer the appropriate model tier from the system prompt content:
Haiku signals (mechanical): "extract", "count", "list", "find files matching", "output JSON with fields X Y Z", "grep for", "check if exists", verbs with no judgment required
Sonnet signals (judgment): "review", "analyze", "debug", "write tests", "document", "suggest improvements", "classify", bounded scope per invocation
Opus signals (deep synthesis): "architect", "threat model", "security audit", "design decision", "evaluate trade-offs across", "cross-cutting", "multi-system", scope spans entire codebase or long-horizon reasoning
Flag mismatches:
- Declared
haiku, inferred sonnet or higher: ⚠️ may produce shallow output
- Declared
opus, inferred haiku: ⚠️ burning 20-50x budget for mechanical work
Step 5: Interactive review (core of the skill)
Process agents one by one. Do not batch and skip the interaction.
For each agent:
Show:
Agent: code-reviewer.md [.claude/agents/]
model: sonnet
tools: Read, Grep, Glob
description: "Use for thorough code review with quality, security, and performance checks"
System prompt: 67 lines
Inferred model tier: sonnet ✅
Description specificity: ⚠️ overlaps with integration-reviewer.md on "quality checks"
Human-in-the-loop patterns: none found ✅
Write tools: none (read-only) ✅
Ask four questions:
- "Is the description specific enough to distinguish this agent from similar ones? (y = yes / n = rewrite)"
- "Is the model tier right for what this agent does? (y / change to haiku/sonnet/opus)"
- "Are the tools correctly scoped? (y / add / remove)"
- "Anything to update in the system prompt? (describe or skip)"
If the user provides changes during the interaction: apply them using Edit, confirm each change, then move to the next agent.
For agents with no tools field:
Show:
Agent: planner.md [.claude/agents/]
model: (inherits session)
tools: (inherits session, UNCONSTRAINED)
Flag strongly: no explicit tools field means the agent inherits whatever the calling session has, including write access it may not need. Ask whether to add an explicit tools: constraint.
Step 6: Output report
After all agents are reviewed:
# Agents Audit: [project name or path]
Date: [today] | Scanned: N agents
## Summary
| Status | Count |
|--------|-------|
| ✅ Good (>=80%) | N |
| ⚠️ Needs work (60-79%) | N |
| ❌ Fix (<60%) | N |
| ⚠️ Dispatch conflicts detected | N pairs |
| ⚠️ No explicit tools field | N |
| ✅ User confirmed useful | N |
| ⚠️ User flagged for update | N |
| 🗑️ User marked as stale | N |
---
## Per-Agent Results
### code-reviewer.md (11/15 ⚠️)
model: sonnet (inferred: sonnet ✅)
tools: Read, Grep, Glob (read-only ✅)
description length: 63 chars
| Criterion | Score | Notes |
|-----------|-------|-------|
| name | ✅ 1/1 | kebab-case, matches filename |
| description | ⚠️ 2/4 | Present, trigger phrasing ok, but overlaps with integration-reviewer.md; over 1,536 chars |
| model | ✅ 2/2 | sonnet appropriate for bounded code review |
| tools | ✅ 3/3 | explicit, read-only, no wildcard |
| system prompt | ⚠️ 3/5 | role statement present, boundaries defined, no output format specified, no placeholder, no human-in-the-loop patterns |
| hardening (bonus) | ❌ 0/1 | no effort field |
**Priority fixes:**
1. Narrow description to distinguish from integration-reviewer.md (scope: single file/PR diff, not API surface)
2. Add output format contract to system prompt (e.g., returns markdown with severity table)
3. Add `effort: medium`
User feedback: description updated ✅
System prompt: output format added ✅
---
Step 7: Fix Summary
## What Changed This Session
code-reviewer.md:
- Description narrowed: added "PR diff scope, not API surface"
- System prompt: added output format contract
- Added effort: medium
planner.md:
- Added tools: Read, Write (was inheriting session tools)
- User confirmed model: sonnet is appropriate
security-auditor.md:
- User flagged as stale (awaiting explicit deletion confirmation)
---
N agents audited · N edits applied · N dispatch conflicts resolved · N flagged as stale
For any agent the user marked as stale: ask for explicit confirmation before removing the file. Never delete without a clear "yes, remove it".
Edge Cases
- No
tools: field: agent inherits session tools; flag as 0/3 on tools criterion, always raise in interactive step regardless of other scores
tools: "*" wildcard: full session access, treat as equivalent to no tools field unless the system prompt explicitly justifies why unrestricted tools are needed (e.g., a general-purpose agent)
permissionMode: bypassPermissions: flag immediately, same severity as an unconstrained tools field; the agent can execute operations without approval including writes to .git, .claude, and config directories
model field absent: flag as ⚠️ in model criterion; add model inference note ("inferred sonnet from content signals")
- Description under 20 chars: too vague to be usable for dispatch; flag as 0/4 on description criterion
- Excessively long description: no confirmed character limit exists in the official docs; front-load the most important trigger conditions since orchestrators use early tokens for dispatch; flag descriptions over 800 chars as a style warning
- System prompt under 10 lines: likely a stub or placeholder; flag for content quality review
isolation: worktree on a read-only agent: filesystem isolation is unnecessary when the agent has no write tools (Edit, Write, Bash); flag as redundant overhead and ask whether isolation is intentional
- Human-in-the-loop pattern in a fork agent: especially problematic since forked agents have no terminal to interact with; flag as ❌ on system prompt criterion 5
- Duplicate agent names (same
name: field, different files): runtime loads both, last one wins; flag immediately
- Agent file not readable: report and skip (may be a permissions issue or empty file)
disallowedTools with a tool not in tools: redundant but harmless; note it
- Flat
.claude/agents/ file with same stem as a subdirectory agent: naming conflict, flag which one the runtime will use
- Description that says "use for everything" or "general purpose": acceptable only for catch-all agents placed intentionally at the bottom of a dispatch priority stack; flag others
- Two agents with identical
model and tools but different descriptions: may be legitimate specializations or may be redundant copies from a refactor; surface to user during interactive step