agent-evaluation
Evaluate agents and skills for quality and standards compliance.
| field | value |
|---|---|
| name | agent-evaluation |
| description | Evaluate agents and skills for quality and standards compliance. |
| user-invocable | false |
| context | fork |
| allowed-tools | ["Read","Grep","Glob","Bash"] |
| routing | {"triggers":["evaluate agent","audit agent","score skill","check quality","grade agent","agent quality"],"category":"meta-tooling","pairs_with":["agent-comparison","skill-eval","skill-creator"]} |
Objective, evidence-based quality assessment for agents and skills. Implements a 6-phase rubric: Identify, Structural, Content, Code, Integration, Report. Every finding must cite a file path and line number — no subjective "looks good" verdicts.
| Signal | Load These Files | Why |
|---|---|---|
| Evaluating multiple targets at once | batch-evaluation.md | Batch evaluation procedures and collection summary format. |
| Diagnosing a failed check | common-issues.md | Frequently found issues with fix templates. |
| Writing the final report | report-templates.md | Standard report format templates (single, batch, comparison). |
| Scoring a rubric category | scoring-rubric.md | Full/partial/no credit breakdowns per category. |
Goal: Determine what to evaluate and confirm targets exist.
Read the repository CLAUDE.md first to understand current standards before evaluating anything. Only evaluate what was explicitly requested — do not speculatively analyze additional agents or skills.
```bash
# List all agents
ls agents/*.md | wc -l
# List all skills
ls -d skills/*/ | wc -l
# Verify a specific target
ls agents/{name}.md
ls -la skills/{name}/
```
Gate: All targets confirmed to exist on disk. Proceed only when gate passes.
Goal: Check that required components exist and are well-formed.
Score every rubric category — never skip a category even if it "looks fine." Parse each required field explicitly rather than eyeballing YAML. Record PASS/FAIL with the line number for each check.
Run score-component.py to get deterministic PASS/FAIL for all structural checks. The script implements the full ADR-031 rubric (frontmatter, operator context, error handling, referenced files, anti-patterns) and outputs per-check results with line references. Do not re-implement these checks inline — read the JSON output and move directly to scoring.
```bash
# Deterministic structural checks via score-component.py
python3 scripts/score-component.py agents/{name}.md --json
# ...or for a skill:
python3 scripts/score-component.py skills/{name}/SKILL.md --json
```
The JSON output includes results[0].checks (per-check status, earned_points, max_points, detail) and results[0].total (aggregate score). Record each check status from the JSON — do not re-run grep -c for sections the script already covers.
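The shape above can be consumed directly. A minimal sketch, assuming the results[0].checks / results[0].total layout described; the sample payload, the "name" field, and the check names are hypothetical:

```python
import json

def summarize_checks(payload: dict) -> tuple[int, list[str]]:
    """Pull the aggregate score and any failing checks out of the
    score-component.py JSON (shape assumed from the description above)."""
    result = payload["results"][0]
    failures = [
        f"{c['name']}: {c['detail']} ({c['earned_points']}/{c['max_points']})"
        for c in result["checks"]
        if c["status"] != "PASS"
    ]
    return result["total"], failures

# Hypothetical payload illustrating the assumed shape.
sample = {"results": [{"total": 50, "checks": [
    {"name": "frontmatter", "status": "PASS", "earned_points": 10,
     "max_points": 10, "detail": "all fields present"},
    {"name": "error-handling", "status": "FAIL", "earned_points": 0,
     "max_points": 10, "detail": "section missing"},
]}]}

total, failures = summarize_checks(sample)
```

Recording the failures list verbatim in the report satisfies the evidence requirement without re-running grep.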
What score-component.py covers (do not duplicate):
- allowed-tools list format vs comma-separated string
- description pipe format with WHAT + WHEN + negative constraint
- version set to 2.0.0
- Operator context, error handling, referenced files, and anti-pattern section checks

What requires LLM judgment in Phase 3+ (not covered by the script):
- Content depth and quality, code example validity, and cross-reference consistency

Structural Scoring (60 points):
| Component | Points | Requirement |
|---|---|---|
| YAML front matter | 10 | All required fields, list format, pipe description |
| Operator Context | 20 | All 3 behavior types with correct item counts |
| Error Handling | 10 | Section present with documented errors |
| Examples (agents) / References (skills) | 10 | 3+ examples or 2+ reference files |
| CAN/CANNOT | 5 | Both sections present with concrete items |
| Anti-Patterns | 5 | 3-5 domain-specific patterns with 3-part structure |
Integration Scoring (10 points):
| Component | Points | Requirement |
|---|---|---|
| References and cross-references | 5 | Shared patterns linked, all refs resolve |
| Tool and link consistency | 5 | allowed-tools matches usage, anti-rationalization table present |
See references/scoring-rubric.md for full/partial/no credit breakdowns.
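The two tables combine into a 70-point structural-plus-integration subtotal. A sketch under the assumption that each category is clamped to its rubric maximum; the category keys are illustrative names, the maxima come from the tables above:

```python
STRUCTURAL_MAX = {          # from the Structural Scoring table (60 pts)
    "yaml_front_matter": 10,
    "operator_context": 20,
    "error_handling": 10,
    "examples_or_references": 10,
    "can_cannot": 5,
    "anti_patterns": 5,
}
INTEGRATION_MAX = {         # from the Integration Scoring table (10 pts)
    "cross_references": 5,
    "tool_link_consistency": 5,
}

def total_score(earned: dict) -> int:
    """Sum earned points across both tables, clamping each
    category to its rubric maximum."""
    maxima = {**STRUCTURAL_MAX, **INTEGRATION_MAX}
    return sum(min(earned.get(cat, 0), cap) for cat, cap in maxima.items())

# Sanity checks against the tables' stated totals.
assert sum(STRUCTURAL_MAX.values()) == 60
assert sum(INTEGRATION_MAX.values()) == 10
```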
Gate: All structural checks scored with evidence. Proceed only when gate passes.
Goal: Measure content quality and volume.
Do not estimate length by impression — count lines and calculate the score. "Content is long enough" is not a measurement.
```bash
# Skill total lines (SKILL.md + references)
skill_lines=$(wc -l < skills/{name}/SKILL.md)
ref_lines=$(cat skills/{name}/references/*.md 2>/dev/null | wc -l)
total=$((skill_lines + ref_lines))
# Agent total lines
agent_lines=$(wc -l < agents/{name}.md)
```
Depth Scoring (30 points max):
| Total Lines | Score | Grade |
|---|---|---|
| >1500 (skills) / >2000 (agents) | 30 | EXCELLENT |
| 500-1500 / 1000-2000 | 22 | GOOD |
| 300-500 / 500-1000 | 15 | ADEQUATE |
| 150-300 / 200-500 | 8 | THIN |
| <150 / <200 | 0 | INSUFFICIENT |
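The depth table can be applied mechanically. A sketch; note the table's boundary values overlap (e.g. exactly 500 lines for a skill), so the strict greater-than comparisons here are an assumption:

```python
def depth_score(total_lines: int, kind: str = "skill") -> tuple[int, str]:
    """Map a total line count to (score, grade) per the depth table.
    Thresholds differ for skills vs agents."""
    bands = {
        "skill": [(1500, 30, "EXCELLENT"), (500, 22, "GOOD"),
                  (300, 15, "ADEQUATE"), (150, 8, "THIN")],
        "agent": [(2000, 30, "EXCELLENT"), (1000, 22, "GOOD"),
                  (500, 15, "ADEQUATE"), (200, 8, "THIN")],
    }
    for threshold, score, grade in bands[kind]:
        if total_lines > threshold:
            return score, grade
    return 0, "INSUFFICIENT"
```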
Gate: Depth score calculated. Proceed only when gate passes.
Goal: Validate that code examples and scripts are functional.
A script existing on disk does not mean it works — run python3 -m py_compile on every .py file. Search for placeholder text in every file, not just files that "look incomplete."
Code checks:
- Run python3 -m py_compile on all .py files
- Search every file for placeholder markers: [TODO], [TBD], [PLACEHOLDER], [INSERT]
- Count untagged (bare ```) vs tagged (```language) code blocks

````bash
# Syntax-check any .py scripts found in the skill's scripts/ directory
python3 -m py_compile scripts/*.py
# Placeholder search
grep -nE '\[TODO\]|\[TBD\]|\[PLACEHOLDER\]|\[INSERT\]' {file}
# Untagged code blocks
grep -c '```$' {file}
````
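Since the report should quote the specific compile error per file, a sketch using only the stdlib py_compile module to collect error messages instead of a bare pass/fail exit code:

```python
import py_compile
from pathlib import Path

def check_scripts(script_dir: str) -> dict[str, str]:
    """Compile every .py file under script_dir and collect the exact
    error message for any that fail (empty dict means all pass)."""
    errors = {}
    for path in sorted(Path(script_dir).glob("*.py")):
        try:
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            errors[str(path)] = exc.msg
    return errors
```

Each entry in the returned dict is ready to paste into the report's evidence column.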
Gate: All code checks complete. Proceed only when gate passes.
Goal: Confirm cross-references and tool declarations are consistent.
Reference Resolution:
- Every file referenced under references/ exists on disk
- Shared-pattern links (../shared-patterns/) resolve

Tool Consistency:
- Extract allowed-tools from the YAML front matter
- Every tool invoked in the body appears in allowed-tools

Anti-Rationalization Table:
- The component references anti-rationalization-core.md

```bash
# Check referenced files exist
grep -oE 'references/[a-z-]+\.md' skills/{name}/SKILL.md | while read -r ref; do
  ls "skills/{name}/$ref" 2>/dev/null || echo "MISSING: $ref"
done
# Check tool consistency
grep "allowed-tools:" skills/{name}/SKILL.md
grep -oE '(Read|Write|Edit|Bash|Grep|Glob|Task|WebSearch)' skills/{name}/SKILL.md | sort -u
# Check anti-rationalization reference
grep -c "anti-rationalization-core" skills/{name}/SKILL.md
```
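The two tool-consistency greps can be folded into one comparison. A sketch; the tool list mirrors the grep pattern above, and parsing the JSON-list allowed-tools style shown in this skill's own front matter is an assumption:

```python
import re

KNOWN_TOOLS = {"Read", "Write", "Edit", "Bash", "Grep", "Glob", "Task", "WebSearch"}

def undeclared_tools(skill_md: str) -> set[str]:
    """Tools mentioned in the document but missing from allowed-tools.
    Assumes the JSON-list frontmatter style, e.g. allowed-tools: ["Read","Grep"]."""
    m = re.search(r'allowed-tools.*?\[(.*?)\]', skill_md, re.S)
    declared = set(re.findall(r'"(\w+)"', m.group(1))) if m else set()
    used = {t for t in KNOWN_TOOLS if re.search(rf'\b{t}\b', skill_md)}
    # Anything used but not declared is a MEDIUM-priority finding.
    return used - declared

doc = 'allowed-tools: ["Read","Grep"]\nUse Read, then Bash to run checks.'
```

A non-empty result maps straight to the "Tool and link consistency" row of the integration rubric.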
Gate: All integration checks complete. Proceed only when gate passes.
Goal: Compile all findings into the standard report format.
Show all test results with individual scores — never summarize as "all tests pass." Sort findings by impact (HIGH / MEDIUM / LOW). Include specific, actionable recommendations with file paths and line numbers. When batch evaluating, show how each item compares to collection averages; do not report "most are good quality" without quantitative data.
This phase is read-only: report findings but never modify agents or skills. Use skill-creator for fixes. Clean up any intermediate analysis files created during evaluation.
Use the report template from references/report-templates.md. The report MUST include every section of that template, with a file path and line number cited for each finding.
Issue Priority Classification:
| Priority | Criteria | Examples |
|---|---|---|
| HIGH | Missing required section or broken functionality | No Operator Context, syntax errors in scripts |
| MEDIUM | Section present but incomplete or non-compliant | Wrong item counts, old allowed-tools format |
| LOW | Cosmetic or minor quality issues | Untagged code blocks, missing changelog |
Grade Boundaries:
| Score | Grade | Interpretation |
|---|---|---|
| 90-100 | A | Production ready, exemplary |
| 80-89 | B | Good, minor improvements needed |
| 70-79 | C | Adequate, some gaps to address |
| 60-69 | D | Below standard, significant work needed |
| <60 | F | Major overhaul required |
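The grade boundaries reduce to a small lookup. A minimal sketch:

```python
def letter_grade(score: int) -> str:
    """Map a 0-100 score to the grade boundaries table above."""
    for floor, grade in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if score >= floor:
            return grade
    return "F"
```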
Gate: Report generated with all sections populated and evidence cited. Evaluation complete.
User says: "Evaluate the test-driven-development skill" Actions:
- Confirm skills/testing/test-driven-development/ exists (IDENTIFY)
- Run phases 2-6 and produce a single-component report

User says: "Audit all agents and skills" Actions:
- Enumerate agents/*.md and skills/*/ (IDENTIFY)
- Score each item per references/batch-evaluation.md and report against collection averages

User says: "Check if systematic-refactoring skill meets v2 standards" Actions:
- Confirm skills/systematic-refactoring/ exists (IDENTIFY)
- Verify allowed-tools format, pipe description, version 2.0.0 (STRUCTURAL)

Issue: Target not found on disk
Cause: Agent or skill path incorrect, or item was deleted
Solution: Verify path exists with ls before evaluation. If truly missing, exclude from batch and note in report.
Issue: YAML front matter fails to parse
Cause: Malformed YAML (missing --- delimiters, bad indentation, or invalid syntax)
Solution: Flag as HIGH priority structural failure. Score YAML section as 0/10. Include the specific parse error in the report.
Issue: Validation script fails to compile
Cause: Validation script has syntax issues
Solution: Run python3 -m py_compile and capture the specific error. Score validation script as 0/10. Include error output in report.
Issue: Operator Context item counts out of range
Cause: v2 standard requires Hardcoded 5-8, Default 5-8, Optional 3-5 items. Skill has too few or too many.
Solution: Award partial credit for Operator Context, flag as MEDIUM priority, and recommend adjusting each behavior type to the required counts.

Reference files:
- ${CLAUDE_SKILL_DIR}/references/scoring-rubric.md - Full/partial/no credit breakdowns per rubric category
- ${CLAUDE_SKILL_DIR}/references/report-templates.md - Standard report format templates (single, batch, comparison)
- ${CLAUDE_SKILL_DIR}/references/common-issues.md - Frequently found issues with fix templates
- ${CLAUDE_SKILL_DIR}/references/batch-evaluation.md - Batch evaluation procedures and collection summary format