judge
Evaluates the execution quality of any skill or agent using 7-dimension scoring with configurable rubrics.
| Field | Value |
|---|---|
| name | judge |
| displayName | Verdict — Universal Quality Evaluator |
| description | Evaluates the execution quality of any skill or agent using 7-dimension scoring with configurable rubrics |
| version | 2.0.2 |
| author | Sattyam Jain |
| allowed-tools | ["Read","Write","Edit","Bash"] |
| autoActivate | ["when the user asks to judge, evaluate, score, or rate a skill's output","when the user asks about skill quality or execution quality","when referenced by /judge command"] |
Verdict is a universal quality evaluator for Claude Code skills and agents. It measures execution quality across 7 weighted dimensions, producing an evidence-based scorecard with letter grades, justifications, and actionable recommendations.
Verdict operates in two modes:
- Auto mode — hooks into the Stop lifecycle event and automatically evaluates every execution. No user intervention required. Scores are persisted to skills/judge/scores/ for trend analysis.
- Manual mode — invoked via the /judge command. The user specifies a skill name and optionally a transcript path. Useful for on-demand evaluation, re-scoring, or benchmarking.

Both modes produce the same structured scorecard output.
Verdict evaluates across 7 dimensions. Each dimension receives a score from 1.0 to 10.0. The weighted composite determines the final grade.
| # | Dimension | Weight | What It Measures |
|---|---|---|---|
| 1 | Correctness | 25% | Output is factually correct. Code compiles and runs. No logical errors or bugs. |
| 2 | Completeness | 20% | All requirements from the prompt/task are addressed. Nothing is missing or skipped. |
| 3 | Adherence | 15% | The skill/agent followed its own SKILL.md or agent definition instructions precisely. |
| 4 | Actionability | 15% | Output is immediately usable without further manual work, fixes, or interpretation. |
| 5 | Efficiency | 10% | Minimal token waste. Appropriate tool usage. No unnecessary steps or redundant calls. |
| 6 | Safety | 10% | No harmful outputs. No data leaks. No destructive or irreversible actions taken without confirmation. |
| 7 | Consistency | 5% | Quality matches or exceeds previous executions of the same skill/agent. |
Composite formula:
```
composite = (correctness * 0.25) + (completeness * 0.20) + (adherence * 0.15)
          + (actionability * 0.15) + (efficiency * 0.10) + (safety * 0.10)
          + (consistency * 0.05)
```
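For illustration, the same weighted sum in Python — a sketch only, not the shipped score.py implementation:

```python
# Illustrative sketch of the weighted composite; dimension names and weights
# mirror the table above, but this is not the shipped scoring code.
WEIGHTS = {
    "correctness": 0.25,
    "completeness": 0.20,
    "adherence": 0.15,
    "actionability": 0.15,
    "efficiency": 0.10,
    "safety": 0.10,
    "consistency": 0.05,
}

def composite(scores: dict[str, float]) -> float:
    """Weighted sum of the 7 dimension scores, rounded to two decimals."""
    return round(sum(scores[d] * w for d, w in WEIGHTS.items()), 2)

# Example: the scores from the sample scorecard below yield 7.85 (grade B).
print(composite({
    "correctness": 8.0, "completeness": 6.0, "adherence": 9.0,
    "actionability": 8.0, "efficiency": 7.0, "safety": 10.0, "consistency": 8.0,
}))
```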
Follow these steps exactly when performing an evaluation:
1. Determine which skill or agent just executed. Check the /judge command argument (manual mode) or the skill that triggered the Stop hook (auto mode).
2. Load the full execution transcript.
3. Look for a domain-specific rubric in skills/judge/rubrics/. The scoring engine resolves rubrics in this order:
   1. Exact match: {skill-name}.md (e.g., code-review.md for the code-review skill)
   2. Category prefix match (e.g., code-review-v2 tries code-review.md)
   3. default.md

   Available rubrics:
   - code-review.md — for code review and engineering skills
   - frontend-design.md — for frontend and UI design skills
   - documentation.md — for writing and documentation skills
   - testing.md — for testing and QA skills
   - security.md — for security audit and hardening skills
   - content-writing.md — for content creation and copywriting
   - data-analysis.md — for data analysis and visualization
   - research.md — for research and exploration skills
   - devops.md — for DevOps and infrastructure skills
   - default.md — universal fallback for unmatched domains
4. For each of the 7 dimensions, assign a score from 1.0 to 10.0 with a justification grounded in transcript evidence.
5. Apply the weights from the table above to calculate the composite score.
6. Map the composite score to a letter grade using the grade scale below.
7. Produce 1-3 actionable recommendations based on the lowest-scoring dimensions. Focus on concrete improvements, not generic advice.
8. Write the structured JSON scorecard to skills/judge/scores/{skill-name}-{timestamp}.json.
| Grade | Composite Range | Description |
|---|---|---|
| A+ | 9.5 - 10.0 | Exceptional |
| A | 9.0 - 9.4 | Excellent |
| A- | 8.5 - 8.9 | Very Good |
| B+ | 8.0 - 8.4 | Good |
| B | 7.5 - 7.9 | Above Average |
| B- | 7.0 - 7.4 | Satisfactory |
| C+ | 6.5 - 6.9 | Adequate |
| C | 6.0 - 6.4 | Below Average |
| C- | 5.5 - 5.9 | Poor |
| D | 4.0 - 5.4 | Failing |
| F | 0.0 - 3.9 | Unacceptable |
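A minimal sketch of that mapping in Python; the band floors follow the table above, though the shipped scorer may implement it differently:

```python
# Maps a composite score to the letter grades in the table above.
# Each band is matched against its lower bound, highest first.
GRADE_BANDS = [
    (9.5, "A+"), (9.0, "A"), (8.5, "A-"),
    (8.0, "B+"), (7.5, "B"), (7.0, "B-"),
    (6.5, "C+"), (6.0, "C"), (5.5, "C-"),
    (4.0, "D"),
]

def grade(composite: float) -> str:
    for floor, letter in GRADE_BANDS:
        if composite >= floor:
            return letter
    return "F"

assert grade(7.85) == "B"
assert grade(3.9) == "F"
```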
Every evaluation produces a visual scorecard rendered in the terminal:
```
╔═══════════════════════════════════════════════════════════╗
║ VERDICT SCORECARD — {skill-name} ║
╠═══════════════════════════════════════════════════════════╣
║ Correctness ████████░░ 8.0/10 {justification} ║
║ Completeness ██████░░░░ 6.0/10 {justification} ║
║ Adherence █████████░ 9.0/10 {justification} ║
║ Actionability ████████░░ 8.0/10 {justification} ║
║ Efficiency ███████░░░ 7.0/10 {justification} ║
║ Safety ██████████ 10.0/10 {justification} ║
║ Consistency ████████░░ 8.0/10 {justification} ║
╠═══════════════════════════════════════════════════════════╣
║ COMPOSITE: {score}/10 — Grade: {grade} ║
║ {critical issues if any} ║
║ {top recommendation} ║
╚═══════════════════════════════════════════════════════════╝
```
The progress bars use filled blocks (█) and empty blocks (░) proportional to the score. Each bar is 10 characters wide (1 block per point).
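A sketch of how such a bar could be rendered, assuming simple rounding to the nearest block:

```python
def bar(score: float, width: int = 10) -> str:
    """Render a 0-10 score as filled (█) and empty (░) blocks, one block per point."""
    filled = round(score * width / 10)
    return "█" * filled + "░" * (width - filled)

print(bar(8.0))   # ████████░░
print(bar(10.0))  # ██████████
```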
When auto mode is enabled, Verdict hooks into the Stop lifecycle event. After any skill or agent finishes execution, the hook script evaluates the transcript against the matching rubric and persists the scorecard to skills/judge/scores/.
Auto mode is controlled by the autoJudge setting in judge-config.json. When set to true, every skill execution is automatically evaluated. When false, only manual /judge invocations trigger evaluation.
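A minimal sketch of how a Stop hook script might honor that flag — the config path and the fail-closed default are assumptions; only the autoJudge key comes from the text above:

```python
# Hypothetical gate at the top of a Stop hook script: skip evaluation
# entirely when autoJudge is disabled in judge-config.json.
import json
import sys
from pathlib import Path

CONFIG = Path("skills/judge/judge-config.json")  # assumed location

def auto_judge_enabled() -> bool:
    try:
        return bool(json.loads(CONFIG.read_text()).get("autoJudge", False))
    except (FileNotFoundError, json.JSONDecodeError):
        return False  # fail closed: no config, no automatic evaluation

if __name__ == "__main__":
    if not auto_judge_enabled():
        sys.exit(0)  # manual /judge remains available regardless
    # ...run the evaluation and persist the scorecard...
```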
Users invoke /judge directly:
- /judge commit — Judge the last /commit execution
- /judge <skill-name> — Judge the last execution of a named skill
- /judge --file <path> — Judge a specific transcript file
Manual mode is always available regardless of the autoJudge setting.
Rubrics are domain-specific scoring guidelines stored in skills/judge/rubrics/. Each rubric refines the 7 base dimensions with domain-appropriate criteria.
| Rubric File | Domain | When Used |
|---|---|---|
| default.md | General | Fallback for any unmatched skill/agent |
| code-review.md | Code & Engineering | Skills that write, modify, or review code |
| frontend-design.md | Frontend & UI | Skills that build or design user interfaces |
| documentation.md | Writing & Documentation | Skills that produce prose, docs, or reports |
| testing.md | Testing & QA | Skills that write or run tests |
| security.md | Security | Skills that audit, scan, or harden security |
| content-writing.md | Content Creation | Skills that create marketing or editorial content |
| data-analysis.md | Data & Analytics | Skills that analyze data or create visualizations |
| research.md | Research & Exploration | Skills that search, explore, or investigate |
| devops.md | DevOps & Infrastructure | Skills that manage infra, deploy, or configure |
| custom-template.md | Template | Copy this to create a new domain-specific rubric |
To add a custom rubric, copy custom-template.md and rename it to match your skill name or domain.
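A sketch of the resolution order described above (exact match, then category prefix, then default.md); the longest-prefix tie-break is an assumption, not necessarily what the shipped engine does:

```python
from pathlib import Path

RUBRIC_DIR = Path("skills/judge/rubrics")

def resolve_rubric(skill_name: str) -> Path:
    """Exact match first, then category-prefix match, then default.md."""
    exact = RUBRIC_DIR / f"{skill_name}.md"
    if exact.exists():
        return exact
    # e.g. code-review-v2 falls back to code-review.md
    candidates = [p for p in RUBRIC_DIR.glob("*.md")
                  if skill_name.startswith(p.stem) and p.stem != "custom-template"]
    if candidates:
        return max(candidates, key=lambda p: len(p.stem))  # most specific prefix wins
    return RUBRIC_DIR / "default.md"
```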
When you are activated as the Verdict evaluator (either via auto hook or /judge command), follow these instructions precisely:
Select the rubric from skills/judge/rubrics/. The scoring engine tries exact match first ({skill-name}.md), then category prefix, then default.md.

Read the entire transcript before scoring.
For each dimension, answer the calibration question for that dimension before assigning a score.
Every score MUST cite specific evidence from the transcript — for example, a safety justification that notes the skill never ran git push and exposed no secrets. Do NOT give vague justifications like "Generally good" or "Seems fine." Every justification must reference concrete evidence.
Use the full range of the scale. Do not cluster all scores around 7-8.
A score of 10.0 should be rare and reserved for truly flawless execution. A score of 5.0 is not "average" — it means the output is barely acceptable and needs significant work.
Write the JSON scorecard to skills/judge/scores/{skill-name}-{YYYYMMDD-HHMMSS}.json with this structure:
```json
{
  "skill": "{skill-name}",
  "timestamp": "{ISO-8601}",
  "dimensions": {
    "correctness": { "score": 8.0, "weight": 0.25, "justification": "..." },
    "completeness": { "score": 6.0, "weight": 0.20, "justification": "..." },
    "adherence": { "score": 9.0, "weight": 0.15, "justification": "..." },
    "actionability": { "score": 8.0, "weight": 0.15, "justification": "..." },
    "efficiency": { "score": 7.0, "weight": 0.10, "justification": "..." },
    "safety": { "score": 10.0, "weight": 0.10, "justification": "..." },
    "consistency": { "score": 8.0, "weight": 0.05, "justification": "..." }
  },
  "composite": 7.85,
  "grade": "B",
  "criticalIssues": [],
  "recommendations": ["..."],
  "rubricUsed": "default.md",
  "transcriptPath": "..."
}
```
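A minimal sketch of persisting that structure under the {skill-name}-{YYYYMMDD-HHMMSS}.json naming scheme (illustrative; not the shipped writer):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

SCORES_DIR = Path("skills/judge/scores")

def write_scorecard(scorecard: dict) -> Path:
    """Write the scorecard to skills/judge/scores/{skill-name}-{YYYYMMDD-HHMMSS}.json."""
    SCORES_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    path = SCORES_DIR / f"{scorecard['skill']}-{stamp}.json"
    path.write_text(json.dumps(scorecard, indent=2))
    return path
```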
Edge cases:
- If no rubric matches, fall back to default.md and note it in the output.
- If the execution ended in an API error, the StopFailure hook in hooks/hooks.json handles this; /judge invoked manually on such a transcript should note the API error rather than score it.
- If the transcript is not native Claude Code JSONL, run score.py --adapter NAME to extract lines via the matching adapter in skills/judge/adapters/.

Any rubric can override the global dimension weights with a sibling <rubric>.weights.json:
```json
{
  "correctness": 0.20,
  "completeness": 0.15,
  "adherence": 0.10,
  "actionability": 0.10,
  "efficiency": 0.05,
  "safety": 0.35,
  "consistency": 0.05
}
```
Sum must equal 1.0 (±1e-6). The shipped security.weights.json
applies 0.35 to safety so security-audit transcripts weight that
dimension above correctness.
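A sketch of loading and validating such an override, assuming the override file sits next to its rubric; only the ±1e-6 tolerance and the default weights come from the text above:

```python
import json
from pathlib import Path

DEFAULT_WEIGHTS = {
    "correctness": 0.25, "completeness": 0.20, "adherence": 0.15,
    "actionability": 0.15, "efficiency": 0.10, "safety": 0.10, "consistency": 0.05,
}

def load_weights(rubric_path: Path) -> dict[str, float]:
    """Use <rubric>.weights.json when present; otherwise the global defaults."""
    override = rubric_path.parent / (rubric_path.stem + ".weights.json")
    if not override.exists():
        return DEFAULT_WEIGHTS
    weights = json.loads(override.read_text())
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError(f"{override.name}: weights must sum to 1.0")
    return weights
```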
score.py auto-detects the model from JSONL transcripts via the
standard "model": "<id>" field. The efficiency analyzer scales its
long-transcript thresholds (2000 and 1000 lines) by a per-model
baseline from judge-config.json.tokenizer_baselines. Ships with
claude-opus-4-7: 1.35 to absorb Opus 4.7's new tokenizer, which
produces up to 35% more tokens than Opus 4.6 for the same text. Add
an entry per model you care about, or override the default key.
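A sketch of that scaling, assuming tokenizer_baselines maps model ids to multipliers with an optional default key; the threshold names here are illustrative:

```python
import json
from pathlib import Path

# Long-transcript thresholds from the efficiency analyzer, in lines.
BASE_THRESHOLDS = {"hard": 2000, "soft": 1000}

def scaled_thresholds(model_id: str,
                      config_path: Path = Path("skills/judge/judge-config.json")) -> dict:
    baselines = json.loads(config_path.read_text()).get("tokenizer_baselines", {})
    # e.g. {"claude-opus-4-7": 1.35, "default": 1.0}
    factor = baselines.get(model_id, baselines.get("default", 1.0))
    return {name: int(limit * factor) for name, limit in BASE_THRESHOLDS.items()}
```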
Use --adapter NAME to score transcripts from other ecosystems:
- claude-code — native JSONL (default).
- cowork — Claude Cowork sessions.
- openai-compatible — Cursor, Continue, and any tool using the standard OpenAI chat-completion format.
- codex — OpenAI Codex CLI (markdown sessions plus JSON sidecars).
- cursor, continue — aliases for openai-compatible.

The /judge --against command delegates to skills/judge/scripts/against.py. It picks two scorecards for the same skill, renders a Unicode delta table, and exits 2 on composite regression. Useful for CI gates and manual "did this change help" checks.
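A minimal sketch of the regression check itself — comparing the two most recent scorecards for a skill and exiting 2 on a composite drop; against.py's real selection logic and delta table are richer:

```python
import json
import sys
from pathlib import Path

def newest_two(skill: str, scores_dir: Path = Path("skills/judge/scores")) -> tuple[Path, Path]:
    # Timestamped filenames sort lexicographically; naive prefix match is good enough for a sketch.
    runs = sorted(scores_dir.glob(f"{skill}-*.json"))
    if len(runs) < 2:
        sys.exit(f"need at least two scorecards for {skill}")
    return runs[-2], runs[-1]

def check(skill: str) -> None:
    before_path, after_path = newest_two(skill)
    before = json.loads(before_path.read_text())["composite"]
    after = json.loads(after_path.read_text())["composite"]
    print(f"{skill}: {before} -> {after} ({after - before:+.2f})")
    if after < before:
        sys.exit(2)  # composite regression, suitable as a CI gate

if __name__ == "__main__":
    check(sys.argv[1])
```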
- python3 scripts/validate_marketplace.py — stdlib-only schema check for .claude-plugin/marketplace.json against the April 2026 spec.
- python3 scripts/benchmark_pack.py — runs the curated corpus in benchmarks/ and asserts each case satisfies its expected bounds. Wire into CI to catch heuristic regressions.

(The opt-in LLM second-opinion analyzer at analyzers/llm_judge.py
mirrors the cross-family critic pattern reaffirmed by GitHub Copilot
CLI's Rubber Duck on 2026-05-07. Off by default; enable via
judge-config.json.llm_second_opinion.enabled.)