agentic-eval

Evaluate and improve AI-generated output with explicit rubrics, reflection loops, and stop conditions. Use when building self-critique workflows, evaluator-optimizer pipelines, or acceptance gates for code, docs, analysis, or plans.

Install command

npx skills add https://github.com/PracticalSwan/agent-skills --skill agentic-eval

Copy and paste this command into Claude Code to install the skill

Source

PracticalSwan/agent-skills

Stars: 3

Forks: 1

Updated: April 24, 2026 at 17:25

SKILL.md

name: agentic-eval
version: 1.2
last_updated: "2026-04-25T00:00:00.000Z"
tags: ["agentic","eval","agents","delegation","workflow"]
description: Evaluate and improve AI-generated output with explicit rubrics, reflection loops, and stop conditions. Use when building self-critique workflows, evaluator-optimizer pipelines, or acceptance gates for code, docs, analysis, or plans.

Agentic Eval

Use structured evaluation loops to improve important outputs before you call them done.

Where the host supports them, use native parallel subagent dispatch and 200k+ token context windows.

When to Use

Treat each item below as a symptom -> action trigger: when one matches, apply this skill, then verify the result with the Verification Protocol.

A task is quality-critical and a single pass is too risky.
You need repeatable acceptance criteria for code, docs, analysis, or plans.
You want a reviewer or judge step that is separate from generation.
You need to compare multiple candidate outputs against the same rubric.

Core Loop

1. Define the artifact being judged.
2. Define a rubric with weighted dimensions.
3. Generate or collect the candidate output.
4. Evaluate it against the rubric.
5. Convert the feedback into concrete changes.
6. Re-run until the score crosses the threshold or the iteration budget is exhausted.
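
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own implementation: `run_eval_loop`, `toy_generate`, and `toy_evaluate` are hypothetical names, and the toy functions stand in for a real generator and judge so the sketch runs without a model. The default threshold of 0.80 and budget of 3 iterations are assumptions.

```python
from typing import Callable, Tuple


def run_eval_loop(
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], Tuple[float, str]],
    threshold: float = 0.80,
    max_iterations: int = 3,
) -> Tuple[str, float, int]:
    """Generate, score, and revise until the threshold or the budget is hit."""
    candidate, feedback = "", ""
    best_output, best_score = "", float("-inf")
    for iteration in range(1, max_iterations + 1):
        candidate = generate(candidate, feedback)   # produce or revise a candidate
        score, feedback = evaluate(candidate)       # judge it against the rubric
        if score > best_score:                      # remember the best attempt so far
            best_output, best_score = candidate, score
        if score >= threshold:                      # threshold crossed: stop early
            return best_output, best_score, iteration
    return best_output, best_score, max_iterations  # iteration budget exhausted


# Hypothetical stand-ins: each revision appends the feedback token to the draft.
def toy_generate(previous: str, feedback: str) -> str:
    return previous + feedback if previous else "draft"


def toy_evaluate(candidate: str) -> Tuple[float, str]:
    score = min(1.0, 0.5 + 0.2 * candidate.count("!"))
    return score, "!"


output, score, iterations = run_eval_loop(toy_generate, toy_evaluate)
print(output, round(score, 2), iterations)
```

Returning the best candidate seen, rather than the last one, is a design choice of this sketch: it guards against a late revision that scores lower than an earlier one.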

Evaluation Patterns

1. Self-Reflection

Use the same agent to critique its own work when the task is moderate risk and the rubric is precise.

Best for:

formatting checks
completeness checks
first-pass code or doc refinement

2. Evaluator-Optimizer Split

Separate generation from evaluation when you want clearer responsibilities.

Best for:

high-value outputs
rubric-based acceptance checks
comparing multiple candidates fairly
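
One way to compare candidates fairly is to pass every candidate through the same evaluator function and sort by score. The sketch below assumes a hypothetical `rank_candidates` helper and a toy evaluator, `section_coverage`, that only checks which required sections appear; a real evaluator would apply the full rubric.

```python
from typing import Callable, Dict, List, Tuple


def rank_candidates(
    candidates: Dict[str, str],
    evaluator: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Score every candidate with the same evaluator, best first."""
    scored = [(name, evaluator(text)) for name, text in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical rubric stand-in: fraction of required sections present.
REQUIRED_SECTIONS = ("Summary", "Risks", "Rollback")


def section_coverage(text: str) -> float:
    present = sum(section in text for section in REQUIRED_SECTIONS)
    return present / len(REQUIRED_SECTIONS)


candidates = {
    "plan_a": "Summary\nRisks\nRollback",
    "plan_b": "Summary\nRisks",
    "plan_c": "Summary",
}
ranking = rank_candidates(candidates, section_coverage)
print(ranking[0][0])
```

The evaluator sees only the candidate text, never the generator's prompt or reasoning, which keeps the two roles separate.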

3. Evidence-Based Evaluation

Back the score with tests, logs, benchmarks, or direct verification.

Best for:

code generation
migration plans
architecture recommendations
security or compliance review
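
As a rough sketch of collecting evidence, a check command can be run and its exit code and output stored next to the score. The `Evidence` record and `collect_evidence` helper are hypothetical names; the command shown is a trivial Python one-liner so the example runs anywhere, and in practice it would be a project's own test or benchmark command.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    command: List[str]
    passed: bool
    output: str


def collect_evidence(command: List[str], timeout: float = 60.0) -> Evidence:
    """Run a verification command and keep its result as evidence."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return Evidence(
        command=command,
        passed=proc.returncode == 0,               # exit code 0 counts as a pass
        output=(proc.stdout + proc.stderr).strip(),
    )


# Placeholder check; replace with a real test or benchmark command.
evidence = collect_evidence([sys.executable, "-c", "print('ok')"])
print(evidence.passed, evidence.output)
```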

Rubric Design Rules

Keep dimensions few and concrete.
Weight the business-critical dimension highest.
Define what a passing score means before evaluation starts.
Require written evidence for any failing dimension.
Stop when you are no longer learning new fixes.

Suggested dimensions:

correctness
completeness
clarity
maintainability
risk management
evidence quality
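
A weighted rubric score can be computed as the sum of each dimension's weight times its normalized score. The sketch below assumes weights that sum to 1.0 and integer scores on a 0 to 5 scale; the specific weights and scores are illustrative values, not ones prescribed by this skill.

```python
from typing import Dict


def weighted_score(
    weights: Dict[str, float],
    scores: Dict[str, int],
    max_score: int = 5,
) -> float:
    """Return the weighted rubric score in the range 0.0 to 1.0."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weight * scores[dim] / max_score for dim, weight in weights.items())


# Illustrative rubric: the business-critical dimension carries the largest weight.
weights = {
    "correctness": 0.40,
    "completeness": 0.25,
    "clarity": 0.20,
    "evidence quality": 0.15,
}
scores = {"correctness": 4, "completeness": 5, "clarity": 4, "evidence quality": 4}

overall = weighted_score(weights, scores)
print(f"{overall:.2f}")
```

Rejecting weights that do not sum to 1.0 keeps scores comparable across iterations and candidates.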

Stop Conditions

Stop the loop when one of these becomes true:

the overall threshold is met
the failing dimensions are now low-impact only
tests or verification evidence already prove the output is acceptable
the score has stopped improving and more iterations are likely noise
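
The stop conditions above can be expressed as one function that returns the reason for stopping, or `None` to keep iterating. This is a sketch under assumptions: `stop_reason` is a hypothetical name, impact levels are plain strings such as "low" or "high", and a plateau is taken to mean that the last `patience` gains were each below `min_gain`.

```python
from typing import List, Optional, Sequence


def stop_reason(
    score_history: List[float],
    threshold: float = 0.80,
    failing_impacts: Sequence[str] = (),
    evidence_passed: bool = False,
    min_gain: float = 0.02,
    patience: int = 2,
) -> Optional[str]:
    """Return why the loop should stop, or None to continue."""
    if not score_history:
        return None
    if score_history[-1] >= threshold:
        return "threshold met"
    if evidence_passed:
        return "verification evidence accepted"
    if failing_impacts and all(impact == "low" for impact in failing_impacts):
        return "only low-impact failures remain"
    if len(score_history) > patience:
        recent = score_history[-(patience + 1):]
        gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
        if all(gain < min_gain for gain in gains):   # no meaningful improvement lately
            return "score plateau"
    return None


print(stop_reason([0.60, 0.61, 0.61]))
```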

Output Format

Use a structure like this when reporting an evaluation:

## Evaluation Summary

### Artifact
- Short description of what was evaluated

### Rubric Results
| Dimension | Weight | Score | Notes |
|-----------|--------|-------|-------|
| correctness | 0.40 | 4/5 | Main logic is sound |

### Overall
- Weighted score: 0.84
- Threshold: 0.80
- Result: PASS

### Required Improvements
- Tighten edge-case handling around ...
- Add verification evidence for ...
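
If reports are produced programmatically, a small renderer keeps the structure identical across evaluations. The helper below is a hypothetical sketch: `render_report` and its parameters are not part of the skill, the weighted score is passed in rather than computed, and scores are assumed to be on a 0 to 5 scale.

```python
from typing import List, Tuple

Row = Tuple[str, float, int, str]  # dimension, weight, score out of 5, notes


def render_report(
    artifact: str,
    rows: List[Row],
    weighted: float,
    threshold: float,
    improvements: List[str],
) -> str:
    """Render an evaluation summary in the structure shown above."""
    result = "PASS" if weighted >= threshold else "FAIL"
    lines = [
        "## Evaluation Summary",
        "",
        "### Artifact",
        f"- {artifact}",
        "",
        "### Rubric Results",
        "| Dimension | Weight | Score | Notes |",
        "|-----------|--------|-------|-------|",
    ]
    for dimension, weight, score, notes in rows:
        lines.append(f"| {dimension} | {weight:.2f} | {score}/5 | {notes} |")
    lines += [
        "",
        "### Overall",
        f"- Weighted score: {weighted:.2f}",
        f"- Threshold: {threshold:.2f}",
        f"- Result: {result}",
        "",
        "### Required Improvements",
    ]
    lines += [f"- {item}" for item in improvements]
    return "\n".join(lines)


report = render_report(
    "Example artifact",
    [("correctness", 0.40, 4, "Main logic is sound")],
    weighted=0.84,
    threshold=0.80,
    improvements=["Add verification evidence for the example artifact"],
)
print(report)
```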

Self-Verification Phase-Gate Questions

Before you claim the evaluation is complete, the evaluating agent must ask:

Did I define the rubric, threshold, and evidence sources explicitly enough for another agent to rerun the check?
Did every failing dimension produce a concrete improvement action instead of a vague critique?
Did I stop because the result is acceptable, or only because I ran out of patience?
Can I point to tests, logs, screenshots, or scorecards that support the final PASS or FAIL decision?

Anti-Patterns

Delegating or evaluating without a scoped success condition: The output becomes hard to review and easy to overbuild.
Skipping the evidence step: A workflow that cannot be re-checked quickly is not ready for handoff.
Bundling unrelated subtasks together: It creates noisy prompts, weaker ownership, and avoidable integration risk.

Verification Protocol

Before claiming "skill applied successfully":

Pass/fail: The Agentic Eval workflow names the agent boundary, delegated scope, and expected return artifact.
Pass/fail: Context passed to helpers is minimal, task-local, and free of hidden expected answers.
Pass/fail: Results are integrated only after evidence, diffs, or citations are checked by the controller.
Pressure-test scenario: Run the workflow on two similar tasks that must not share assumptions or leaked context.
Success metric: Zero context leakage; every delegated output is independently reviewable.

Scripts And References

Best Practices

Keep the rubric stable across iterations so the score means something.
Prefer evidence-backed criteria over taste-based criteria.
Store the final rubric and score with the task when the output matters later.
Pair with tests or direct verification whenever the artifact can be executed.
If you use an LLM judge, constrain the output format so it can be parsed and compared.
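
To illustrate the last point, a judge can be asked to return JSON in a fixed shape that is validated before use. The schema here (a list of objects with `dimension`, `score`, and `evidence` keys) is an assumption made for this sketch, not a format defined by the skill, and `parse_judge_output` is a hypothetical helper.

```python
import json
from typing import Dict

REQUIRED_KEYS = {"dimension", "score", "evidence"}


def parse_judge_output(raw: str, max_score: int = 5) -> Dict[str, dict]:
    """Parse and validate judge output; raise ValueError on any deviation."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, list):
        raise ValueError("judge output must be a JSON list")
    results: Dict[str, dict] = {}
    for item in data:
        if not isinstance(item, dict) or set(item) != REQUIRED_KEYS:
            raise ValueError(f"unexpected entry: {item!r}")
        score = item["score"]
        is_int = isinstance(score, int) and not isinstance(score, bool)
        if not is_int or not 0 <= score <= max_score:
            raise ValueError(f"invalid score: {score!r}")
        results[item["dimension"]] = {"score": score, "evidence": item["evidence"]}
    return results


raw_output = '[{"dimension": "correctness", "score": 4, "evidence": "unit tests pass"}]'
parsed = parse_judge_output(raw_output)
print(parsed["correctness"]["score"])
```

Rejecting malformed output outright, instead of repairing it, keeps scores comparable between runs.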

Cross-Client Portability

This skill is written to stay usable across GitHub Copilot, Claude Code, Codex, and Gemini CLI.

GitHub Copilot: keep the folder in a Copilot-visible skill or plugin path, or wrap the workflow as project instructions if the host does not support portable skill folders directly.
Claude Code: keep the folder in a local skills directory or a compatible plugin or marketplace source.
Codex: install or sync the folder into $CODEX_HOME/skills/<skill-name> and restart Codex after major changes.
Gemini CLI: this repository generates a project command named /skills:agentic-eval from this skill. Rebuild commands with python scripts/export-gemini-skill.py agentic-eval and then run /commands reload inside Gemini CLI.

MCP Availability And Fallback

Preferred MCP Server: None required

Fallback prompt: "Use the Agentic Eval skill without MCP. Rely on the local SKILL.md, bundled references or scripts, and manual verification. Show the exact commands, evidence, and final checks you used before concluding."
If the current host does not expose a matching server, use the bundled references, scripts, native toolchain, and manual workflow already described in this skill.
Treat direct local verification, rendered output, logs, tests, or screenshots as the fallback evidence path before completion.

Related Skills

agent-task-mapping: Use it when the workflow also needs task-to-agent routing decisions.
custom-agent-usage: Use it when the workflow also needs loading and invoking custom agent definitions safely.
subagent-delegation: Use it when the workflow also needs safe, scoped delegation to helper agents.
subagent-driven-development: Use it when the workflow also needs plan-driven implementation with reviewer loops.
