agentic-eval

Design and implement evaluation loops for AI agents, including reflection, evaluator-optimizer patterns, rubric scoring, LLM-as-judge review, test-driven refinement, convergence checks, and iteration logging.

Install command

npx skills add https://github.com/MarieLynneBlock/arcanum-artifex --skill agentic-eval

Copy and paste this command into Claude Code to install the skill

Source

MarieLynneBlock/arcanum-artifex

Stars: 2

Forks: 0

Updated: May 20, 2026 at 13:27

SKILL.md


name	agentic-eval
description	Design and implement evaluation loops for AI agents, including reflection, evaluator-optimizer patterns, rubric scoring, LLM-as-judge review, test-driven refinement, convergence checks, and iteration logging.
metadata	{"skill-author":"Marie-Lynne Block"}

Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘

When to Use

Quality-critical generation: Code, reports, analysis requiring high accuracy
Tasks with clear evaluation criteria: Defined success metrics exist
Content requiring specific standards: Style guides, compliance, formatting
Iterative improvement workflows: Drafts, plans, prompts, or code need critique and refinement before delivery

Do Not Use This Skill For

Formal compliance claims, certification, or audit sign-off without approved organisational evidence
Production monitoring design, runtime governance, access control, or policy enforcement
Broad statistical evaluation methodology, benchmark design, or model selection studies
Security, privacy, or safety review where evaluation loops are not the central concern

Expected Outputs

When applying this skill, produce the assets needed to run and inspect an evaluation loop:

evaluation criteria or rubric
generator, evaluator, and optimiser responsibilities
iteration limit and convergence policy
structured evaluator output format
failure handling for malformed evaluations or non-improving outputs
logging fields for debugging and review
final recommendation or refined output with evaluation history summary

Treat code blocks in this skill as adaptable Python-style skeletons. Replace llm, run_tests, model settings, and logging functions with project-specific implementations.
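
One way to pin down the "structured evaluator output format" item is a small schema with a strict parser. The sketch below is illustrative only: the `Evaluation` and `CriterionResult` names, the `criteria` key, and the 0 to 1 score range are assumptions chosen for the example, not something the skill prescribes.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CriterionResult:
    status: str    # "PASS" or "FAIL"
    feedback: str  # evaluator's reason or evidence


@dataclass
class Evaluation:
    overall_score: float  # assumed range: 0.0 to 1.0
    criteria: dict[str, CriterionResult] = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        # An evaluation with no criteria is never treated as a pass.
        return bool(self.criteria) and all(
            c.status == "PASS" for c in self.criteria.values()
        )


def parse_evaluation(raw: str) -> Evaluation:
    """Parse evaluator JSON; raise ValueError on any schema violation."""
    try:
        data = json.loads(raw)
        score = float(data["overall_score"])
        criteria = {
            name: CriterionResult(status=item["status"], feedback=item["feedback"])
            for name, item in data["criteria"].items()
        }
    except (KeyError, TypeError, AttributeError, ValueError) as exc:
        raise ValueError(f"malformed evaluation: {exc}") from exc
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"overall_score out of range: {score}")
    if any(c.status not in ("PASS", "FAIL") for c in criteria.values()):
        raise ValueError("status must be PASS or FAIL")
    return Evaluation(overall_score=score, criteria=criteria)
```

Raising on every schema violation keeps the caller in control of what a malformed evaluation means, so a bad response cannot be mistaken for a pass.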

Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

import json


def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")
    
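    for _ in range(max_iterations):
        # Self-critique, with the expected JSON shape spelled out for the evaluator
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each criterion PASS or FAIL with feedback. Return only JSON shaped as:
        {{"<criterion>": {{"status": "PASS" or "FAIL", "feedback": "..."}}}}
        """)
        
        try:
            critique_data = json.loads(critique)
            if not critique_data:
                raise ValueError("empty critique")
            failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] != "PASS"}
        except (ValueError, KeyError, TypeError, AttributeError):
            # A malformed critique counts as a failed evaluation, never as a pass
            failed = {"evaluation": "Critique could not be parsed; re-check every criterion."}
        
        if not failed:
            return output
        
        # Refine based on critique
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")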
    
    return output

Key insight: Use structured JSON output for reliable parsing of critique results, and handle parsing failures as evaluation failures rather than silently accepting the draft.

Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

import json


class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold
    
    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")
    
    def evaluate(self, output: str, task: str) -> dict:
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))
    
    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")
    
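    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        # Track the best evaluated candidate so a regression never replaces it
        best_output, best_score = output, float("-inf")
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            current_score = evaluation["overall_score"]
            if current_score <= best_score:
                break  # no improvement: stop and keep the best candidate
            best_output, best_score = output, current_score
            if current_score >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        # Only candidates that were actually scored are returned
        return best_output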

Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")
        
        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code
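
The skill leaves `run_tests` to the project. A minimal sketch, assuming pytest is installed and that the generated tests import from a module named `solution` (both are assumptions made for this example), could write the two files to a temporary directory and run pytest in a subprocess. It does not sandbox the generated code, so untrusted code still needs separate isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(code: str, tests: str, timeout: int = 60) -> dict:
    """Run pytest on generated code and return {"success": bool, "error": str}."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical layout: tests are expected to do `from solution import ...`.
        Path(tmp, "solution.py").write_text(code, encoding="utf-8")
        Path(tmp, "test_solution.py").write_text(tests, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-m", "pytest", "-q", "test_solution.py"],
                cwd=tmp,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return {"success": False, "error": f"tests timed out after {timeout}s"}
    if proc.returncode == 0:
        return {"success": True, "error": ""}
    # Any non-zero exit (failing tests, collection errors, missing pytest) is a failure.
    return {"success": False, "error": proc.stdout + proc.stderr}
```

The timeout matters because generated code can loop forever; treating a timeout as an ordinary failure feeds it back into the same fix loop.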

Evaluator Reliability

Evaluation loops are only useful when the evaluator is harder to fool than the generator. For high-risk or quality-critical work:

Anchor the evaluator with a concrete rubric and examples of passing and failing outputs.
Ask for evidence, failure reasons, and uncertainty, not just a score.
Randomise or blind comparison order when comparing alternatives to reduce position bias.
Repeat judging or use multiple evaluators when scores are close or the decision is consequential.
Escalate to human review when evaluator outputs conflict, confidence is low, or policy-sensitive content is involved.
Track whether scores improve over iterations; stop when improvement stalls to avoid overfitting to the evaluator.
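
Two of these points, randomised comparison order and repeated judging, can be combined in one helper. The sketch below assumes a hypothetical `judge` callable that receives two candidates in presentation order and answers `"first"` or `"second"`; that signature and the tie-means-escalate rule are assumptions for illustration.

```python
import random
from collections import Counter
from typing import Callable, Optional

# Hypothetical judge signature: (first, second, criteria) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def judge_pair(
    judge: Judge,
    output_a: str,
    output_b: str,
    criteria: str,
    rounds: int = 4,
    seed: Optional[int] = None,
) -> dict:
    """Compare A and B over several rounds with randomised presentation order."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(rounds):
        swapped = rng.random() < 0.5
        first, second = (output_b, output_a) if swapped else (output_a, output_b)
        verdict = judge(first, second, criteria)
        if verdict not in ("first", "second"):
            continue  # unusable verdict: count it as an abstention
        # Map the positional verdict back to the original A/B labels.
        winner_is_a = (verdict == "first") != swapped
        votes["A" if winner_is_a else "B"] += 1
    a, b = votes["A"], votes["B"]
    if a == b:
        decision = "escalate"  # tie or no usable verdicts: route to human review
    else:
        decision = "A" if a > b else "B"
    return {"decision": decision, "votes": dict(votes)}
```

A judge that always prefers whichever candidate it sees first will tend to split its votes across A and B under this scheme, which surfaces the bias as a close result instead of a confident win.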

Evaluation Strategies

Outcome-Based

Evaluate whether output achieves the expected result.

def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")

LLM-as-Judge

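Use an LLM to compare and rank outputs.

def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    # Both candidates must appear in the prompt, otherwise the judge has nothing to compare
    return llm(
        f"Compare outputs A and B for {criteria}. Which is better and why?\n"
        f"Output A: {output_a}\nOutput B: {output_b}"
    )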

Rubric-Based

Score outputs against weighted dimensions.

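import json

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3}
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    # Ask for one 1-5 score per dimension, returned as a JSON object keyed by dimension name
    scores = json.loads(llm(f"Rate 1-5 for each dimension: {list(rubric.keys())}\nReturn JSON mapping each dimension to its score.\nOutput: {output}"))
    # Weighted mean on the 1-5 scale divided by 5; with weights summing to 1 the result lies in [0.2, 1.0]
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5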

Best Practices

Practice	Rationale
Clear criteria	Define specific, measurable evaluation criteria upfront
Iteration limits	Set max iterations (3-5) to prevent infinite loops
Convergence check	Stop if output score isn't improving between iterations
Log history	Keep full trajectory for debugging and analysis
Structured output	Use JSON for reliable parsing of evaluation results
Evaluator calibration	Use examples, rubrics, and evidence requirements to reduce judge drift
Human escalation	Route low-confidence or high-impact decisions to a reviewer
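
The iteration-limit, convergence-check, and log-history practices can share one small record structure. This is a sketch under stated assumptions: `RunLog`, `IterationRecord`, the `min_delta` default of 0.01, and the stop-reason strings are all hypothetical names and values, not part of the skill.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class IterationRecord:
    iteration: int
    score: float
    output: str
    feedback: str = ""


@dataclass
class RunLog:
    threshold: float
    min_delta: float = 0.01  # smallest gain that counts as improvement (assumed value)
    records: list[IterationRecord] = field(default_factory=list)

    def add(self, score: float, output: str, feedback: str = "") -> Optional[str]:
        """Record one iteration and return a stop reason, or None to continue."""
        previous_best = max((r.score for r in self.records), default=None)
        self.records.append(
            IterationRecord(len(self.records) + 1, score, output, feedback)
        )
        if score >= self.threshold:
            return "threshold_met"
        if previous_best is not None and score < previous_best + self.min_delta:
            return "no_improvement"
        return None

    def best(self) -> IterationRecord:
        # On equal scores, the earliest candidate wins.
        return max(self.records, key=lambda r: r.score)

    def to_jsonl(self) -> str:
        """Serialise the full trajectory, one JSON object per line."""
        return "\n".join(json.dumps(asdict(r)) for r in self.records)
```

A loop would call `add` after each evaluation, stop on the first non-`None` reason (or at the iteration limit), and return `best().output` together with that reason.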

Failure Handling

Plan for evaluation failures before running the loop:

If evaluator JSON is malformed, retry once with the schema and then fail closed with the raw response logged.
If the evaluator gives no evidence, treat the score as unreliable and request evidence before refining.
If output quality stops improving, return the best-scored candidate and include the stopping reason.
If evaluation criteria conflict, pause and ask for priority order rather than optimising against incompatible goals.
If generated tests or checks are themselves suspect, review or regenerate the tests before trusting the score.
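
The first rule above (retry once with the schema, then fail closed with the raw response logged) might be sketched as follows. The `call_evaluator` callable, the `EvaluationError` class, the logger name, and the `required_keys` default are assumptions introduced for the example.

```python
import json
import logging
from typing import Callable

logger = logging.getLogger("agentic_eval")  # hypothetical logger name


class EvaluationError(RuntimeError):
    """Raised when the evaluator output is still unusable after one retry."""


def evaluate_fail_closed(
    call_evaluator: Callable[[str], str],
    prompt: str,
    schema_hint: str,
    required_keys: tuple[str, ...] = ("overall_score",),
) -> dict:
    """Parse evaluator JSON, retrying once with the schema before failing closed."""
    raw = call_evaluator(prompt)
    for attempt in (1, 2):
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = [k for k in required_keys if k not in data]
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return data
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            logger.warning("evaluation attempt %d malformed: %s; raw=%r", attempt, exc, raw)
            if attempt == 2:
                # Fail closed: never treat an unparseable evaluation as a pass.
                raise EvaluationError(f"unusable evaluator output: {raw!r}") from exc
            raw = call_evaluator(
                f"{prompt}\n\nReturn only JSON matching this schema:\n{schema_hint}"
            )
    raise AssertionError("unreachable")
```

Raising instead of returning a default score forces the caller to decide what happens next, for example returning the best earlier candidate or escalating to a reviewer.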

Quick Start Checklist

## Evaluation Implementation Checklist

### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)

### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop

### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully
- [ ] Escalate low-confidence or high-impact decisions to human review