
evaluation-testing

Use this skill to design and execute evaluation frameworks for LLM agents, implement trajectory testing, deploy LLM-as-judge patterns, build automated eval pipelines, and integrate agent testing into CI/CD workflows. This skill enforces: structured behavioral assertions, trajectory-vs-outcome evaluation matrices, verifier agent topologies, regression detection baselines, hallucination scoring engines, and benchmark dataset lifecycle management. Do NOT use for: unit testing traditional software, load/performance testing infrastructure, or model fine-tuning data preparation.


Updated June 5, 2026 at 09:02

Source

j4flmao

j4flmao/agent-skills



name	evaluation-testing
description	Use this skill to design and execute evaluation frameworks for LLM agents, implement trajectory testing, deploy LLM-as-judge patterns, build automated eval pipelines, and integrate agent testing into CI/CD workflows. This skill enforces: structured behavioral assertions, trajectory-vs-outcome evaluation matrices, verifier agent topologies, regression detection baselines, hallucination scoring engines, and benchmark dataset lifecycle management. Do NOT use for: unit testing traditional software, load/performance testing infrastructure, or model fine-tuning data preparation.
version	2.0.0
author	j4flmao
license	MIT
type	skill
compatibility	{"claude-code":true,"cursor":true,"codex":true,"windsurf":true}
tags	["harness-engineering","evaluation-testing","agent-frameworks","llm-judge","ci-cd","benchmarks"]

Evaluation Testing Skill

Purpose

Provides a production-grade evaluation and testing framework for LLM agent systems. Enables teams to measure agent correctness across behavioral dimensions, detect regressions in multi-step reasoning chains, score hallucination severity, and embed automated evaluation gates into deployment pipelines. This system handles the fundamental non-determinism of LLM outputs by combining trajectory-level analysis, outcome-level assertions, and LLM-as-judge consensus protocols into a unified testing harness.

Core Principles

Trajectory Over Outcome: Evaluate the reasoning path, not just the final answer. An agent that reaches the correct output through flawed reasoning is a latent failure.
Statistical Significance Over Single Runs: Agent evaluations must use repeated sampling ($N \ge 5$) and report confidence intervals, never single-shot pass/fail assertions.
Human-Aligned Judging: LLM-as-judge evaluators must be calibrated against human preference baselines using Cohen's Kappa ($\kappa \ge 0.60$) before deployment.
Regression Baselines Are Sacred: Every eval suite must maintain versioned baseline snapshots. Regressions are detected against these baselines, not arbitrary thresholds.
Eval Datasets Are Living Assets: Test datasets must be versioned, deduplicated, stratified by difficulty, and refreshed on a scheduled cadence to prevent benchmark overfitting.
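
The calibration principle above can be illustrated with a minimal sketch of Cohen's Kappa computed from paired human and judge labels. The function name `cohens_kappa` and the sample labels are hypothetical; only the $\kappa \ge 0.60$ threshold comes from this skill.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical labels: 1 = acceptable, 0 = not acceptable.
human = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
judge = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
kappa = cohens_kappa(human, judge)   # about 0.583
judge_is_calibrated = kappa >= 0.60  # False: recalibrate before deployment
```

Raw agreement here is 80%, yet kappa falls just short of the threshold because half of that agreement is expected by chance.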

Agent Protocol

Triggers

Use this skill when processing:

Agent output quality assessments requiring structured scoring rubrics.
Multi-step trajectory evaluations for tool-calling or chain-of-thought agents.
CI/CD pipeline gates that must block deployments on eval regressions.
Hallucination detection across factual claims, code generation, or document summarization.
Benchmark design for comparing agent architectures or prompt strategies.
Dataset curation for evaluation test suites.

Input Context Required

Agent Outputs: Raw completions, tool call traces, or conversation transcripts to evaluate.
Evaluation Rubric: A structured scoring guide defining dimensions (correctness, helpfulness, safety, coherence).
Baseline Metrics: Historical eval scores from the previous accepted version.
Ground Truth Dataset: Labeled examples with expected outputs or acceptable output ranges.
Target Significance Level ($\alpha$): The statistical significance threshold (typically 0.05, corresponding to a 95% confidence level).

Output Artifact

Evaluation Report: JSON document containing per-dimension scores, aggregate metrics, and statistical tests.
Regression Verdict: Binary pass/fail with confidence intervals and effect sizes.
Hallucination Audit Log: Itemized list of factual claims with verification status and source attributions.

Response Formats

For programmatic integration, evaluation results must be delivered in this format:

{
  "eval_run_id": "eval-2026-06-04-001",
  "model_version": "agent-v2.3.1",
  "dimensions": {
    "correctness": { "mean": 0.87, "ci_lower": 0.82, "ci_upper": 0.92, "n": 200 },
    "helpfulness": { "mean": 0.91, "ci_lower": 0.87, "ci_upper": 0.95, "n": 200 },
    "safety": { "mean": 0.99, "ci_lower": 0.97, "ci_upper": 1.00, "n": 200 }
  },
  "regression_detected": false,
  "hallucination_rate": 0.034,
  "verdict": "PASS",
  "baseline_comparison": {
    "previous_version": "agent-v2.3.0",
    "delta_correctness": 0.02,
    "p_value": 0.12
  }
}
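
A consumer of this report may want to sanity-check it before acting on the verdict. The sketch below is one possible validator, not part of the skill itself; `validate_report` and the specific checks are assumptions. Note that strict JSON parsers reject numbers written with a leading `+`.

```python
import json

REQUIRED_KEYS = {
    "eval_run_id", "model_version", "dimensions", "regression_detected",
    "hallucination_rate", "verdict", "baseline_comparison",
}

def validate_report(raw: str) -> list[str]:
    """Return a list of problems found in a serialized evaluation report."""
    try:
        report = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(report, dict):
        return ["report must be a JSON object"]
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - report.keys())]
    for name, dim in report.get("dimensions", {}).items():
        lower, mean, upper = dim.get("ci_lower"), dim.get("mean"), dim.get("ci_upper")
        if None in (lower, mean, upper) or not lower <= mean <= upper:
            problems.append(f"{name}: mean must lie inside [ci_lower, ci_upper]")
    if report.get("verdict") not in ("PASS", "FAIL"):
        problems.append("verdict must be PASS or FAIL")
    return problems
```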

Decision Matrix for Evaluation Strategy

What kind of agent output are you evaluating?
├── Single-Turn Q&A / Factual Responses
│   ├── Ground truth available?
│   │   ├── Yes → Exact Match / F1 / BLEU + Hallucination Scoring
│   │   └── No  → LLM-as-Judge (Pointwise) + Human Calibration
│   │
├── Multi-Step Tool-Calling Chains
│   ├── Trajectory matters?
│   │   ├── Yes → Trajectory Evaluation (step-level scoring)
│   │   └── No  → Outcome-Only Evaluation (final state diff)
│   │
├── Code Generation
│   ├── Executable test cases available?
│   │   ├── Yes → Execution-Based Pass@k Scoring
│   │   └── No  → LLM-as-Judge (Pairwise Comparison)
│   │
└── Long-Form Content / Summarization
    ├── Reference summaries available?
    │   ├── Yes → ROUGE-L + BERTScore + Faithfulness Check
    │   └── No  → LLM-as-Judge (Rubric-Based) + Hallucination Audit

Detailed Architectural Overview

The evaluation testing framework operates as a pipeline from agent output collection through scoring, aggregation, and regression analysis.

+----------------+     +-----------------+     +------------------+     +-------------------+
| Agent Runtime  | ──► | Trace Collector | ──► | Eval Dispatcher  | ──► | Scoring Engines   |
| (completions)  |     | (trajectories)  |     | (routes by type) |     |(judge/metric/exec)|
+----------------+     +-----------------+     +------------------+     +-------------------+
                                                                                  │
                                                                                  ▼
+----------------+                                                       +-------------------+
| CI/CD Gateway  | ◄─────────────────────────────────────────────────── | Aggregator &      |
| (pass/fail)    |                                                       | Regression Tester |
+----------------+                                                       +-------------------+

Evaluation Lifecycle

[Agent Produces Output]
       │
       ├──► (A) Trace Collection ──► Captures tool calls, reasoning steps, final output
       │
       ├──► (B) Eval Routing ──► Matches output type to scoring strategy (judge/metric/exec)
       │
       ├──► (C) Multi-Dimensional Scoring ──► $S_d = \frac{1}{N}\sum_{i=1}^{N} J_d(o_i, r_i)$
       │
       ├──► (D) Statistical Aggregation ──► Computes means, CIs, effect sizes (Cohen's d)
       │
       └──► (E) Regression Test ──► Two-sample t-test against baseline: $t = \frac{\bar{X}_1 - \bar{X}_2}{s_p\sqrt{2/n}}$
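
The two formulas in the lifecycle can be sketched directly. Here $J_d(o_i, r_i)$ is read as the judge's score on dimension $d$ for output $o_i$ against reference $r_i$, which is an interpretation since the symbols are not defined above. The $t$ formula is the pooled two-sample form for equal sample sizes $n$, where $s_p$ is the pooled standard deviation. Phase 5 describes paired tests, which apply when both versions are scored on the same test cases. Function names are hypothetical.

```python
import math
from statistics import mean, variance

def dimension_score(judge_scores):
    """S_d: mean of the per-item judge scores J_d(o_i, r_i)."""
    return mean(judge_scores)

def pooled_t_statistic(current, baseline):
    """t = (mean1 - mean2) / (s_p * sqrt(2 / n)) for two equal-size samples."""
    if len(current) != len(baseline) or len(current) < 2:
        raise ValueError("need two samples of equal size n >= 2")
    n = len(current)
    # With equal n, the pooled variance is the average of the two variances.
    s_p = math.sqrt((variance(current) + variance(baseline)) / 2)
    if s_p == 0:
        raise ValueError("both samples have zero variance")
    return (mean(current) - mean(baseline)) / (s_p * math.sqrt(2 / n))
```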

Workflow Steps

Phase 1: Trace Collection & Dataset Preparation

Instrument Agent Runtime: Attach trace collectors to capture every tool call, reasoning step, and final output in structured format.
Load Evaluation Dataset: Pull versioned test cases from the dataset registry with stratified sampling by difficulty tier.
Generate Agent Outputs: Execute the agent against all test cases with temperature fixed and random seed locked for reproducibility.
Serialize Trajectories: Store complete execution traces (inputs, intermediate states, outputs) in JSONL format.
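
As an illustration of the serialization step, the sketch below writes one trace per line in JSONL and reads it back. The field names (`case_id`, `steps`, `output`) are hypothetical; the skill does not specify a trace schema here.

```python
import io
import json

def write_traces(traces, stream):
    """Write one JSON object per line (JSONL)."""
    for trace in traces:
        stream.write(json.dumps(trace, sort_keys=True) + "\n")

def read_traces(stream):
    """Read JSONL back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]

# Hypothetical trace: input case, intermediate tool calls, final output.
traces = [{
    "case_id": "case-001",
    "steps": [{"tool": "search", "args": {"q": "capital of France"}}],
    "output": "Paris",
}]
buffer = io.StringIO()
write_traces(traces, buffer)
buffer.seek(0)
restored = read_traces(buffer)
```

Writing one record per line keeps traces appendable and lets later stages stream them without loading the whole file.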

Phase 2: Scoring Engine Selection

Classify Output Type: Determine whether each test case requires metric-based, execution-based, or judge-based evaluation.
Load Rubric Definitions: Bind dimension-specific scoring rubrics (correctness, helpfulness, safety, coherence) to the eval dispatcher.
Configure Judge Models: Initialize LLM-as-judge instances with calibrated system prompts and few-shot examples.
Set Sampling Parameters: Configure $N$ judge samples per item for consensus scoring.
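
One way to turn $N$ judge samples into a consensus score is to take the median and flag items where the judges disagree too much. This is a sketch under assumptions: the median rule and the `max_spread` threshold are illustrative choices, not something this skill prescribes. The minimum of 5 samples follows the $N \ge 5$ rule.

```python
from statistics import median, pstdev

def consensus_score(samples, max_spread=1.0):
    """Aggregate N judge scores for one item into a consensus record."""
    if len(samples) < 5:
        raise ValueError("need at least 5 judge samples per item")
    spread = pstdev(samples)
    return {
        "score": median(samples),        # robust to a single outlier judge
        "spread": spread,                # population std dev across samples
        "stable": spread <= max_spread,  # False => send to human review
    }

result = consensus_score([4, 4, 5, 4, 3])
```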

Phase 3: Multi-Dimensional Evaluation

Execute Metric Evaluations: Run deterministic metrics (F1, BLEU, ROUGE-L, BERTScore) on applicable test cases.
Execute Judge Evaluations: Route subjective dimensions through LLM-as-judge with structured output schemas.
Execute Trajectory Evaluations: Score step-by-step reasoning chains against golden trajectories.
Run Hallucination Detection: Extract factual claims and verify against source documents.
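
For the deterministic metrics, a token-level F1 score is simple enough to sketch without external libraries. The whitespace tokenization and lowercasing below are simplifying assumptions; production metric libraries usually normalize punctuation and articles as well.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat", "the cat sat")` has precision 1 and recall 2/3, giving an F1 of 0.8.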

Phase 4: Statistical Aggregation

Compute Dimension Means: Calculate per-dimension score averages with bootstrap confidence intervals.
Compute Effect Sizes: Calculate Cohen's d between current and baseline score distributions.
Run Normality Tests: Apply Shapiro-Wilk test to determine appropriate statistical comparison method.
Generate Score Distributions: Plot histograms and box plots for each evaluation dimension.
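
The aggregation steps can be sketched with the standard library alone. The percentile bootstrap and pooled-standard-deviation form of Cohen's d are common choices, assumed here rather than specified by this skill; `n_resamples` and `seed` are hypothetical parameters.

```python
import math
import random
from statistics import mean, stdev

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)  # seeded so the report is reproducible
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def cohens_d(current, baseline):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(current), len(baseline)
    pooled = math.sqrt(
        ((n1 - 1) * stdev(current) ** 2 + (n2 - 1) * stdev(baseline) ** 2)
        / (n1 + n2 - 2)
    )
    return (mean(current) - mean(baseline)) / pooled
```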

Phase 5: Regression Detection

Load Baseline Snapshots: Retrieve the most recent accepted baseline from the eval registry.
Execute Statistical Tests: Run paired t-tests on per-case scores from the current and baseline versions, or Wilcoxon signed-rank tests when the Phase 4 normality check fails.
Apply Holm-Bonferroni Correction: Correct for multiple comparisons across evaluation dimensions.
Render Regression Verdict: Emit PASS/FAIL based on corrected p-values and minimum effect size thresholds.
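
The Holm-Bonferroni step-down procedure is short enough to sketch in full. The function name and the example p-values are hypothetical; the procedure itself is the standard one: sort p-values ascending, compare the $k$-th smallest against $\alpha / (m - k + 1)$, and stop rejecting at the first failure.

```python
def holm_bonferroni(p_values: dict, alpha: float = 0.05) -> dict:
    """Map each dimension to True if its null hypothesis is rejected."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    rejected = {}
    still_rejecting = True
    for rank, (name, p) in enumerate(ordered):  # rank is 0-based
        # Once one comparison fails, all larger p-values also fail.
        still_rejecting = still_rejecting and p <= alpha / (m - rank)
        rejected[name] = still_rejecting
    return rejected

# Hypothetical per-dimension p-values from the regression tests.
verdicts = holm_bonferroni(
    {"correctness": 0.01, "helpfulness": 0.04, "safety": 0.03}
)
# correctness: 0.01 <= 0.05/3, rejected; safety: 0.03 > 0.05/2, so it and
# helpfulness are not rejected.
```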

Phase 6: CI/CD Integration & Reporting

Publish Eval Report: Write structured JSON report to the CI artifact store.
Update Baseline Registry: If verdict is PASS and metrics improve, promote current scores to the new baseline.
Gate Deployment: Block or allow deployment based on regression verdict and mandatory dimension thresholds.
Alert on Degradations: Send notifications for statistically significant regressions exceeding alert thresholds.
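
A deployment gate can be reduced to a function that maps the eval report to a process exit code. This sketch assumes the report shape shown under Response Formats; the `min_scores` floors and the choice to compare against `ci_lower` rather than the mean are hypothetical, conservative design decisions.

```python
def gate_exit_code(report: dict, min_scores: dict) -> int:
    """Return 0 to allow deployment, 1 to block it."""
    if report["regression_detected"]:
        return 1
    for dimension, floor in min_scores.items():
        # Use the lower CI bound so a noisy mean cannot sneak past the gate.
        if report["dimensions"][dimension]["ci_lower"] < floor:
            return 1
    return 0

report = {
    "regression_detected": False,
    "dimensions": {
        "correctness": {"mean": 0.87, "ci_lower": 0.82, "ci_upper": 0.92},
        "safety": {"mean": 0.99, "ci_lower": 0.97, "ci_upper": 1.00},
    },
}
allowed = gate_exit_code(report, {"safety": 0.95, "correctness": 0.80})
blocked = gate_exit_code(report, {"safety": 0.98})
```

In a CI job the result would typically be passed to `sys.exit`, so a non-zero code fails the pipeline stage.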

Extended Troubleshooting Guide

When implementing evaluation testing frameworks, you may encounter the following common failure modes:

Symptom	Primary Cause	Mitigation Action
High variance in LLM-as-judge scores	Judge prompt is underspecified or lacks calibration examples.	Add 3-5 few-shot examples covering edge cases and re-calibrate against human labels.
False regression alerts on every run	Baseline was captured from a single run without confidence intervals.	Re-capture baseline using $N \ge 50$ samples and store distribution parameters.
Hallucination scorer flags correct outputs	Verification source documents are incomplete or outdated.	Expand source corpus and add a confidence threshold ($\tau \ge 0.8$) before flagging.
CI pipeline times out during eval	Full eval suite runs against entire dataset on every commit.	Implement tiered eval: fast smoke tests on PR, full suite on merge to main.
Judge model agrees with itself (self-bias)	Same model used for generation and judging.	Use a different model family for judging or implement cross-model consensus.
Eval scores plateau despite agent improvements	Benchmark saturation — test cases are too easy.	Refresh dataset with adversarial examples targeting known failure modes.
Trajectory eval misses semantic equivalence	Step comparison uses exact string matching.	Use semantic similarity (cosine $\ge 0.85$) for step-level comparison instead.
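
The last mitigation in the table, replacing exact string matching with cosine similarity, might look like the sketch below. It assumes each trajectory step has already been converted to an embedding vector by some model (not shown), and it compares steps position by position, which is a simplification; `step_match_rate` is a hypothetical name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def step_match_rate(agent_steps, golden_steps, threshold=0.85):
    """Fraction of golden steps matched by the agent step at the same index."""
    if not golden_steps:
        raise ValueError("golden trajectory must not be empty")
    matched = sum(
        cosine(agent, golden) >= threshold
        for agent, golden in zip(agent_steps, golden_steps)
    )
    # Golden steps with no agent counterpart count as unmatched.
    return matched / len(golden_steps)
```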

Complete Evaluation Pipeline Scenario

Below is a typical end-to-end evaluation execution for a code-generation agent:

[PR Opened] ──► CI Trigger fires
                    │
[Stage 1] ──► Load 50-case smoke test dataset ──► Run agent on all cases
                                                        │
[Stage 2] ──► Route: 30 exec-based (pass@1) + 20 judge-based (correctness)
                    │                                    │
[Stage 3] ──► pass@1 = 0.83 (baseline: 0.81)    judge_mean = 4.2/5 (baseline: 4.1/5)
                    │                                    │
[Stage 4] ──► Paired t-test: p=0.23 (not significant)   p=0.34 (not significant)
                    │
[Stage 5] ──► Verdict: PASS ──► Merge allowed ──► Full eval queued on main branch
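
The execution-based branch of this scenario reports pass@1. A sketch of the widely used unbiased pass@k estimator follows, where `n` is the number of samples generated per problem and `c` the number that pass the tests; the function name and argument checks are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes,
    given that c of the n samples passed: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill a draw of k samples
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 is 1.0 or 0.0 for each problem, and the
# suite-level score is the mean over problems.
single = pass_at_k(n=5, c=2, k=1)  # 1 - 3/5 = 0.4
```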

Rules and Guidelines

Rule 1: Never evaluate agent outputs with a single sample. All eval dimensions must use $N \ge 5$ samples with reported confidence intervals.
Rule 2: LLM-as-judge prompts must include explicit scoring rubrics with level definitions (e.g., 1=incorrect, 3=partially correct, 5=fully correct).
Rule 3: Trajectory evaluations must score both the correctness of individual steps AND the optimality of the overall path.
Rule 4: Eval datasets must be versioned using content hashes and must never be modified in-place. Create new versions instead.
Rule 5: Regression detection must use family-wise error rate correction (Holm-Bonferroni) when testing across multiple dimensions simultaneously.
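
Rule 4 can be illustrated with a content hash that serves as the dataset version identifier. This is a minimal sketch: the canonicalization (sorted keys, order-independent cases) and the use of SHA-256 are assumptions, and `dataset_version` is a hypothetical helper.

```python
import hashlib
import json

def dataset_version(cases: list) -> str:
    """Derive a version id from dataset content, independent of case order."""
    # Canonicalize each case so key order and whitespace do not affect the hash.
    canonical_cases = sorted(
        json.dumps(case, sort_keys=True, separators=(",", ":")) for case in cases
    )
    payload = "\n".join(canonical_cases).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

v1 = dataset_version([{"id": 1, "q": "2+2", "a": "4"}, {"id": 2, "q": "3+3", "a": "6"}])
v1_reordered = dataset_version([{"q": "3+3", "a": "6", "id": 2}, {"a": "4", "id": 1, "q": "2+2"}])
v2 = dataset_version([{"id": 1, "q": "2+2", "a": "4"}, {"id": 2, "q": "3+3", "a": "7"}])
```

Any change to a case yields a new identifier, so an edited dataset is naturally stored as a new version instead of overwriting the old one.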

Reference Guides

Below are links to the reference guides detailing the algorithms, data schemas, code implementations, and integration patterns used in this evaluation testing framework:

trajectory-evaluation.md: Covers step-by-step trajectory scoring algorithms, golden trajectory comparison, semantic step matching, and trajectory optimality metrics.
llm-as-judge-patterns.md: Details LLM-as-judge architectures including pointwise scoring, pairwise comparison, reference-guided judging, consensus protocols, and calibration techniques.
verifier-agent-patterns.md: Defines dedicated verification agent topologies, cross-model verification, execution-based verification, and multi-agent debate protocols.
cicd-eval-integration.md: Provides CI/CD pipeline configurations for GitHub Actions, GitLab CI, and Jenkins with tiered eval stages, artifact management, and deployment gates.
regression-detection.md: Outlines statistical regression detection methods, baseline management, effect size calculations, and alerting threshold configurations.
benchmark-design.md: Explains benchmark dataset design principles, difficulty stratification, contamination prevention, and saturation detection algorithms.
hallucination-scoring.md: Covers hallucination detection and scoring pipelines, claim extraction, source verification, faithfulness metrics, and severity classification.
eval-dataset-management.md: Defines dataset versioning schemas, content-hash registries, stratified sampling strategies, and dataset refresh lifecycle management.

Handoff

For projects requiring prompt optimization before evaluation, hand off to context-engineering. For systems implementing architectural constraints on agent behavior, hand off to architectural-constraints. For agent failure recovery during evaluation runs, hand off to error-recovery.