Ejecuta cualquier Skill en Manus
con un clic

Ejecuta cualquier Skill en Manus con un clic

agent-eval-harness

Estrellas5

Forks0

Actualizado10 de febrero de 2026, 18:29

CLI tool for capturing agent trajectories. Execute prompts against headless CLI agents via schema-driven adapters, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring.

Instalación

Instalar con Codex o Claude Copia este prompt, pégalo en Codex, Claude u otro asistente, y deja que revise la página de la skill y la instale por ti.

Ejecutar en Manus

Fuente

youdotcom-oss

youdotcom-oss/web-search-agent-evals

Abrir repositorio de GitHub Ver repositorios del creador

Descarga

Ejecutar en Manus

Ocupaciones relacionadasSOC

Basado en la clasificación ocupacional SOC

Analistas de garantía de calidad de software y probadoresOcupaciones informáticas y matemáticas·SOC 15-1253

Explorador de archivos

9 archivos

SKILL.md

readonly

Más de este repositorio

mismo repositorio

web-search-agent-evals

youdotcom-oss/web-search-agent-evals

Development assistant for web search agent evaluations across multiple CLI agents

2026-02-195

headless-adapters

youdotcom-oss/web-search-agent-evals

Discover, create, and validate headless adapters for agent integration. Includes scaffolding tools and schema-driven compliance testing.

2026-02-105

scaffold-rules

youdotcom-oss/web-search-agent-evals

Scaffold development rules for AI coding agents. Auto-invoked when user asks about setting up rules, coding conventions, or configuring their AI agent environment.

2026-02-105

youdotcom-cli

youdotcom-oss/web-search-agent-evals

Web search, AI-powered research with citations, and content extraction for bash agents using You.com's @youdotcom-oss/api CLI. Interactive workflow covers API setup, livecrawl (one-call search+extract), deep-search for cited answers, and schema-driven JSON queries. Faster than built-in search with verifiable references.

2026-02-105

Ejecuta cualquier Skill con un clic

name	agent-eval-harness
description	CLI tool for capturing agent trajectories. Execute prompts against headless CLI agents via schema-driven adapters, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring.
compatibility	Bun >= 1.2.9

Agent Eval Harness

Purpose

CLI tool for capturing trajectories from headless CLI agents, optimized for TypeScript/JavaScript projects using Bun.

The harness captures. You score.

Harness Provides	You Provide
Prompt execution via headless adapters	Scoring logic (Braintrust, custom scripts)
Full trajectory capture (thoughts, tools, plans)	Pass/fail determination via graders
Structured JSONL output	LLM-as-judge prompts
Reproducible execution environment	CI integration, golden file comparison

Use this when:

Capturing trajectories for downstream evaluation
Generating training data (SFT/DPO) with full context
Building regression test fixtures for agent behavior
Comparing agent responses across configurations

Installation

# Run without installing (recommended)
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl

# Or install as project dependency
bun add @plaited/agent-eval-harness

Core Principle: Capture Once, Derive Many Views

flowchart LR
    Prompts["prompts.jsonl"] --> Capture["capture/trials"]
    Schema["headless schema"] --> Capture
    Capture --> Results["results.jsonl (full trajectory)"]
    Results --> Summarize["summarize"]
    Results --> Calibrate["calibrate"]
    Results --> Custom["(your tools)"]
    Summarize --> Views["summary.jsonl / .md"]
    Calibrate --> Report["calibration.md"]
    Custom --> Pipeline["any scoring platform"]

Single output format: Full trajectory JSONL (always) No --format flag: Derive views with separate commands Schema exports: Zod schemas + JSON Schema for any tooling

Commands

Core Commands

Command	Input	Output	Purpose
`capture`	prompts.jsonl + schema	results.jsonl	Trajectory capture (full)
`trials`	prompts.jsonl + schema	trials.jsonl	Multi-run + optional metrics
`summarize`	results.jsonl	summary.jsonl or .md	Derive compact views
`calibrate`	results.jsonl	calibration.md	Sample failures for review
`validate-refs`	prompts.jsonl	validation.jsonl	Check reference solutions
`balance`	prompts.jsonl	balance.json	Analyze test set coverage
`schemas`	(none)	JSON Schema	Export schemas for non-TS users

Pipeline Commands (Unix-style)

Command	Input	Output	Purpose
`run`	prompts.jsonl + schema	raw.jsonl	Execute prompts, raw output
`extract`	raw.jsonl + schema	extracted.jsonl	Parse trajectories
`grade`	extracted.jsonl + grader	graded.jsonl	Apply grader scoring
`format`	results.jsonl	jsonl/markdown/csv	Convert output format
`compare`	multiple results.jsonl	comparison.json	Compare runs (aggregate report)

All commands support optional --grader ./grader.ts for scoring.

Capture Command

Basic Usage

bunx @plaited/agent-eval-harness capture <prompts.jsonl> --schema <schema.json> [options]

Arguments

Argument/Flag	Description	Default
`prompts.jsonl`	Input file with prompts to execute	Required
`-s, --schema`	Path to headless adapter schema	Required
`-o, --output`	Output file/path	stdout
`-c, --cwd`	Working directory for agent	current
`-t, --timeout`	Request timeout in ms	`60000`
`-j, --concurrency`	Number of concurrent workers	`1`
`--workspace-dir`	Base directory for per-prompt workspace isolation	none
`--progress`	Show progress to stderr	false
`--append`	Append to output file	false
`-g, --grader`	Path to grader module	none
`--debug`	Show detailed CLI output for debugging	false

Examples

# Basic capture
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl

# Parallel execution (4x faster with 4 workers)
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -j 4 -o results.jsonl

# With workspace isolation for code generation tasks
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json \
  -j 4 --workspace-dir ./workspaces -o results.jsonl

# Using a local adapter script
bunx @plaited/agent-eval-harness capture prompts.jsonl bun ./my-adapter.ts -o results.jsonl

# With grader (adds score to each result)
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.ts -o results.jsonl

Trials Command

Run each prompt multiple times for pass@k/pass^k analysis.

# Capture only (no grader)
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -o trials.jsonl

# With grader (computes pass@k, pass^k)
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 --grader ./grader.ts -o trials.jsonl

# Parallel execution (4 prompts' trials run concurrently)
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -j 4 -o trials.jsonl

# With workspace isolation (each trial gets its own directory)
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -j 4 \
  --workspace-dir ./workspaces -o trials.jsonl

Parallelization notes:

-j/--concurrency parallelizes across prompts (not trials within a prompt)
Each prompt's k trials still run sequentially (required for aggregation)
With 151 prompts and -j 4, you get 4 prompts running trials concurrently
--workspace-dir creates {workspace-dir}/prompt-{id}-trial-{n}/ for each trial
Progress logging shows aggregate completion (e.g., 12/50 prompts completed)

Workspace cleanup: Directories persist after completion for debugging. Clean up manually:

# After capture
rm -rf ./workspaces

# In CI (add as post-step)
- run: rm -rf ./workspaces
  if: always()

Output

Without grader:

{"id":"search-001","input":"Find the CEO","k":5,"trials":[{"trialNum":1,"output":"...","trajectory":[...],"duration":1234},...]}

With grader:

{"id":"search-001","input":"Find the CEO","k":5,"passRate":0.8,"passAtK":0.99,"passExpK":0.33,"trials":[{"trialNum":1,"output":"...","pass":true,"score":1.0},...]}

Summarize Command

Derive compact views from full trajectory results.

# Summary JSONL (for jq analysis)
bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl

# Markdown (for LLM-as-judge)
bunx @plaited/agent-eval-harness summarize results.jsonl --markdown -o results.md

Calibrate Command

Sample failures for grader review. Calibration helps you distinguish between agent failures (agent did wrong thing) and grader bugs (agent was correct, grader too strict).

# Sample failures for human review
bunx @plaited/agent-eval-harness calibrate results.jsonl --sample 10 -o calibration.md

# Re-score with different grader to compare
bunx @plaited/agent-eval-harness calibrate results.jsonl --grader ./loose-grader.ts --sample 10 -o comparison.md

See eval-concepts.md for why calibration matters.

Validate-Refs Command

Check that reference solutions pass your grader before evaluating agents.

# Validate reference solutions
bunx @plaited/agent-eval-harness validate-refs prompts.jsonl --grader ./grader.ts -o validation.jsonl

# Check for failures
cat validation.jsonl | jq 'select(.pass == false)'

Why Use This?

If your reference solution fails your own grader:

The task definition is ambiguous
The grader is too strict
The hint is wrong

Fix the eval before evaluating the agent.

Input Format

Prompts must include a reference field:

{"id":"test-001","input":"Create a button component","hint":"<button>","reference":"export const Button = () => <button>Click</button>"}

Output Format

{"id":"test-001","input":"Create a button component","reference":"export const Button = () => <button>Click</button>","pass":true,"score":1.0,"reasoning":"Contains hint content"}

Balance Command

Analyze test set coverage to ensure balanced evaluation.

# Analyze prompt distribution
bunx @plaited/agent-eval-harness balance prompts.jsonl -o balance.json

# Pretty print
bunx @plaited/agent-eval-harness balance prompts.jsonl | jq .

Why Use This?

An eval with only "make X work" misses "don't break Y". Balance analysis shows:

Category distribution (from metadata.category)
Positive/negative case ratio
Coverage gaps

Output Format

{
  "totalCases": 50,
  "categories": [
    { "name": "ui", "count": 20, "percentage": 40 },
    { "name": "logic", "count": 15, "percentage": 30 },
    { "name": "api", "count": 10, "percentage": 20 },
    { "name": "edge-case", "count": 5, "percentage": 10 }
  ],
  "underrepresented": ["edge-case"],
  "suggestions": ["Consider adding more test cases for: edge-case"]
}

Balanced Eval Design

Include both positive and negative cases:

Type	Example	Purpose
Positive	"Add a login button"	Agent should succeed
Negative	"Add a button without breaking tests"	Agent should not break things
Edge case	"Handle empty input gracefully"	Agent should be robust

See eval-concepts.md for more on balanced test sets.

Pipeline Workflow

The pipeline commands enable Unix-style composition for flexible evaluation workflows.

Full Pipeline Example

# Execute → Extract → Grade → Format in one pipeline
cat prompts.jsonl | \
  bunx @plaited/agent-eval-harness run -s claude.json | \
  bunx @plaited/agent-eval-harness extract -s claude.json | \
  bunx @plaited/agent-eval-harness grade -g ./grader.ts | \
  bunx @plaited/agent-eval-harness format -f markdown > report.md

Run Command

Execute prompts and output raw results. Three modes available:

# Schema mode (recommended)
bunx @plaited/agent-eval-harness run prompts.jsonl --schema claude.json

# Simple mode: {} placeholder substitution
bunx @plaited/agent-eval-harness run prompts.jsonl --simple "claude -p {} --output-format stream-json"

# Shell mode: $PROMPT environment variable
bunx @plaited/agent-eval-harness run prompts.jsonl --shell 'claude -p "$PROMPT" --output-format stream-json'

⚠️ Security Warning: The --simple and --shell modes execute prompts via shell commands. Prompts are escaped but do not use untrusted prompt content with these modes. Malicious prompt text could potentially escape the quoting and execute arbitrary commands. Use --schema mode (headless adapter) for untrusted inputs.

Extract Command

Parse raw output into structured trajectories:

# From file
bunx @plaited/agent-eval-harness extract raw.jsonl --schema claude.json -o extracted.jsonl

# Piped from run
bunx @plaited/agent-eval-harness run prompts.jsonl -s claude.json | \
  bunx @plaited/agent-eval-harness extract -s claude.json

Grade Command

Apply grader to extracted results:

bunx @plaited/agent-eval-harness grade extracted.jsonl --grader ./grader.ts -o graded.jsonl

Format Command

Convert results to different output formats:

# Markdown report
bunx @plaited/agent-eval-harness format results.jsonl --style markdown -o report.md

# CSV for spreadsheets
bunx @plaited/agent-eval-harness format results.jsonl --style csv -o results.csv

# JSONL (pass-through, default)
bunx @plaited/agent-eval-harness format results.jsonl --style jsonl

Compare Command

Compare multiple runs of the same prompts. Supports both CaptureResult (single-run) and TrialResult (multi-run reliability) formats with auto-detection.

# Default: auto-detect format, weighted strategy, JSON output
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json

# Statistical significance strategy
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl --strategy statistical -o comparison.json

# Custom weights via environment variables (CaptureResult)
COMPARE_QUALITY=0.7 COMPARE_LATENCY=0.2 COMPARE_RELIABILITY=0.1 \
  bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json

# Markdown report format
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl --format markdown -o report.md

# Custom grader (LLM-as-Judge)
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl \
  --strategy custom --grader ./my-llm-judge.ts -o comparison.json

# With explicit labels
bunx @plaited/agent-eval-harness compare \
  --run "with-mcp:results-mcp.jsonl" \
  --run "vanilla:results-vanilla.jsonl" \
  -o comparison.json

Use cases for compare:

Same agent, different MCP servers
Same agent, different skills enabled
Same agent, different model versions
Different agents entirely

Trials Comparison (pass@k Analysis)

Compare TrialResult files for reliability analysis:

# Auto-detect trials format
bunx @plaited/agent-eval-harness compare trials1.jsonl trials2.jsonl -o comparison.json

# Explicit format (skip auto-detection)
bunx @plaited/agent-eval-harness compare trials1.jsonl trials2.jsonl --input-format trials -o comparison.json

# Custom weights for trials comparison
COMPARE_CAPABILITY=0.5 COMPARE_RELIABILITY=0.3 COMPARE_CONSISTENCY=0.2 \
  bunx @plaited/agent-eval-harness compare trials1.jsonl trials2.jsonl -o comparison.json

Trials metrics:

Metric	Description	Formula
Capability (passAtK)	Can solve at least once in K tries	`1 - (1-p)^k`
Reliability (passExpK)	Solves consistently every time	`p^k`
Flakiness	Gap between capability and reliability	`passAtK - passExpK`
Quality (scores)	Aggregate grader scores across trials	avg/median/p25/p75 (only with grader)
Performance (latency)	Aggregate trial durations	p50/p90/p99/mean/min/max (always present)

Built-in Comparison Strategies

For CaptureResult (single-run):

Strategy	Description	Env Vars
`weighted` (default)	Quality, latency, reliability	`COMPARE_QUALITY`, `COMPARE_LATENCY`, `COMPARE_RELIABILITY`
`statistical`	Bootstrap for confidence intervals	`COMPARE_BOOTSTRAP_ITERATIONS`
`custom`	Your own grader	`--grader path`

For TrialResult (multi-run):

Strategy	Description	Env Vars
`weighted` (default)	Capability, reliability, consistency	`COMPARE_CAPABILITY`, `COMPARE_RELIABILITY`, `COMPARE_CONSISTENCY`
`statistical`	Bootstrap passAtK confidence intervals	`COMPARE_BOOTSTRAP_ITERATIONS`
`custom`	Your own grader	`--grader path`

Comparison Report Output

CaptureResult format outputs ComparisonReport:

{
  "meta": { "generatedAt": "...", "runs": ["baseline", "variant"], "promptCount": 100 },
  "quality": { "baseline": { "avgScore": 0.85, "passRate": 0.82 }, "variant": { ... } },
  "performance": { "baseline": { "latency": { "p50": 1200, "p90": 3400 } }, ... },
  "reliability": { "baseline": { "type": "run", "toolErrors": 5, "completionRate": 0.99 }, ... },
  "headToHead": { "pairwise": [{ "runA": "baseline", "runB": "variant", "aWins": 35, "bWins": 55 }] }
}

With --strategy statistical, quality and performance metrics include 95% confidence intervals:

{
  "quality": {
    "baseline": {
      "avgScore": 0.85,
      "passRate": 0.82,
      "confidenceIntervals": {
        "avgScore": [0.82, 0.88],
        "passRate": [0.79, 0.85]
      }
    }
  },
  "performance": {
    "baseline": {
      "latency": { "p50": 1200, "mean": 1350 },
      "confidenceIntervals": {
        "latencyMean": [1280, 1420]
      }
    }
  }
}

TrialResult format outputs TrialsComparisonReport:

{
  "meta": { "generatedAt": "...", "runs": ["claude", "gemini"], "promptCount": 50, "trialsPerPrompt": 5, "inputFormat": "trials" },
  "capability": { "claude": { "avgPassAtK": 0.92, "medianPassAtK": 0.95 }, "gemini": { "..." : "..." } },
  "reliability": { "claude": { "type": "trial", "avgPassExpK": 0.78, "medianPassExpK": 0.82 }, "gemini": { "..." : "..." } },
  "flakiness": { "claude": { "avgFlakiness": 0.14, "flakyPromptCount": 12 }, "gemini": { "..." : "..." } },
  "quality": { "claude": { "avgScore": 0.85, "medianScore": 0.90, "p25Score": 0.75, "p75Score": 0.95 }, "gemini": { "..." : "..." } },
  "performance": { "claude": { "latency": { "p50": 1200, "p90": 3400, "p99": 5100, "mean": 1500, "min": 800, "max": 5200 }, "totalDuration": 375000 }, "gemini": { "..." : "..." } },
  "headToHead": {
    "capability": [{ "runA": "claude", "runB": "gemini", "aWins": 28, "bWins": 18, "ties": 4 }],
    "reliability": ["..."],
    "overall": ["..."]
  }
}

Notes:

quality is only present when a grader was used (trials have score fields)
performance is always present (every trial has duration)

With --strategy statistical, capability, reliability, quality, and performance metrics include 95% confidence intervals:

{
  "capability": {
    "claude": {
      "avgPassAtK": 0.92,
      "confidenceIntervals": { "avgPassAtK": [0.88, 0.95] }
    }
  },
  "reliability": {
    "claude": {
      "type": "trial",
      "avgPassExpK": 0.78,
      "confidenceIntervals": { "avgPassExpK": [0.72, 0.84] }
    }
  },
  "quality": {
    "claude": {
      "avgScore": 0.85,
      "confidenceIntervals": { "avgScore": [0.82, 0.88] }
    }
  },
  "performance": {
    "claude": {
      "latency": { "mean": 1500 },
      "confidenceIntervals": { "latencyMean": [1380, 1620] }
    }
  }
}

See comparison-graders.md for complete comparison grader documentation including LLM-as-Judge patterns.

Comparison Grader Interface

CaptureResult grader:

import type { ComparisonGrader } from '@plaited/agent-eval-harness/pipeline'

export const grade: ComparisonGrader = async ({ id, input, hint, runs }) => {
  // runs is Record<string, { output, trajectory?, score?, duration?, toolErrors? }>
  return {
    rankings: [
      { run: 'with-mcp', rank: 1, score: 0.9 },
      { run: 'vanilla', rank: 2, score: 0.7 },
    ],
    reasoning: 'MCP run produced more accurate output'
  }
}

TrialResult grader:

import type { TrialsComparisonGrader } from '@plaited/agent-eval-harness/pipeline'

export const grade: TrialsComparisonGrader = async ({ id, input, hint, runs }) => {
  // runs is Record<string, { passAtK?, passExpK?, k, trials }>
  // Each trial in trials has: { duration, score?, pass?, output, trajectory }
  return {
    rankings: [
      { run: 'claude', rank: 1, score: 0.92 },
      { run: 'gemini', rank: 2, score: 0.85 },
    ],
    reasoning: 'Claude has higher reliability with lower flakiness'
  }
}

Pipeline Workflow Diagram

flowchart LR
    Prompts["prompts.jsonl"] --> Run["run"]
    Schema["headless schema"] --> Run
    Run --> Raw["raw.jsonl"]
    Raw --> Extract["extract"]
    Schema --> Extract
    Extract --> Extracted["extracted.jsonl"]
    Extracted --> Grade["grade"]
    Grader["grader.ts"] --> Grade
    Grade --> Graded["graded.jsonl"]
    Graded --> Format["format"]
    Format --> Output["report.md / .csv / .jsonl"]

    Graded --> Compare["compare"]
    Results2["other runs..."] --> Compare
    CompareGrader["compare-grader.ts"] --> Compare
    Compare --> Comparison["comparison.jsonl"]

Schemas Command

Export JSON schemas for non-TypeScript tools.

# List available schemas
bunx @plaited/agent-eval-harness schemas

# Export all schemas as JSON
bunx @plaited/agent-eval-harness schemas --json -o schemas.json

# Export specific schema
bunx @plaited/agent-eval-harness schemas CaptureResult --json
bunx @plaited/agent-eval-harness schemas TrialResult --json
bunx @plaited/agent-eval-harness schemas GraderResult --json

Available Schemas

Schema	Description
`CaptureResult`	Single capture output (id, input, output, trajectory, timing)
`TrialResult`	Multi-run trial output (includes passAtK, passExpK)
`GraderResult`	Grader return value (pass, score, reasoning)
`PromptInput`	Input prompt format
`TrajectoryStep`	Single step in trajectory array
`SummaryResult`	Compact summary format

Usage in Other Languages

Export schemas for validation in Python, Go, etc.:

# Export all schemas
bunx @plaited/agent-eval-harness schemas --json -o schemas.json

# Use in Python with jsonschema
python -c "
import json
from jsonschema import validate

with open('schemas.json') as f:
    schemas = json.load(f)

with open('results.jsonl') as f:
    for line in f:
        result = json.loads(line)
        validate(result, schemas['CaptureResult'])
        print(f'{result[\"id\"]}: valid')
"

Grader Interface

Graders provide semantic pass/fail scoring for captured trajectories. The harness supports graders written in any language.

Git-Based Grading (Recommended for Coding Tasks)

Grade outcomes, not paths. Use the optional cwd parameter to detect environmental changes with git:

// git-grader.ts
import type { Grader } from '@plaited/agent-eval-harness/schemas'

export const grade: Grader = async ({ output, hint, cwd }) => {
  if (!cwd) return { pass: false, score: 0, reasoning: 'No cwd' }
  
  // Detect file changes
  const status = await Bun.$`git -C ${cwd} status --porcelain`.text()
  const filesCreated = status
    .split('\n')
    .filter(line => line.startsWith('??'))
    .map(line => line.slice(3).trim())
  
  // Verify tests pass
  const testResult = await Bun.$`cd ${cwd} && bun test`.nothrow()
  
  return {
    pass: filesCreated.length > 0 && testResult.exitCode === 0,
    score: testResult.exitCode === 0 ? 1 : 0,
    reasoning: `Files: ${filesCreated.join(', ')}. Tests: ${testResult.exitCode === 0 ? 'pass' : 'fail'}`,
    outcome: {  // Optional: structured data for analysis
      filesCreated,
      testsPassed: testResult.exitCode === 0,
      type: 'file_creation_with_tests'
    }
  }
}

See inline-graders.md for comprehensive git-based grading patterns.

Output-Based Grading (General Purpose)

// my-grader.ts
import type { Grader } from '@plaited/agent-eval-harness/schemas'

export const grade: Grader = async ({ input, output, hint, trajectory }) => {
  const pass = output.toLowerCase().includes(hint?.toLowerCase() ?? '')
  return {
    pass,
    score: pass ? 1 : 0,
    reasoning: pass ? 'Contains hint content' : 'Missing hint content'
  }
}

Note: input can be string (single turn) or string[] (multi-turn). The hint field provides grader context (renamed from expected).

Python/Executable Graders

Any executable can be a grader using stdin/stdout JSON protocol:

#!/usr/bin/env python3
import json, sys

data = json.load(sys.stdin)
output = data.get("output", "").lower()
hint = (data.get("hint") or "").lower()

pass_result = hint in output if hint else True
print(json.dumps({
    "pass": pass_result,
    "score": 1.0 if pass_result else 0.0,
    "reasoning": "Contains hint" if pass_result else "Missing hint"
}))

chmod +x ./grader.py
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.py -o results.jsonl

See inline-graders.md for complete grader documentation including LLM-as-Judge patterns.

Input Format

Each line in prompts.jsonl:

{"id":"test-001","input":"Create a button","hint":"should contain <button>"}
{"id":"test-002","input":["Create a button","Make it blue"],"metadata":{"category":"ui"}}

Field	Required	Description
`id`	Yes	Unique identifier
`input`	Yes	Single prompt (string) or conversation turns (string[])
`hint`	No	Grader context - what to look for (not strict match)
`reference`	No	Reference solution (for validate-refs)
`metadata`	No	Tags, category, difficulty for filtering
`timeout`	No	Override default timeout for this prompt

Session behavior: Each JSONL entry = 1 fresh session

input: string → 1 session, 1 prompt
input: string[] → 1 session, N prompts (sequential turns)

Output Format

Full trajectory JSONL (always):

{
  "id": "test-001",
  "input": "Find the CEO of Anthropic",
  "output": "The CEO of Anthropic is Dario Amodei.",
  "hint": "should mention Dario Amodei",
  "trajectory": [
    {"type": "thought", "content": "I'll search for this...", "timestamp": 100},
    {"type": "tool_call", "name": "WebSearch", "status": "completed", "input": {...}, "output": {...}, "duration": 500},
    {"type": "message", "content": "The CEO of Anthropic is Dario Amodei.", "timestamp": 700}
  ],
  "metadata": {
    "category": "search",
    "agent": "--schema ./claude.json",
    "trajectoryRichness": "full",
    "turnCount": 1
  },
  "timing": {
    "start": 1704067200000,
    "end": 1704067201234,
    "firstResponse": 100,
    "sessionCreation": 234,
    "total": 1234,
    "inputTokens": 150,
    "outputTokens": 85
  },
  "toolErrors": false
}

Output Fields

Field	Description
`input`	Original prompt (string or string[] for multi-turn)
`hint`	Grader context hint (if provided)
`metadata.trajectoryRichness`	`"full"` \| `"messages-only"` \| `"minimal"`
`metadata.turnCount`	Number of conversation turns (1 for string, N for array)
`timing.sessionCreation`	Time to create session (ms)
`timing.total`	Total duration (end - start)
`timing.inputTokens`	Input tokens consumed (if available from adapter)
`timing.outputTokens`	Output tokens generated (if available from adapter)
`toolErrors`	Whether any tool calls failed

Note: toolErrors replaces misleading status: 'passed'|'failed'. Real pass/fail comes from YOUR grader.

Schema Exports

Consumers can import Zod schemas directly:

import { CaptureResultSchema, TrialResultSchema } from '@plaited/agent-eval-harness/schemas'

// Validate external data
const result = CaptureResultSchema.parse(jsonData)

// Generate JSON Schema (Zod 4 native)
import { z } from 'zod'
const jsonSchema = z.toJSONSchema(CaptureResultSchema)

Discriminated Unions for Reliability Metrics

Reliability metrics include a type discriminator for type-safe parsing:

import { z } from 'zod'
import {
  ReliabilityMetricsSchema,       // type: 'run'
  TrialsReliabilityMetricsSchema  // type: 'trial'
} from '@plaited/agent-eval-harness/schemas'

// Create a unified schema for both metric types
const UnifiedReliabilitySchema = z.discriminatedUnion('type', [
  ReliabilityMetricsSchema,
  TrialsReliabilityMetricsSchema,
])

// Type-safe parsing with automatic narrowing
const metrics = UnifiedReliabilitySchema.parse(data)
if (metrics.type === 'run') {
  // TypeScript knows: ReliabilityMetrics
  console.log(metrics.toolErrors, metrics.completionRate)
} else {
  // TypeScript knows: TrialsReliabilityMetrics
  console.log(metrics.avgPassExpK, metrics.medianPassExpK)
}

Or export JSON schemas for non-TypeScript tools:

bunx @plaited/agent-eval-harness schemas --json -o schemas.json
bunx @plaited/agent-eval-harness schemas CaptureResult --json

Execution Environment

Recommendation: Run the harness in Docker containers for consistent, isolated execution.

# Run integration tests via Docker
docker compose -f docker-compose.test.yml run --rm test

# Or with explicit API keys
ANTHROPIC_API_KEY=sk-... GEMINI_API_KEY=... docker compose -f docker-compose.test.yml run --rm test

Docker Requirements

Requirement	Reason
Node.js 24+	Gemini CLI uses modern JS features (optional chaining)
Non-root user	Claude CLI blocks `--dangerously-skip-permissions` as root
Gemini API key	Pass `GEMINI_API_KEY` for Gemini CLI

See docker-evals.md for complete Docker setup guide, debugging tips, and CI integration patterns.

Multi-turn Conversations

Use input: string[] to execute multi-turn conversations within a single session:

{"id":"context-001","input":["Remember this number: 42","What number did I ask you to remember?"],"hint":"42"}
{"id":"context-002","input":["My name is Alice","What is my name?"],"hint":"Alice"}

Run with the headless adapter:

# Using Claude Code via headless adapter
bunx @plaited/agent-eval-harness capture multi-turn.jsonl \
  bunx @plaited/agent-eval-harness headless --schema ./claude-headless.json \
  -o results.jsonl

# Using Gemini CLI via headless adapter
GEMINI_API_KEY=... bunx @plaited/agent-eval-harness capture multi-turn.jsonl \
  bunx @plaited/agent-eval-harness headless --schema ./gemini-headless.json \
  -o results.jsonl

Key points:

Each JSONL entry = 1 fresh session
input: string[] sends sequential turns to the same session
Works with both stream mode (Claude) and iterative mode (Gemini)
The adapter handles context preservation automatically

Downstream Integration

The harness outputs standard JSONL that pipes to any tool:

# Filter with jq
cat results.jsonl | jq 'select(.metadata.category == "ui")'

# Count tool usage
cat results.jsonl | jq -s 'map(.trajectory | map(select(.type == "tool_call")) | length) | add'

# Summarize for quick analysis
bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl

# Compare runs with built-in strategies
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json

Quick Reference

Resource	Description
`bunx @plaited/agent-eval-harness`	CLI help
output-formats.md	JSONL schemas, command details
inline-graders.md	Single input/output graders (TypeScript, Python, shell)
comparison-graders.md	Comparison strategies (weighted, statistical, LLM-as-Judge)
calibration.md	Grader calibration workflow
eval-concepts.md	Evaluation concepts (pass@k, pass^k)
docker-evals.md	Docker setup, debugging, CI integration

headless-adapters skill - Schema-driven adapters for headless CLI agents

name	agent-eval-harness
description	CLI tool for capturing agent trajectories. Execute prompts against headless CLI agents via schema-driven adapters, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring.
compatibility	Bun >= 1.2.9

Agent Eval Harness

Purpose

CLI tool for capturing trajectories from headless CLI agents, optimized for TypeScript/JavaScript projects using Bun.

The harness captures. You score.

Harness Provides	You Provide
Prompt execution via headless adapters	Scoring logic (Braintrust, custom scripts)
Full trajectory capture (thoughts, tools, plans)	Pass/fail determination via graders
Structured JSONL output	LLM-as-judge prompts
Reproducible execution environment	CI integration, golden file comparison

Use this when:

Capturing trajectories for downstream evaluation
Generating training data (SFT/DPO) with full context
Building regression test fixtures for agent behavior
Comparing agent responses across configurations

Installation

# Run without installing (recommended)
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl

# Or install as project dependency
bun add @plaited/agent-eval-harness

Core Principle: Capture Once, Derive Many Views

flowchart LR
    Prompts["prompts.jsonl"] --> Capture["capture/trials"]
    Schema["headless schema"] --> Capture
    Capture --> Results["results.jsonl (full trajectory)"]
    Results --> Summarize["summarize"]
    Results --> Calibrate["calibrate"]
    Results --> Custom["(your tools)"]
    Summarize --> Views["summary.jsonl / .md"]
    Calibrate --> Report["calibration.md"]
    Custom --> Pipeline["any scoring platform"]

Single output format: Full trajectory JSONL (always) No --format flag: Derive views with separate commands Schema exports: Zod schemas + JSON Schema for any tooling

Commands

Core Commands

Command	Input	Output	Purpose
`capture`	prompts.jsonl + schema	results.jsonl	Trajectory capture (full)
`trials`	prompts.jsonl + schema	trials.jsonl	Multi-run + optional metrics
`summarize`	results.jsonl	summary.jsonl or .md	Derive compact views
`calibrate`	results.jsonl	calibration.md	Sample failures for review
`validate-refs`	prompts.jsonl	validation.jsonl	Check reference solutions
`balance`	prompts.jsonl	balance.json	Analyze test set coverage
`schemas`	(none)	JSON Schema	Export schemas for non-TS users

Pipeline Commands (Unix-style)

Command	Input	Output	Purpose
`run`	prompts.jsonl + schema	raw.jsonl	Execute prompts, raw output
`extract`	raw.jsonl + schema	extracted.jsonl	Parse trajectories
`grade`	extracted.jsonl + grader	graded.jsonl	Apply grader scoring
`format`	results.jsonl	jsonl/markdown/csv	Convert output format
`compare`	multiple results.jsonl	comparison.json	Compare runs (aggregate report)

All commands support optional --grader ./grader.ts for scoring.

Capture Command

Basic Usage

bunx @plaited/agent-eval-harness capture <prompts.jsonl> --schema <schema.json> [options]

Arguments

Argument/Flag	Description	Default
`prompts.jsonl`	Input file with prompts to execute	Required
`-s, --schema`	Path to headless adapter schema	Required
`-o, --output`	Output file/path	stdout
`-c, --cwd`	Working directory for agent	current
`-t, --timeout`	Request timeout in ms	`60000`
`-j, --concurrency`	Number of concurrent workers	`1`
`--workspace-dir`	Base directory for per-prompt workspace isolation	none
`--progress`	Show progress to stderr	false
`--append`	Append to output file	false
`-g, --grader`	Path to grader module	none
`--debug`	Show detailed CLI output for debugging	false

Examples

# Basic capture
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl

# Parallel execution (4x faster with 4 workers)
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -j 4 -o results.jsonl

# With workspace isolation for code generation tasks
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json \
  -j 4 --workspace-dir ./workspaces -o results.jsonl

# Using a local adapter script
bunx @plaited/agent-eval-harness capture prompts.jsonl bun ./my-adapter.ts -o results.jsonl

# With grader (adds score to each result)
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.ts -o results.jsonl

Trials Command

Run each prompt multiple times for pass@k/pass^k analysis.

# Capture only (no grader)
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -o trials.jsonl

# With grader (computes pass@k, pass^k)
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 --grader ./grader.ts -o trials.jsonl

# Parallel execution (4 prompts' trials run concurrently)
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -j 4 -o trials.jsonl

# With workspace isolation (each trial gets its own directory)
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -j 4 \
  --workspace-dir ./workspaces -o trials.jsonl

Parallelization notes:

-j/--concurrency parallelizes across prompts (not trials within a prompt)
Each prompt's k trials still run sequentially (required for aggregation)
With 151 prompts and -j 4, you get 4 prompts running trials concurrently
--workspace-dir creates {workspace-dir}/prompt-{id}-trial-{n}/ for each trial
Progress logging shows aggregate completion (e.g., 12/50 prompts completed)

Workspace cleanup: Directories persist after completion for debugging. Clean up manually:

# After capture
rm -rf ./workspaces

# In CI (add as post-step)
- run: rm -rf ./workspaces
  if: always()

Output

Without grader:

{"id":"search-001","input":"Find the CEO","k":5,"trials":[{"trialNum":1,"output":"...","trajectory":[...],"duration":1234},...]}

With grader:

{"id":"search-001","input":"Find the CEO","k":5,"passRate":0.8,"passAtK":0.99,"passExpK":0.33,"trials":[{"trialNum":1,"output":"...","pass":true,"score":1.0},...]}

Summarize Command

Derive compact views from full trajectory results.

# Summary JSONL (for jq analysis)
bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl

# Markdown (for LLM-as-judge)
bunx @plaited/agent-eval-harness summarize results.jsonl --markdown -o results.md

Calibrate Command

Sample failures for grader review. Calibration helps you distinguish between agent failures (agent did wrong thing) and grader bugs (agent was correct, grader too strict).

# Sample failures for human review
bunx @plaited/agent-eval-harness calibrate results.jsonl --sample 10 -o calibration.md

# Re-score with different grader to compare
bunx @plaited/agent-eval-harness calibrate results.jsonl --grader ./loose-grader.ts --sample 10 -o comparison.md

See eval-concepts.md for why calibration matters.

Validate-Refs Command

Check that reference solutions pass your grader before evaluating agents.

# Validate reference solutions
bunx @plaited/agent-eval-harness validate-refs prompts.jsonl --grader ./grader.ts -o validation.jsonl

# Check for failures
cat validation.jsonl | jq 'select(.pass == false)'

Why Use This?

If your reference solution fails your own grader:

The task definition is ambiguous
The grader is too strict
The hint is wrong

Fix the eval before evaluating the agent.

Input Format

Prompts must include a reference field:

{"id":"test-001","input":"Create a button component","hint":"<button>","reference":"export const Button = () => <button>Click</button>"}

Output Format

{"id":"test-001","input":"Create a button component","reference":"export const Button = () => <button>Click</button>","pass":true,"score":1.0,"reasoning":"Contains hint content"}

Balance Command

Analyze test set coverage to ensure balanced evaluation.

# Analyze prompt distribution
bunx @plaited/agent-eval-harness balance prompts.jsonl -o balance.json

# Pretty print
bunx @plaited/agent-eval-harness balance prompts.jsonl | jq .

Why Use This?

An eval with only "make X work" misses "don't break Y". Balance analysis shows:

Category distribution (from metadata.category)
Positive/negative case ratio
Coverage gaps

Output Format

{
  "totalCases": 50,
  "categories": [
    { "name": "ui", "count": 20, "percentage": 40 },
    { "name": "logic", "count": 15, "percentage": 30 },
    { "name": "api", "count": 10, "percentage": 20 },
    { "name": "edge-case", "count": 5, "percentage": 10 }
  ],
  "underrepresented": ["edge-case"],
  "suggestions": ["Consider adding more test cases for: edge-case"]
}

Balanced Eval Design

Include both positive and negative cases:

Type	Example	Purpose
Positive	"Add a login button"	Agent should succeed
Negative	"Add a button without breaking tests"	Agent should not break things
Edge case	"Handle empty input gracefully"	Agent should be robust

See eval-concepts.md for more on balanced test sets.

Pipeline Workflow

The pipeline commands enable Unix-style composition for flexible evaluation workflows.

Full Pipeline Example

# Execute → Extract → Grade → Format in one pipeline
cat prompts.jsonl | \
  bunx @plaited/agent-eval-harness run -s claude.json | \
  bunx @plaited/agent-eval-harness extract -s claude.json | \
  bunx @plaited/agent-eval-harness grade -g ./grader.ts | \
  bunx @plaited/agent-eval-harness format -f markdown > report.md

Run Command

Execute prompts and output raw results. Three modes available:

# Schema mode (recommended)
bunx @plaited/agent-eval-harness run prompts.jsonl --schema claude.json

# Simple mode: {} placeholder substitution
bunx @plaited/agent-eval-harness run prompts.jsonl --simple "claude -p {} --output-format stream-json"

# Shell mode: $PROMPT environment variable
bunx @plaited/agent-eval-harness run prompts.jsonl --shell 'claude -p "$PROMPT" --output-format stream-json'

⚠️ Security Warning: The --simple and --shell modes execute prompts via shell commands. Prompts are escaped but do not use untrusted prompt content with these modes. Malicious prompt text could potentially escape the quoting and execute arbitrary commands. Use --schema mode (headless adapter) for untrusted inputs.

Extract Command

Parse raw output into structured trajectories:

# From file
bunx @plaited/agent-eval-harness extract raw.jsonl --schema claude.json -o extracted.jsonl

# Piped from run
bunx @plaited/agent-eval-harness run prompts.jsonl -s claude.json | \
  bunx @plaited/agent-eval-harness extract -s claude.json

Grade Command

Apply grader to extracted results:

bunx @plaited/agent-eval-harness grade extracted.jsonl --grader ./grader.ts -o graded.jsonl

Format Command

Convert results to different output formats:

# Markdown report
bunx @plaited/agent-eval-harness format results.jsonl --style markdown -o report.md

# CSV for spreadsheets
bunx @plaited/agent-eval-harness format results.jsonl --style csv -o results.csv

# JSONL (pass-through, default)
bunx @plaited/agent-eval-harness format results.jsonl --style jsonl

Compare Command

Compare multiple runs of the same prompts. Supports both CaptureResult (single-run) and TrialResult (multi-run reliability) formats with auto-detection.

# Default: auto-detect format, weighted strategy, JSON output
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json

# Statistical significance strategy
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl --strategy statistical -o comparison.json

# Custom weights via environment variables (CaptureResult)
COMPARE_QUALITY=0.7 COMPARE_LATENCY=0.2 COMPARE_RELIABILITY=0.1 \
  bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json

# Markdown report format
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl --format markdown -o report.md

# Custom grader (LLM-as-Judge)
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl \
  --strategy custom --grader ./my-llm-judge.ts -o comparison.json

# With explicit labels
bunx @plaited/agent-eval-harness compare \
  --run "with-mcp:results-mcp.jsonl" \
  --run "vanilla:results-vanilla.jsonl" \
  -o comparison.json

Use cases for compare:

Same agent, different MCP servers
Same agent, different skills enabled
Same agent, different model versions
Different agents entirely

Trials Comparison (pass@k Analysis)

Compare TrialResult files for reliability analysis:

# Auto-detect trials format
bunx @plaited/agent-eval-harness compare trials1.jsonl trials2.jsonl -o comparison.json

# Explicit format (skip auto-detection)
bunx @plaited/agent-eval-harness compare trials1.jsonl trials2.jsonl --input-format trials -o comparison.json

# Custom weights for trials comparison
COMPARE_CAPABILITY=0.5 COMPARE_RELIABILITY=0.3 COMPARE_CONSISTENCY=0.2 \
  bunx @plaited/agent-eval-harness compare trials1.jsonl trials2.jsonl -o comparison.json

Trials metrics:

Metric	Description	Formula
Capability (passAtK)	Can solve at least once in K tries	`1 - (1-p)^k`
Reliability (passExpK)	Solves consistently every time	`p^k`
Flakiness	Gap between capability and reliability	`passAtK - passExpK`
Quality (scores)	Aggregate grader scores across trials	avg/median/p25/p75 (only with grader)
Performance (latency)	Aggregate trial durations	p50/p90/p99/mean/min/max (always present)

Built-in Comparison Strategies

For CaptureResult (single-run):

Strategy	Description	Env Vars
`weighted` (default)	Quality, latency, reliability	`COMPARE_QUALITY`, `COMPARE_LATENCY`, `COMPARE_RELIABILITY`
`statistical`	Bootstrap for confidence intervals	`COMPARE_BOOTSTRAP_ITERATIONS`
`custom`	Your own grader	`--grader path`

For TrialResult (multi-run):

Strategy	Description	Env Vars
`weighted` (default)	Capability, reliability, consistency	`COMPARE_CAPABILITY`, `COMPARE_RELIABILITY`, `COMPARE_CONSISTENCY`
`statistical`	Bootstrap passAtK confidence intervals	`COMPARE_BOOTSTRAP_ITERATIONS`
`custom`	Your own grader	`--grader path`

Comparison Report Output

CaptureResult format outputs ComparisonReport:

{
  "meta": { "generatedAt": "...", "runs": ["baseline", "variant"], "promptCount": 100 },
  "quality": { "baseline": { "avgScore": 0.85, "passRate": 0.82 }, "variant": { ... } },
  "performance": { "baseline": { "latency": { "p50": 1200, "p90": 3400 } }, ... },
  "reliability": { "baseline": { "type": "run", "toolErrors": 5, "completionRate": 0.99 }, ... },
  "headToHead": { "pairwise": [{ "runA": "baseline", "runB": "variant", "aWins": 35, "bWins": 55 }] }
}

With --strategy statistical, quality and performance metrics include 95% confidence intervals:

{
  "quality": {
    "baseline": {
      "avgScore": 0.85,
      "passRate": 0.82,
      "confidenceIntervals": {
        "avgScore": [0.82, 0.88],
        "passRate": [0.79, 0.85]
      }
    }
  },
  "performance": {
    "baseline": {
      "latency": { "p50": 1200, "mean": 1350 },
      "confidenceIntervals": {
        "latencyMean": [1280, 1420]
      }
    }
  }
}

TrialResult format outputs TrialsComparisonReport:

{
  "meta": { "generatedAt": "...", "runs": ["claude", "gemini"], "promptCount": 50, "trialsPerPrompt": 5, "inputFormat": "trials" },
  "capability": { "claude": { "avgPassAtK": 0.92, "medianPassAtK": 0.95 }, "gemini": { "..." : "..." } },
  "reliability": { "claude": { "type": "trial", "avgPassExpK": 0.78, "medianPassExpK": 0.82 }, "gemini": { "..." : "..." } },
  "flakiness": { "claude": { "avgFlakiness": 0.14, "flakyPromptCount": 12 }, "gemini": { "..." : "..." } },
  "quality": { "claude": { "avgScore": 0.85, "medianScore": 0.90, "p25Score": 0.75, "p75Score": 0.95 }, "gemini": { "..." : "..." } },
  "performance": { "claude": { "latency": { "p50": 1200, "p90": 3400, "p99": 5100, "mean": 1500, "min": 800, "max": 5200 }, "totalDuration": 375000 }, "gemini": { "..." : "..." } },
  "headToHead": {
    "capability": [{ "runA": "claude", "runB": "gemini", "aWins": 28, "bWins": 18, "ties": 4 }],
    "reliability": ["..."],
    "overall": ["..."]
  }
}

Notes:

quality is only present when a grader was used (trials have score fields)
performance is always present (every trial has duration)

With --strategy statistical, capability, reliability, quality, and performance metrics include 95% confidence intervals:

{
  "capability": {
    "claude": {
      "avgPassAtK": 0.92,
      "confidenceIntervals": { "avgPassAtK": [0.88, 0.95] }
    }
  },
  "reliability": {
    "claude": {
      "type": "trial",
      "avgPassExpK": 0.78,
      "confidenceIntervals": { "avgPassExpK": [0.72, 0.84] }
    }
  },
  "quality": {
    "claude": {
      "avgScore": 0.85,
      "confidenceIntervals": { "avgScore": [0.82, 0.88] }
    }
  },
  "performance": {
    "claude": {
      "latency": { "mean": 1500 },
      "confidenceIntervals": { "latencyMean": [1380, 1620] }
    }
  }
}

See comparison-graders.md for complete comparison grader documentation including LLM-as-Judge patterns.

Comparison Grader Interface

CaptureResult grader:

import type { ComparisonGrader } from '@plaited/agent-eval-harness/pipeline'

export const grade: ComparisonGrader = async ({ id, input, hint, runs }) => {
  // runs is Record<string, { output, trajectory?, score?, duration?, toolErrors? }>
  return {
    rankings: [
      { run: 'with-mcp', rank: 1, score: 0.9 },
      { run: 'vanilla', rank: 2, score: 0.7 },
    ],
    reasoning: 'MCP run produced more accurate output'
  }
}

TrialResult grader:

import type { TrialsComparisonGrader } from '@plaited/agent-eval-harness/pipeline'

export const grade: TrialsComparisonGrader = async ({ id, input, hint, runs }) => {
  // runs is Record<string, { passAtK?, passExpK?, k, trials }>
  // Each trial in trials has: { duration, score?, pass?, output, trajectory }
  return {
    rankings: [
      { run: 'claude', rank: 1, score: 0.92 },
      { run: 'gemini', rank: 2, score: 0.85 },
    ],
    reasoning: 'Claude has higher reliability with lower flakiness'
  }
}

Pipeline Workflow Diagram

flowchart LR
    Prompts["prompts.jsonl"] --> Run["run"]
    Schema["headless schema"] --> Run
    Run --> Raw["raw.jsonl"]
    Raw --> Extract["extract"]
    Schema --> Extract
    Extract --> Extracted["extracted.jsonl"]
    Extracted --> Grade["grade"]
    Grader["grader.ts"] --> Grade
    Grade --> Graded["graded.jsonl"]
    Graded --> Format["format"]
    Format --> Output["report.md / .csv / .jsonl"]

    Graded --> Compare["compare"]
    Results2["other runs..."] --> Compare
    CompareGrader["compare-grader.ts"] --> Compare
    Compare --> Comparison["comparison.jsonl"]

Schemas Command

Export JSON schemas for non-TypeScript tools.

# List available schemas
bunx @plaited/agent-eval-harness schemas

# Export all schemas as JSON
bunx @plaited/agent-eval-harness schemas --json -o schemas.json

# Export specific schema
bunx @plaited/agent-eval-harness schemas CaptureResult --json
bunx @plaited/agent-eval-harness schemas TrialResult --json
bunx @plaited/agent-eval-harness schemas GraderResult --json

Available Schemas

Schema	Description
`CaptureResult`	Single capture output (id, input, output, trajectory, timing)
`TrialResult`	Multi-run trial output (includes passAtK, passExpK)
`GraderResult`	Grader return value (pass, score, reasoning)
`PromptInput`	Input prompt format
`TrajectoryStep`	Single step in trajectory array
`SummaryResult`	Compact summary format

Usage in Other Languages

Export schemas for validation in Python, Go, etc.:

# Export all schemas
bunx @plaited/agent-eval-harness schemas --json -o schemas.json

# Use in Python with jsonschema
python -c "
import json
from jsonschema import validate

with open('schemas.json') as f:
    schemas = json.load(f)

with open('results.jsonl') as f:
    for line in f:
        result = json.loads(line)
        validate(result, schemas['CaptureResult'])
        print(f'{result[\"id\"]}: valid')
"

Grader Interface

Graders provide semantic pass/fail scoring for captured trajectories. The harness supports graders written in any language.

Git-Based Grading (Recommended for Coding Tasks)

Grade outcomes, not paths. Use the optional cwd parameter to detect environmental changes with git:

// git-grader.ts
import type { Grader } from '@plaited/agent-eval-harness/schemas'

export const grade: Grader = async ({ output, hint, cwd }) => {
  if (!cwd) return { pass: false, score: 0, reasoning: 'No cwd' }
  
  // Detect file changes
  const status = await Bun.$`git -C ${cwd} status --porcelain`.text()
  const filesCreated = status
    .split('\n')
    .filter(line => line.startsWith('??'))
    .map(line => line.slice(3).trim())
  
  // Verify tests pass
  const testResult = await Bun.$`cd ${cwd} && bun test`.nothrow()
  
  return {
    pass: filesCreated.length > 0 && testResult.exitCode === 0,
    score: testResult.exitCode === 0 ? 1 : 0,
    reasoning: `Files: ${filesCreated.join(', ')}. Tests: ${testResult.exitCode === 0 ? 'pass' : 'fail'}`,
    outcome: {  // Optional: structured data for analysis
      filesCreated,
      testsPassed: testResult.exitCode === 0,
      type: 'file_creation_with_tests'
    }
  }
}

See inline-graders.md for comprehensive git-based grading patterns.

Output-Based Grading (General Purpose)

// my-grader.ts
import type { Grader } from '@plaited/agent-eval-harness/schemas'

export const grade: Grader = async ({ input, output, hint, trajectory }) => {
  const pass = output.toLowerCase().includes(hint?.toLowerCase() ?? '')
  return {
    pass,
    score: pass ? 1 : 0,
    reasoning: pass ? 'Contains hint content' : 'Missing hint content'
  }
}

Note: input can be string (single turn) or string[] (multi-turn). The hint field provides grader context (renamed from expected).

Python/Executable Graders

Any executable can be a grader using stdin/stdout JSON protocol:

#!/usr/bin/env python3
import json, sys

data = json.load(sys.stdin)
output = data.get("output", "").lower()
hint = (data.get("hint") or "").lower()

pass_result = hint in output if hint else True
print(json.dumps({
    "pass": pass_result,
    "score": 1.0 if pass_result else 0.0,
    "reasoning": "Contains hint" if pass_result else "Missing hint"
}))

chmod +x ./grader.py
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.py -o results.jsonl

See inline-graders.md for complete grader documentation including LLM-as-Judge patterns.

Input Format

Each line in prompts.jsonl:

{"id":"test-001","input":"Create a button","hint":"should contain <button>"}
{"id":"test-002","input":["Create a button","Make it blue"],"metadata":{"category":"ui"}}

Field	Required	Description
`id`	Yes	Unique identifier
`input`	Yes	Single prompt (string) or conversation turns (string[])
`hint`	No	Grader context - what to look for (not strict match)
`reference`	No	Reference solution (for validate-refs)
`metadata`	No	Tags, category, difficulty for filtering
`timeout`	No	Override default timeout for this prompt

Session behavior: Each JSONL entry = 1 fresh session

input: string → 1 session, 1 prompt
input: string[] → 1 session, N prompts (sequential turns)

Output Format

Full trajectory JSONL (always):

{
  "id": "test-001",
  "input": "Find the CEO of Anthropic",
  "output": "The CEO of Anthropic is Dario Amodei.",
  "hint": "should mention Dario Amodei",
  "trajectory": [
    {"type": "thought", "content": "I'll search for this...", "timestamp": 100},
    {"type": "tool_call", "name": "WebSearch", "status": "completed", "input": {...}, "output": {...}, "duration": 500},
    {"type": "message", "content": "The CEO of Anthropic is Dario Amodei.", "timestamp": 700}
  ],
  "metadata": {
    "category": "search",
    "agent": "--schema ./claude.json",
    "trajectoryRichness": "full",
    "turnCount": 1
  },
  "timing": {
    "start": 1704067200000,
    "end": 1704067201234,
    "firstResponse": 100,
    "sessionCreation": 234,
    "total": 1234,
    "inputTokens": 150,
    "outputTokens": 85
  },
  "toolErrors": false
}

Output Fields

Field	Description
`input`	Original prompt (string or string[] for multi-turn)
`hint`	Grader context hint (if provided)
`metadata.trajectoryRichness`	`"full"` \| `"messages-only"` \| `"minimal"`
`metadata.turnCount`	Number of conversation turns (1 for string, N for array)
`timing.sessionCreation`	Time to create session (ms)
`timing.total`	Total duration (end - start)
`timing.inputTokens`	Input tokens consumed (if available from adapter)
`timing.outputTokens`	Output tokens generated (if available from adapter)
`toolErrors`	Whether any tool calls failed

Note: toolErrors replaces misleading status: 'passed'|'failed'. Real pass/fail comes from YOUR grader.

Schema Exports

Consumers can import Zod schemas directly:

import { CaptureResultSchema, TrialResultSchema } from '@plaited/agent-eval-harness/schemas'

// Validate external data
const result = CaptureResultSchema.parse(jsonData)

// Generate JSON Schema (Zod 4 native)
import { z } from 'zod'
const jsonSchema = z.toJSONSchema(CaptureResultSchema)

Discriminated Unions for Reliability Metrics

Reliability metrics include a type discriminator for type-safe parsing:

import { z } from 'zod'
import {
  ReliabilityMetricsSchema,       // type: 'run'
  TrialsReliabilityMetricsSchema  // type: 'trial'
} from '@plaited/agent-eval-harness/schemas'

// Create a unified schema for both metric types
const UnifiedReliabilitySchema = z.discriminatedUnion('type', [
  ReliabilityMetricsSchema,
  TrialsReliabilityMetricsSchema,
])

// Type-safe parsing with automatic narrowing
const metrics = UnifiedReliabilitySchema.parse(data)
if (metrics.type === 'run') {
  // TypeScript knows: ReliabilityMetrics
  console.log(metrics.toolErrors, metrics.completionRate)
} else {
  // TypeScript knows: TrialsReliabilityMetrics
  console.log(metrics.avgPassExpK, metrics.medianPassExpK)
}

Or export JSON schemas for non-TypeScript tools:

bunx @plaited/agent-eval-harness schemas --json -o schemas.json
bunx @plaited/agent-eval-harness schemas CaptureResult --json

Execution Environment

Recommendation: Run the harness in Docker containers for consistent, isolated execution.

# Run integration tests via Docker
docker compose -f docker-compose.test.yml run --rm test

# Or with explicit API keys
ANTHROPIC_API_KEY=sk-... GEMINI_API_KEY=... docker compose -f docker-compose.test.yml run --rm test

Docker Requirements

Requirement	Reason
Node.js 24+	Gemini CLI uses modern JS features (optional chaining)
Non-root user	Claude CLI blocks `--dangerously-skip-permissions` as root
Gemini API key	Pass `GEMINI_API_KEY` for Gemini CLI

See docker-evals.md for complete Docker setup guide, debugging tips, and CI integration patterns.

Multi-turn Conversations

Use input: string[] to execute multi-turn conversations within a single session:

{"id":"context-001","input":["Remember this number: 42","What number did I ask you to remember?"],"hint":"42"}
{"id":"context-002","input":["My name is Alice","What is my name?"],"hint":"Alice"}

Run with the headless adapter:

# Using Claude Code via headless adapter
bunx @plaited/agent-eval-harness capture multi-turn.jsonl \
  bunx @plaited/agent-eval-harness headless --schema ./claude-headless.json \
  -o results.jsonl

# Using Gemini CLI via headless adapter
GEMINI_API_KEY=... bunx @plaited/agent-eval-harness capture multi-turn.jsonl \
  bunx @plaited/agent-eval-harness headless --schema ./gemini-headless.json \
  -o results.jsonl

Key points:

Each JSONL entry = 1 fresh session
input: string[] sends sequential turns to the same session
Works with both stream mode (Claude) and iterative mode (Gemini)
The adapter handles context preservation automatically

Downstream Integration

The harness outputs standard JSONL that pipes to any tool:

# Filter with jq
cat results.jsonl | jq 'select(.metadata.category == "ui")'

# Count tool usage
cat results.jsonl | jq -s 'map(.trajectory | map(select(.type == "tool_call")) | length) | add'

# Summarize for quick analysis
bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl

# Compare runs with built-in strategies
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json

Quick Reference

Resource	Description
`bunx @plaited/agent-eval-harness`	CLI help
output-formats.md	JSONL schemas, command details
inline-graders.md	Single input/output graders (TypeScript, Python, shell)
comparison-graders.md	Comparison strategies (weighted, statistical, LLM-as-Judge)
calibration.md	Grader calibration workflow
eval-concepts.md	Evaluation concepts (pass@k, pass^k)
docker-evals.md	Docker setup, debugging, CI integration

headless-adapters skill - Schema-driven adapters for headless CLI agents

agent-eval-harness

Más de este repositorio

Agent Eval Harness

Purpose

Installation

Core Principle: Capture Once, Derive Many Views

Commands

Core Commands

Pipeline Commands (Unix-style)

Capture Command

Basic Usage

Arguments

Examples

Trials Command

Output

Summarize Command

Calibrate Command

Validate-Refs Command

Why Use This?

Input Format

Output Format

Balance Command

Why Use This?

Output Format

Balanced Eval Design

Pipeline Workflow

Full Pipeline Example

Run Command

Extract Command

Grade Command

Format Command

Compare Command

Trials Comparison (pass@k Analysis)

Built-in Comparison Strategies

Comparison Report Output

Comparison Grader Interface

Pipeline Workflow Diagram

Schemas Command

Available Schemas

Usage in Other Languages

Grader Interface

Git-Based Grading (Recommended for Coding Tasks)

Output-Based Grading (General Purpose)

Python/Executable Graders

Input Format

Output Format

Output Fields

Schema Exports

Discriminated Unions for Reliability Metrics

Execution Environment

Docker Requirements

Multi-turn Conversations

Downstream Integration

Quick Reference

Related

Agent Eval Harness

Purpose

Installation

Core Principle: Capture Once, Derive Many Views

Commands

Core Commands

Pipeline Commands (Unix-style)

Capture Command

Basic Usage

Arguments

Examples

Trials Command

Output

Summarize Command

Calibrate Command

Validate-Refs Command

Why Use This?

Input Format

Output Format

Balance Command

Why Use This?

Output Format

Balanced Eval Design

Pipeline Workflow

Full Pipeline Example