| name | agent-eval-harness |
| description | CLI tool for capturing agent trajectories. Execute prompts against headless CLI agents via schema-driven adapters, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring. |
| compatibility | Bun >= 1.2.9 |
Agent Eval Harness
Purpose
CLI tool for capturing trajectories from headless CLI agents, optimized for TypeScript/JavaScript projects using Bun.
The harness captures. You score.
| Harness Provides | You Provide |
|---|
| Prompt execution via headless adapters | Scoring logic (Braintrust, custom scripts) |
| Full trajectory capture (thoughts, tools, plans) | Pass/fail determination via graders |
| Structured JSONL output | LLM-as-judge prompts |
| Reproducible execution environment | CI integration, golden file comparison |
Use this when:
- Capturing trajectories for downstream evaluation
- Generating training data (SFT/DPO) with full context
- Building regression test fixtures for agent behavior
- Comparing agent responses across configurations
Installation
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl
bun add @plaited/agent-eval-harness
Core Principle: Capture Once, Derive Many Views
flowchart LR
Prompts["prompts.jsonl"] --> Capture["capture/trials"]
Schema["headless schema"] --> Capture
Capture --> Results["results.jsonl (full trajectory)"]
Results --> Summarize["summarize"]
Results --> Calibrate["calibrate"]
Results --> Custom["(your tools)"]
Summarize --> Views["summary.jsonl / .md"]
Calibrate --> Report["calibration.md"]
Custom --> Pipeline["any scoring platform"]
Single output format: Full trajectory JSONL (always)
No --format flag: Derive views with separate commands
Schema exports: Zod schemas + JSON Schema for any tooling
Commands
Core Commands
| Command | Input | Output | Purpose |
|---|
capture | prompts.jsonl + schema | results.jsonl | Trajectory capture (full) |
trials | prompts.jsonl + schema | trials.jsonl | Multi-run + optional metrics |
summarize | results.jsonl | summary.jsonl or .md | Derive compact views |
calibrate | results.jsonl | calibration.md | Sample failures for review |
validate-refs | prompts.jsonl | validation.jsonl | Check reference solutions |
balance | prompts.jsonl | balance.json | Analyze test set coverage |
schemas | (none) | JSON Schema | Export schemas for non-TS users |
Pipeline Commands (Unix-style)
| Command | Input | Output | Purpose |
|---|
run | prompts.jsonl + schema | raw.jsonl | Execute prompts, raw output |
extract | raw.jsonl + schema | extracted.jsonl | Parse trajectories |
grade | extracted.jsonl + grader | graded.jsonl | Apply grader scoring |
format | results.jsonl | jsonl/markdown/csv | Convert output format |
compare | multiple results.jsonl | comparison.json | Compare runs (aggregate report) |
All commands support optional --grader ./grader.ts for scoring.
Capture Command
Basic Usage
bunx @plaited/agent-eval-harness capture <prompts.jsonl> --schema <schema.json> [options]
Arguments
| Argument/Flag | Description | Default |
|---|
prompts.jsonl | Input file with prompts to execute | Required |
-s, --schema | Path to headless adapter schema | Required |
-o, --output | Output file/path | stdout |
-c, --cwd | Working directory for agent | current |
-t, --timeout | Request timeout in ms | 60000 |
-j, --concurrency | Number of concurrent workers | 1 |
--workspace-dir | Base directory for per-prompt workspace isolation | none |
--progress | Show progress to stderr | false |
--append | Append to output file | false |
-g, --grader | Path to grader module | none |
--debug | Show detailed CLI output for debugging | false |
Examples
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -j 4 -o results.jsonl
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json \
-j 4 --workspace-dir ./workspaces -o results.jsonl
bunx @plaited/agent-eval-harness capture prompts.jsonl bun ./my-adapter.ts -o results.jsonl
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.ts -o results.jsonl
Trials Command
Run each prompt multiple times for pass@k/pass^k analysis.
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -o trials.jsonl
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 --grader ./grader.ts -o trials.jsonl
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -j 4 -o trials.jsonl
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -j 4 \
--workspace-dir ./workspaces -o trials.jsonl
Parallelization notes:
-j/--concurrency parallelizes across prompts (not trials within a prompt)
- Each prompt's k trials still run sequentially (required for aggregation)
- With 151 prompts and
-j 4, you get 4 prompts running trials concurrently
--workspace-dir creates {workspace-dir}/prompt-{id}-trial-{n}/ for each trial
- Progress logging shows aggregate completion (e.g.,
12/50 prompts completed)
Workspace cleanup:
Directories persist after completion for debugging. Clean up manually:
rm -rf ./workspaces
- run: rm -rf ./workspaces
if: always()
Output
Without grader:
{"id":"search-001","input":"Find the CEO","k":5,"trials":[{"trialNum":1,"output":"...","trajectory":[...],"duration":1234},...]}
With grader:
{"id":"search-001","input":"Find the CEO","k":5,"passRate":0.8,"passAtK":0.99,"passExpK":0.33,"trials":[{"trialNum":1,"output":"...","pass":true,"score":1.0},...]}
Summarize Command
Derive compact views from full trajectory results.
bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl
bunx @plaited/agent-eval-harness summarize results.jsonl --markdown -o results.md
Calibrate Command
Sample failures for grader review. Calibration helps you distinguish between agent failures (agent did wrong thing) and grader bugs (agent was correct, grader too strict).
bunx @plaited/agent-eval-harness calibrate results.jsonl --sample 10 -o calibration.md
bunx @plaited/agent-eval-harness calibrate results.jsonl --grader ./loose-grader.ts --sample 10 -o comparison.md
See eval-concepts.md for why calibration matters.
Validate-Refs Command
Check that reference solutions pass your grader before evaluating agents.
bunx @plaited/agent-eval-harness validate-refs prompts.jsonl --grader ./grader.ts -o validation.jsonl
cat validation.jsonl | jq 'select(.pass == false)'
Why Use This?
If your reference solution fails your own grader:
- The task definition is ambiguous
- The grader is too strict
- The hint is wrong
Fix the eval before evaluating the agent.
Input Format
Prompts must include a reference field:
{"id":"test-001","input":"Create a button component","hint":"<button>","reference":"export const Button = () => <button>Click</button>"}
Output Format
{"id":"test-001","input":"Create a button component","reference":"export const Button = () => <button>Click</button>","pass":true,"score":1.0,"reasoning":"Contains hint content"}
Balance Command
Analyze test set coverage to ensure balanced evaluation.
bunx @plaited/agent-eval-harness balance prompts.jsonl -o balance.json
bunx @plaited/agent-eval-harness balance prompts.jsonl | jq .
Why Use This?
An eval with only "make X work" misses "don't break Y". Balance analysis shows:
- Category distribution (from
metadata.category)
- Positive/negative case ratio
- Coverage gaps
Output Format
{
"totalCases": 50,
"categories": [
{ "name": "ui", "count": 20, "percentage": 40 },
{ "name": "logic", "count": 15, "percentage": 30 },
{ "name": "api", "count": 10, "percentage": 20 },
{ "name": "edge-case", "count": 5, "percentage": 10 }
],
"underrepresented": ["edge-case"],
"suggestions": ["Consider adding more test cases for: edge-case"]
}
Balanced Eval Design
Include both positive and negative cases:
| Type | Example | Purpose |
|---|
| Positive | "Add a login button" | Agent should succeed |
| Negative | "Add a button without breaking tests" | Agent should not break things |
| Edge case | "Handle empty input gracefully" | Agent should be robust |
See eval-concepts.md for more on balanced test sets.
Pipeline Workflow
The pipeline commands enable Unix-style composition for flexible evaluation workflows.
Full Pipeline Example
cat prompts.jsonl | \
bunx @plaited/agent-eval-harness run -s claude.json | \
bunx @plaited/agent-eval-harness extract -s claude.json | \
bunx @plaited/agent-eval-harness grade -g ./grader.ts | \
bunx @plaited/agent-eval-harness format -f markdown > report.md
Run Command
Execute prompts and output raw results. Three modes available:
bunx @plaited/agent-eval-harness run prompts.jsonl --schema claude.json
bunx @plaited/agent-eval-harness run prompts.jsonl --simple "claude -p {} --output-format stream-json"
bunx @plaited/agent-eval-harness run prompts.jsonl --shell 'claude -p "$PROMPT" --output-format stream-json'
⚠️ Security Warning: The --simple and --shell modes execute prompts via shell commands. Prompts are escaped but do not use untrusted prompt content with these modes. Malicious prompt text could potentially escape the quoting and execute arbitrary commands. Use --schema mode (headless adapter) for untrusted inputs.
Extract Command
Parse raw output into structured trajectories:
bunx @plaited/agent-eval-harness extract raw.jsonl --schema claude.json -o extracted.jsonl
bunx @plaited/agent-eval-harness run prompts.jsonl -s claude.json | \
bunx @plaited/agent-eval-harness extract -s claude.json
Grade Command
Apply grader to extracted results:
bunx @plaited/agent-eval-harness grade extracted.jsonl --grader ./grader.ts -o graded.jsonl
Format Command
Convert results to different output formats:
bunx @plaited/agent-eval-harness format results.jsonl --style markdown -o report.md
bunx @plaited/agent-eval-harness format results.jsonl --style csv -o results.csv
bunx @plaited/agent-eval-harness format results.jsonl --style jsonl
Compare Command
Compare multiple runs of the same prompts. Supports both CaptureResult (single-run) and TrialResult (multi-run reliability) formats with auto-detection.
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl --strategy statistical -o comparison.json
COMPARE_QUALITY=0.7 COMPARE_LATENCY=0.2 COMPARE_RELIABILITY=0.1 \
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl --format markdown -o report.md
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl \
--strategy custom --grader ./my-llm-judge.ts -o comparison.json
bunx @plaited/agent-eval-harness compare \
--run "with-mcp:results-mcp.jsonl" \
--run "vanilla:results-vanilla.jsonl" \
-o comparison.json
Use cases for compare:
- Same agent, different MCP servers
- Same agent, different skills enabled
- Same agent, different model versions
- Different agents entirely
Trials Comparison (pass@k Analysis)
Compare TrialResult files for reliability analysis:
bunx @plaited/agent-eval-harness compare trials1.jsonl trials2.jsonl -o comparison.json
bunx @plaited/agent-eval-harness compare trials1.jsonl trials2.jsonl --input-format trials -o comparison.json
COMPARE_CAPABILITY=0.5 COMPARE_RELIABILITY=0.3 COMPARE_CONSISTENCY=0.2 \
bunx @plaited/agent-eval-harness compare trials1.jsonl trials2.jsonl -o comparison.json
Trials metrics:
| Metric | Description | Formula |
|---|
| Capability (passAtK) | Can solve at least once in K tries | 1 - (1-p)^k |
| Reliability (passExpK) | Solves consistently every time | p^k |
| Flakiness | Gap between capability and reliability | passAtK - passExpK |
| Quality (scores) | Aggregate grader scores across trials | avg/median/p25/p75 (only with grader) |
| Performance (latency) | Aggregate trial durations | p50/p90/p99/mean/min/max (always present) |
Built-in Comparison Strategies
For CaptureResult (single-run):
| Strategy | Description | Env Vars |
|---|
weighted (default) | Quality, latency, reliability | COMPARE_QUALITY, COMPARE_LATENCY, COMPARE_RELIABILITY |
statistical | Bootstrap for confidence intervals | COMPARE_BOOTSTRAP_ITERATIONS |
custom | Your own grader | --grader path |
For TrialResult (multi-run):
| Strategy | Description | Env Vars |
|---|
weighted (default) | Capability, reliability, consistency | COMPARE_CAPABILITY, COMPARE_RELIABILITY, COMPARE_CONSISTENCY |
statistical | Bootstrap passAtK confidence intervals | COMPARE_BOOTSTRAP_ITERATIONS |
custom | Your own grader | --grader path |
Comparison Report Output
CaptureResult format outputs ComparisonReport:
{
"meta": { "generatedAt": "...", "runs": ["baseline", "variant"], "promptCount": 100 },
"quality": { "baseline": { "avgScore": 0.85, "passRate": 0.82 }, "variant": { ... } },
"performance": { "baseline": { "latency": { "p50": 1200, "p90": 3400 } }, ... },
"reliability": { "baseline": { "type": "run", "toolErrors": 5, "completionRate": 0.99 }, ... },
"headToHead": { "pairwise": [{ "runA": "baseline", "runB": "variant", "aWins": 35, "bWins": 55 }] }
}
With --strategy statistical, quality and performance metrics include 95% confidence intervals:
{
"quality": {
"baseline": {
"avgScore": 0.85,
"passRate": 0.82,
"confidenceIntervals": {
"avgScore": [0.82, 0.88],
"passRate": [0.79, 0.85]
}
}
},
"performance": {
"baseline": {
"latency": { "p50": 1200, "mean": 1350 },
"confidenceIntervals": {
"latencyMean": [1280, 1420]
}
}
}
}
TrialResult format outputs TrialsComparisonReport:
{
"meta": { "generatedAt": "...", "runs": ["claude", "gemini"], "promptCount": 50, "trialsPerPrompt": 5, "inputFormat": "trials" },
"capability": { "claude": { "avgPassAtK": 0.92, "medianPassAtK": 0.95 }, "gemini": { "..." : "..." } },
"reliability": { "claude": { "type": "trial", "avgPassExpK": 0.78, "medianPassExpK": 0.82 }, "gemini": { "..." : "..." } },
"flakiness": { "claude": { "avgFlakiness": 0.14, "flakyPromptCount": 12 }, "gemini": { "..." : "..." } },
"quality": { "claude": { "avgScore": 0.85, "medianScore": 0.90, "p25Score": 0.75, "p75Score": 0.95 }, "gemini": { "..." : "..." } },
"performance": { "claude": { "latency": { "p50": 1200, "p90": 3400, "p99": 5100, "mean": 1500, "min": 800, "max": 5200 }, "totalDuration": 375000 }, "gemini": { "..." : "..." } },
"headToHead": {
"capability": [{ "runA": "claude", "runB": "gemini", "aWins": 28, "bWins": 18, "ties": 4 }],
"reliability": ["..."],
"overall": ["..."]
}
}
Notes:
quality is only present when a grader was used (trials have score fields)
performance is always present (every trial has duration)
With --strategy statistical, capability, reliability, quality, and performance metrics include 95% confidence intervals:
{
"capability": {
"claude": {
"avgPassAtK": 0.92,
"confidenceIntervals": { "avgPassAtK": [0.88, 0.95] }
}
},
"reliability": {
"claude": {
"type": "trial",
"avgPassExpK": 0.78,
"confidenceIntervals": { "avgPassExpK": [0.72, 0.84] }
}
},
"quality": {
"claude": {
"avgScore": 0.85,
"confidenceIntervals": { "avgScore": [0.82, 0.88] }
}
},
"performance": {
"claude": {
"latency": { "mean": 1500 },
"confidenceIntervals": { "latencyMean": [1380, 1620] }
}
}
}
See comparison-graders.md for complete comparison grader documentation including LLM-as-Judge patterns.
Comparison Grader Interface
CaptureResult grader:
import type { ComparisonGrader } from '@plaited/agent-eval-harness/pipeline'
export const grade: ComparisonGrader = async ({ id, input, hint, runs }) => {
return {
rankings: [
{ run: 'with-mcp', rank: 1, score: 0.9 },
{ run: 'vanilla', rank: 2, score: 0.7 },
],
reasoning: 'MCP run produced more accurate output'
}
}
TrialResult grader:
import type { TrialsComparisonGrader } from '@plaited/agent-eval-harness/pipeline'
export const grade: TrialsComparisonGrader = async ({ id, input, hint, runs }) => {
return {
rankings: [
{ run: 'claude', rank: 1, score: 0.92 },
{ run: 'gemini', rank: 2, score: 0.85 },
],
reasoning: 'Claude has higher reliability with lower flakiness'
}
}
Pipeline Workflow Diagram
flowchart LR
Prompts["prompts.jsonl"] --> Run["run"]
Schema["headless schema"] --> Run
Run --> Raw["raw.jsonl"]
Raw --> Extract["extract"]
Schema --> Extract
Extract --> Extracted["extracted.jsonl"]
Extracted --> Grade["grade"]
Grader["grader.ts"] --> Grade
Grade --> Graded["graded.jsonl"]
Graded --> Format["format"]
Format --> Output["report.md / .csv / .jsonl"]
Graded --> Compare["compare"]
Results2["other runs..."] --> Compare
CompareGrader["compare-grader.ts"] --> Compare
Compare --> Comparison["comparison.jsonl"]
Schemas Command
Export JSON schemas for non-TypeScript tools.
bunx @plaited/agent-eval-harness schemas
bunx @plaited/agent-eval-harness schemas --json -o schemas.json
bunx @plaited/agent-eval-harness schemas CaptureResult --json
bunx @plaited/agent-eval-harness schemas TrialResult --json
bunx @plaited/agent-eval-harness schemas GraderResult --json
Available Schemas
| Schema | Description |
|---|
CaptureResult | Single capture output (id, input, output, trajectory, timing) |
TrialResult | Multi-run trial output (includes passAtK, passExpK) |
GraderResult | Grader return value (pass, score, reasoning) |
PromptInput | Input prompt format |
TrajectoryStep | Single step in trajectory array |
SummaryResult | Compact summary format |
Usage in Other Languages
Export schemas for validation in Python, Go, etc.:
bunx @plaited/agent-eval-harness schemas --json -o schemas.json
python -c "
import json
from jsonschema import validate
with open('schemas.json') as f:
schemas = json.load(f)
with open('results.jsonl') as f:
for line in f:
result = json.loads(line)
validate(result, schemas['CaptureResult'])
print(f'{result[\"id\"]}: valid')
"
Grader Interface
Graders provide semantic pass/fail scoring for captured trajectories. The harness supports graders written in any language.
Git-Based Grading (Recommended for Coding Tasks)
Grade outcomes, not paths. Use the optional cwd parameter to detect environmental changes with git:
import type { Grader } from '@plaited/agent-eval-harness/schemas'
export const grade: Grader = async ({ output, hint, cwd }) => {
if (!cwd) return { pass: false, score: 0, reasoning: 'No cwd' }
const status = await Bun.$`git -C ${cwd} status --porcelain`.text()
const filesCreated = status
.split('\n')
.filter(line => line.startsWith('??'))
.map(line => line.slice(3).trim())
const testResult = await Bun.$`cd ${cwd} && bun test`.nothrow()
return {
pass: filesCreated.length > 0 && testResult.exitCode === 0,
score: testResult.exitCode === 0 ? 1 : 0,
reasoning: `Files: ${filesCreated.join(', ')}. Tests: ${testResult.exitCode === 0 ? 'pass' : 'fail'}`,
outcome: {
filesCreated,
testsPassed: testResult.exitCode === 0,
type: 'file_creation_with_tests'
}
}
}
See inline-graders.md for comprehensive git-based grading patterns.
Output-Based Grading (General Purpose)
import type { Grader } from '@plaited/agent-eval-harness/schemas'
export const grade: Grader = async ({ input, output, hint, trajectory }) => {
const pass = output.toLowerCase().includes(hint?.toLowerCase() ?? '')
return {
pass,
score: pass ? 1 : 0,
reasoning: pass ? 'Contains hint content' : 'Missing hint content'
}
}
Note: input can be string (single turn) or string[] (multi-turn). The hint field provides grader context (renamed from expected).
Python/Executable Graders
Any executable can be a grader using stdin/stdout JSON protocol:
import json, sys
data = json.load(sys.stdin)
output = data.get("output", "").lower()
hint = (data.get("hint") or "").lower()
pass_result = hint in output if hint else True
print(json.dumps({
"pass": pass_result,
"score": 1.0 if pass_result else 0.0,
"reasoning": "Contains hint" if pass_result else "Missing hint"
}))
chmod +x ./grader.py
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.py -o results.jsonl
See inline-graders.md for complete grader documentation including LLM-as-Judge patterns.
Input Format
Each line in prompts.jsonl:
{"id":"test-001","input":"Create a button","hint":"should contain <button>"}
{"id":"test-002","input":["Create a button","Make it blue"],"metadata":{"category":"ui"}}
| Field | Required | Description |
|---|
id | Yes | Unique identifier |
input | Yes | Single prompt (string) or conversation turns (string[]) |
hint | No | Grader context - what to look for (not strict match) |
reference | No | Reference solution (for validate-refs) |
metadata | No | Tags, category, difficulty for filtering |
timeout | No | Override default timeout for this prompt |
Session behavior: Each JSONL entry = 1 fresh session
input: string → 1 session, 1 prompt
input: string[] → 1 session, N prompts (sequential turns)
Output Format
Full trajectory JSONL (always):
{
"id": "test-001",
"input": "Find the CEO of Anthropic",
"output": "The CEO of Anthropic is Dario Amodei.",
"hint": "should mention Dario Amodei",
"trajectory": [
{"type": "thought", "content": "I'll search for this...", "timestamp": 100},
{"type": "tool_call", "name": "WebSearch", "status": "completed", "input": {...}, "output": {...}, "duration": 500},
{"type": "message", "content": "The CEO of Anthropic is Dario Amodei.", "timestamp": 700}
],
"metadata": {
"category": "search",
"agent": "--schema ./claude.json",
"trajectoryRichness": "full",
"turnCount": 1
},
"timing": {
"start": 1704067200000,
"end": 1704067201234,
"firstResponse": 100,
"sessionCreation": 234,
"total": 1234,
"inputTokens": 150,
"outputTokens": 85
},
"toolErrors": false
}
Output Fields
| Field | Description |
|---|
input | Original prompt (string or string[] for multi-turn) |
hint | Grader context hint (if provided) |
metadata.trajectoryRichness | "full" | "messages-only" | "minimal" |
metadata.turnCount | Number of conversation turns (1 for string, N for array) |
timing.sessionCreation | Time to create session (ms) |
timing.total | Total duration (end - start) |
timing.inputTokens | Input tokens consumed (if available from adapter) |
timing.outputTokens | Output tokens generated (if available from adapter) |
toolErrors | Whether any tool calls failed |
Note: toolErrors replaces misleading status: 'passed'|'failed'. Real pass/fail comes from YOUR grader.
Schema Exports
Consumers can import Zod schemas directly:
import { CaptureResultSchema, TrialResultSchema } from '@plaited/agent-eval-harness/schemas'
const result = CaptureResultSchema.parse(jsonData)
import { z } from 'zod'
const jsonSchema = z.toJSONSchema(CaptureResultSchema)
Discriminated Unions for Reliability Metrics
Reliability metrics include a type discriminator for type-safe parsing:
import { z } from 'zod'
import {
ReliabilityMetricsSchema,
TrialsReliabilityMetricsSchema
} from '@plaited/agent-eval-harness/schemas'
const UnifiedReliabilitySchema = z.discriminatedUnion('type', [
ReliabilityMetricsSchema,
TrialsReliabilityMetricsSchema,
])
const metrics = UnifiedReliabilitySchema.parse(data)
if (metrics.type === 'run') {
console.log(metrics.toolErrors, metrics.completionRate)
} else {
console.log(metrics.avgPassExpK, metrics.medianPassExpK)
}
Or export JSON schemas for non-TypeScript tools:
bunx @plaited/agent-eval-harness schemas --json -o schemas.json
bunx @plaited/agent-eval-harness schemas CaptureResult --json
Execution Environment
Recommendation: Run the harness in Docker containers for consistent, isolated execution.
docker compose -f docker-compose.test.yml run --rm test
ANTHROPIC_API_KEY=sk-... GEMINI_API_KEY=... docker compose -f docker-compose.test.yml run --rm test
Docker Requirements
| Requirement | Reason |
|---|
| Node.js 24+ | Gemini CLI uses modern JS features (optional chaining) |
| Non-root user | Claude CLI blocks --dangerously-skip-permissions as root |
| Gemini API key | Pass GEMINI_API_KEY for Gemini CLI |
See docker-evals.md for complete Docker setup guide, debugging tips, and CI integration patterns.
Multi-turn Conversations
Use input: string[] to execute multi-turn conversations within a single session:
{"id":"context-001","input":["Remember this number: 42","What number did I ask you to remember?"],"hint":"42"}
{"id":"context-002","input":["My name is Alice","What is my name?"],"hint":"Alice"}
Run with the headless adapter:
bunx @plaited/agent-eval-harness capture multi-turn.jsonl \
bunx @plaited/agent-eval-harness headless --schema ./claude-headless.json \
-o results.jsonl
GEMINI_API_KEY=... bunx @plaited/agent-eval-harness capture multi-turn.jsonl \
bunx @plaited/agent-eval-harness headless --schema ./gemini-headless.json \
-o results.jsonl
Key points:
- Each JSONL entry = 1 fresh session
input: string[] sends sequential turns to the same session
- Works with both
stream mode (Claude) and iterative mode (Gemini)
- The adapter handles context preservation automatically
Downstream Integration
The harness outputs standard JSONL that pipes to any tool:
cat results.jsonl | jq 'select(.metadata.category == "ui")'
cat results.jsonl | jq -s 'map(.trajectory | map(select(.type == "tool_call")) | length) | add'
bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json
Quick Reference
Related