| name | web-search-agent-evals |
| description | Development assistant for web search agent evaluations across multiple CLI agents |
| compatibility | Bun >= 1.2.9 |
Web Search Agent Evaluations
Development assistant for running and comparing web search capabilities across CLI agents.
Overview
Evaluate 4 agents (Claude Code, Gemini, Droid, Codex) with 2 tools (builtin, You.com MCP) = 8 pairings.
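The pairing count is the cross product of agents and search providers. A minimal sketch of that matrix, assuming the identifiers used by the --agent and --search-provider flags in this document (the `pairings` name is hypothetical and not taken from the repo):

```typescript
// Agent and provider identifiers as used by the CLI flags in this document.
const AGENTS = ["claude-code", "gemini", "droid", "codex"] as const
const PROVIDERS = ["builtin", "you"] as const

type Pairing = {
  agent: (typeof AGENTS)[number]
  provider: (typeof PROVIDERS)[number]
}

// Cross product: every agent runs once per search provider.
const pairings: Pairing[] = AGENTS.flatMap((agent) =>
  PROVIDERS.map((provider) => ({ agent, provider })),
)

console.log(pairings.length) // 8
```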
Key Features:
- Headless adapters - Schema-driven CLI agent execution via @plaited/agent-eval-harness
- Flag-based architecture - Single service per agent, mode selected via environment variables
- Type-safe constants - MCP server definitions in TypeScript
- Isolated execution - Each pairing runs in its own Docker container
Architecture:
agent-schemas/ - Headless adapter JSON schemas
mcp-servers.ts - TypeScript MCP server constants
docker/entrypoint - Bun shell script for runtime config
scripts/ - Type-safe execution and comparison CLI tools
docker/ - Container infrastructure
Quick Commands
Run Evaluations
bun run trials
bun run trials -- --count 5 -k 1
bun run trials -- --agent claude-code --search-provider builtin
bun run trials -- --agent gemini --search-provider you
bun run trials -- --trial-type capability
bun run trials -- --trial-type regression
bun run trials -- -k 7
bun run trials -- -j 4
bun run trials -- --prompt-concurrency 4
docker compose run --rm -e SEARCH_PROVIDER=builtin claude-code
docker compose run --rm -e SEARCH_PROVIDER=you -e PROMPT_COUNT=5 gemini
Compare Results
Comparisons are written to data/comparisons/YYYY-MM-DD/.
bun run compare
bun run compare:stat
bun run compare -- --run-date 2026-02-18
bun run compare -- --agent droid
bun run compare -- --search-provider builtin
bun run compare -- --trial-type capability
cat data/comparisons/2026-02-18/all-builtin-weighted.json | jq '.capability'
cat data/comparisons/2026-02-18/builtin-vs-you-weighted.json | jq '.headToHead.capability'
Comparison strategies:
weighted (default) - Capability, reliability, and consistency weighted scoring
statistical - Bootstrap sampling with 95% confidence intervals
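To make the statistical strategy concrete, here is a sketch of a percentile bootstrap over per-prompt pass rates. It is an illustration only: the function names, the seeded generator, and the sample data are assumptions, and the harness may compute its intervals differently.

```typescript
// Small seeded PRNG (mulberry32) so the sketch is reproducible.
const mulberry32 = (seed: number) => {
  let a = seed
  return (): number => {
    a = (a + 0x6d2b79f5) | 0
    let t = Math.imul(a ^ (a >>> 15), a | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length

// Percentile bootstrap: resample with replacement, record the mean of each
// resample, then read the 2.5th and 97.5th percentiles of those means.
const bootstrapCI = (
  samples: number[],
  iterations = 1000,
  rng: () => number = Math.random,
) => {
  if (samples.length === 0) throw new Error("need at least one sample")
  const means: number[] = []
  for (let i = 0; i < iterations; i++) {
    const resample = samples.map(
      () => samples[Math.floor(rng() * samples.length)] as number,
    )
    means.push(mean(resample))
  }
  means.sort((a, b) => a - b)
  const at = (q: number): number =>
    means[Math.min(means.length - 1, Math.floor(q * means.length))] as number
  return { mean: mean(samples), lower: at(0.025), upper: at(0.975) }
}

// Illustrative pass rates, not real results.
const passRates = [0.8, 1, 0.6, 1, 0.4, 0.8, 1, 0.2]
console.log(bootstrapCI(passRates, 2000, mulberry32(42)))
```

Two agents whose intervals overlap heavily are hard to tell apart from this data alone, which is the main reason to prefer this strategy over a single weighted score when differences are small.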
Generate Report
Generate a comprehensive REPORT.md from comparison results:
bun run report
bun run report -- --run-date 2026-02-18
bun run report -- --dry-run
Report includes:
- Executive summary with best capability, reliability, and performance
- Quality rankings with pass@k and pass^k scores
- Performance rankings (latency P50/P90/P99)
- Flakiness analysis with top flaky prompts
- MCP tool impact analysis (builtin vs MCP comparison)
- Tool call statistics (P50/P90/P99/mean per provider)
- Tool call distribution histograms
- Failing prompts list (pass@k = 0%) with query text
Output: data/comparisons/YYYY-MM-DD/REPORT.md
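For readers unfamiliar with the P50/P90/P99 columns, the sketch below computes nearest-rank percentiles. The latency values are made up for illustration, and the report script may use a different percentile method (for example, linear interpolation).

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of the
// samples are less than or equal to it.
const percentile = (values: number[], p: number): number => {
  if (values.length === 0) throw new Error("no samples")
  const sorted = [...values].sort((a, b) => a - b)
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100))
  return sorted[rank - 1] as number
}

// Illustrative latencies in milliseconds, not real measurements.
const latenciesMs = [1200, 950, 4300, 1800, 2100, 990, 15000, 1750, 2600, 1300]

const summary = {
  p50: percentile(latenciesMs, 50),
  p90: percentile(latenciesMs, 90),
  p99: percentile(latenciesMs, 99),
}
console.log(summary) // { p50: 1750, p90: 4300, p99: 15000 }
```

Note how a single slow run dominates P99 while leaving P50 untouched, which is why the report lists all three.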
Calibrate Grader
Interactive wizard to sample failures and review grader accuracy. Helps distinguish between agent failures (agent got it wrong) and grader bugs (agent was correct, grader too strict).
bun run calibrate
Interactive prompts:
- Run date - Select from available dated runs
- Agents - Multi-select via numbers or "all"
- Search providers - Multi-select via numbers or "all"
- Sample count - Number of failures to sample (default: 5)
Output: data/calibration/{date}-{agent}-{provider}.md
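The sampling step can be pictured as drawing a few failed prompts at random. The sketch below is hypothetical: it assumes a failure means passRate is 0, and the names `TrialSummary`, `isFailure`, and `sampleFailures` are invented for illustration rather than taken from the calibrate script.

```typescript
type TrialSummary = { id: string; passRate: number }

// Assumption for this sketch: a prompt is a failure when no trial passed.
const isFailure = (r: TrialSummary): boolean => r.passRate === 0

// Partial Fisher-Yates shuffle: pick up to `count` distinct failures uniformly at random.
const sampleFailures = (
  results: TrialSummary[],
  count = 5,
  rng: () => number = Math.random,
): TrialSummary[] => {
  const pool = results.filter(isFailure)
  const n = Math.min(count, pool.length)
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(rng() * (pool.length - i))
    const tmp = pool[i] as TrialSummary
    pool[i] = pool[j] as TrialSummary
    pool[j] = tmp
  }
  return pool.slice(0, n)
}

const sampled = sampleFailures([
  { id: "websearch-001", passRate: 0 },
  { id: "websearch-002", passRate: 0.8 },
  { id: "websearch-003", passRate: 0 },
])
console.log(sampled.map((r) => r.id))
```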
What calibration reveals:
- ❌ Grader too strict - Agent gave correct answer, grader rejected valid paraphrasing
- ❌ Hint too vague - Grader can't tell good from bad answers
- ✅ Real failures - Agent genuinely gave wrong/incomplete answer
See @agent-eval-harness calibration docs for grader calibration concepts.
Parallelization
The evaluation harness supports two levels of parallelization, across containers and across prompts within a container:
Container-Level Concurrency (-j, --concurrency)
Controls how many Docker containers (agent×provider scenarios) run simultaneously.
bun run trials
bun run trials -- -j 4
bun run trials -- -j 1
Use cases:
- Unlimited (default) - All scenarios at once, I/O-bound workload handles it fine
- -j 4 - Limit concurrency if hitting API rate limits
- -j 2 - Conservative, for low-resource machines
- -j 1 - Sequential execution for debugging
Prompt-Level Concurrency (--prompt-concurrency)
Controls how many prompts run in parallel within each container.
bun run trials -- --prompt-concurrency 4
bun run trials -- --prompt-concurrency 1
bun run trials -- --prompt-concurrency 8
How it works:
- Uses the harness -j flag with --workspace-dir for isolation
- Each prompt gets its own workspace directory
- Web searches are I/O-bound, so running prompts in parallel keeps the network busy instead of waiting on one request at a time
Performance comparison:
| Config | Containers | Prompts/Container | Full (151 prompts, k=5) |
|---|---|---|---|
| Default | unlimited | 1 | ~2.5 hrs |
| Faster | unlimited | 4 | ~40 min |
| CI (high memory) | unlimited | 8 | ~20 min |
Warning: Stream-mode agents (claude-code, droid) use ~400-500MB RSS per prompt process. With --prompt-concurrency 8, that is roughly 3.2-4GB per container, so OOM kills are likely in Docker (see issue #45).
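The memory figures in the warning can be turned into a quick sizing check. This is a rough sketch: it uses only the 400-500MB per-process range quoted above, ignores the container's base overhead, and the function names are hypothetical.

```typescript
// Per-process RSS range quoted above for stream-mode agents, in MB.
const RSS_MB = { min: 400, max: 500 }

// Estimated memory range for one container at a given prompt concurrency.
const estimateContainerMemoryMb = (promptConcurrency: number) => ({
  min: RSS_MB.min * promptConcurrency,
  max: RSS_MB.max * promptConcurrency,
})

// Largest prompt concurrency that stays under a container memory limit,
// budgeting for the worst-case RSS per process.
const maxSafeConcurrency = (limitMb: number): number =>
  Math.max(1, Math.floor(limitMb / RSS_MB.max))

console.log(estimateContainerMemoryMb(8)) // { min: 3200, max: 4000 }
console.log(maxSafeConcurrency(2048)) // 4
```

Remember that container-level concurrency multiplies this again: several containers at --prompt-concurrency 8 each need their own 3.2-4GB.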
Prompts
Prompts live in a flat data/prompts/ directory. The format differs by search provider:
- Builtin mode: Just the query (e.g., "What are the best free icon libraries...")
- MCP mode: "Use {server-name} and answer\n{query}" with MCP metadata
| File | Prompts | Metadata | Use With |
|---|---|---|---|
| prompts.jsonl | 151 | No MCP | SEARCH_PROVIDER=builtin |
| prompts-you.jsonl | 151 | mcpServer="ydc-server", expectedTools=["you-search"] | SEARCH_PROVIDER=you |
The entrypoint automatically selects the correct prompt file based on SEARCH_PROVIDER. To run a random subset, pass PROMPT_COUNT (or --count N via CLI):
bun run trials -- --count 5
Results
All trial results are written to flat dated directories:
data/results/YYYY-MM-DD/
├── claude-code/
│ ├── builtin.jsonl
│ └── you.jsonl
├── gemini/
├── droid/
└── codex/
Each .jsonl line is a TrialResult:
{"id":"websearch-001","input":"...","k":5,"passRate":0.8,"passAtK":0.999,"passExpK":0.328,"trials":[...]}
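The passAtK and passExpK values in the example line are consistent, up to rounding, with the standard definitions under independent trials: pass@k is the chance that at least one of k trials passes, and pass^k is the chance that all k pass. A sketch under that assumption (the harness may use a different estimator internally):

```typescript
// pass@k: probability that at least one of k independent trials passes.
const passAtK = (passRate: number, k: number): number => 1 - (1 - passRate) ** k

// pass^k: probability that all k independent trials pass.
const passExpK = (passRate: number, k: number): number => passRate ** k

const trials = [true, true, false, true, true] // 4 of 5 trials passed
const passRate = trials.filter(Boolean).length / trials.length

console.log(passRate) // 0.8
console.log(passAtK(passRate, 5).toFixed(5)) // 0.99968
console.log(passExpK(passRate, 5).toFixed(3)) // 0.328
```

The gap between the two numbers is a useful flakiness signal: a prompt with high pass@k but low pass^k succeeds sometimes but not reliably.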
Versioning:
git add data/results/ && git commit -m "feat: trial run YYYY-MM-DD"
Compare runs:
bun run compare
bun run compare -- --run-date 2026-02-18
Adding a New Agent
1. Create Headless Adapter Schema
Create agent-schemas/<agent>.json:
{
  "command": ["<agent-cli>", "--flag", "{input}"],
  "outputEvents": {
    "match": { "path": "$.type" },
    "patterns": {
      "text": { "value": "text" },
      "tool_call": { "value": "tool_call" },
      "tool_result": { "value": "tool_result" }
    }
  },
  "result": {
    "contentPath": "$.output",
    "errorPath": "$.error"
  },
  "mode": "stream",
  "env": ["AGENT_API_KEY"]
}
Key fields:
command - CLI invocation with {input} placeholder
outputEvents.match.path - JSONPath to event type field
patterns - Map event types to standard names
result.contentPath - JSONPath to extract final output
mode - "stream" (persistent) or "iterative" (new process per turn)
env - Required environment variables
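To show how these fields fit together, the sketch below substitutes the {input} placeholder and maps one line of agent output to a standard event name. It is a simplified illustration, not the harness implementation: `buildCommand`, `readPath`, and `classifyEvent` are hypothetical helpers, and only plain "$.field" paths are handled rather than full JSONPath.

```typescript
type AdapterSchema = {
  command: string[]
  outputEvents: {
    match: { path: string }
    patterns: Record<string, { value: string }>
  }
}

const schema: AdapterSchema = {
  command: ["<agent-cli>", "--flag", "{input}"],
  outputEvents: {
    match: { path: "$.type" },
    patterns: {
      text: { value: "text" },
      tool_call: { value: "tool_call" },
      tool_result: { value: "tool_result" },
    },
  },
}

// Substitute the {input} placeholder into the command template.
const buildCommand = (s: AdapterSchema, input: string): string[] =>
  s.command.map((part) => part.split("{input}").join(input))

// Resolve only simple "$.field" paths; real JSONPath is richer than this.
const readPath = (event: Record<string, unknown>, path: string): unknown =>
  path.startsWith("$.") ? event[path.slice(2)] : undefined

// Map one JSON line of agent output to a standard event name, or null if unmatched.
const classifyEvent = (s: AdapterSchema, line: string): string | null => {
  const event = JSON.parse(line) as Record<string, unknown>
  const value = readPath(event, s.outputEvents.match.path)
  for (const [name, pattern] of Object.entries(s.outputEvents.patterns)) {
    if (pattern.value === value) return name
  }
  return null
}

console.log(buildCommand(schema, "best free icon libraries"))
console.log(classifyEvent(schema, '{"type":"tool_call","name":"search"}')) // tool_call
```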
Test schema:
bunx @plaited/agent-eval-harness adapter:check -- \
bunx @plaited/agent-eval-harness headless --schema agent-schemas/<agent>.json
2. Create Dockerfile
Create docker/<agent>.Dockerfile:
FROM base
# Install agent CLI
RUN npm install -g <agent-cli>
# Copy entrypoint and MCP config
COPY docker/entrypoint /entrypoint
COPY mcp-servers.ts /eval/mcp-servers.ts
RUN chmod +x /entrypoint
CMD ["/entrypoint"]
Verify installation:
docker build -t test-<agent> -f docker/<agent>.Dockerfile .
docker run --rm test-<agent> <agent> --version
3. Add Docker Compose Service
Add to docker-compose.yml:
  <agent>:
    build:
      context: .
      dockerfile: docker/<agent>.Dockerfile
    volumes:
      - ./agent-schemas:/eval/agent-schemas:ro
      - ./data:/eval/data
      - ./scripts:/eval/scripts:ro
    working_dir: /workspace
    env_file: .env
    environment:
      - AGENT=<agent>
      - SEARCH_PROVIDER=${SEARCH_PROVIDER:-builtin}
4. Update TypeScript Entrypoint
Edit docker/entrypoint to add the agent to the configureMcp() function:
const configureMcp = async (agent: string, tool: McpServerKey): Promise<void> => {
  const server = MCP_SERVERS[tool]
  const apiKey = server.auth ? process.env[server.auth.envVar] : undefined
  switch (agent) {
    case '<agent>': {
      await $`<agent> mcp add ${server.name} ${server.url} --header "Authorization: Bearer ${apiKey}"`.quiet()
      console.log('✓ Agent MCP server added')
      break
    }
  }
}
Add a timeout, if needed, in buildTrialsCommand():
switch (AGENT) {
  case '<agent>':
    cmd.push('--timeout', '120000')
    break
}
5. Update Scripts
Edit scripts/shared/shared.constants.ts to add the agent to ALL_AGENTS:
export const ALL_AGENTS: Agent[] = ["claude-code", "gemini", "droid", "codex", "<agent>"]
Also update the Agent type in scripts/shared/shared.types.ts:
type Agent = "claude-code" | "gemini" | "droid" | "codex" | "<agent>"
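Because the list and the type must be updated together, it is easy for them to drift apart. One possible alternative, shown here only as a sketch and not as how the repo is organized, derives the type from a const array and adds a runtime guard (`isAgent` is a hypothetical helper):

```typescript
// Single source of truth: the type is derived from the array.
const ALL_AGENTS = ["claude-code", "gemini", "droid", "codex"] as const
type Agent = (typeof ALL_AGENTS)[number]

// Runtime guard for values coming from CLI flags or environment variables.
const isAgent = (value: string): value is Agent =>
  (ALL_AGENTS as readonly string[]).includes(value)

console.log(isAgent("gemini")) // true
console.log(isAgent("unknown-agent")) // false
```

With this shape, adding a new agent is a one-line change to the array, and the compiler flags any switch statement that no longer covers every Agent.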
6. Test
docker compose build <agent>
docker compose run --rm -e SEARCH_PROVIDER=builtin <agent>
docker compose run --rm -e SEARCH_PROVIDER=you <agent>
Adding a New MCP Tool
1. Add to mcp-servers.ts
export const MCP_SERVERS = {
  you: { },
  exa: {
    name: 'exa-server',
    type: 'http' as const,
    url: 'https://api.exa.ai/mcp',
    auth: {
      type: 'bearer' as const,
      envVar: 'EXA_API_KEY',
    },
  },
} as const
export type McpServerKey = keyof typeof MCP_SERVERS
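The entrypoint snippets in this document read the API key with process.env[server.auth.envVar], which is undefined when the variable is not set. A small helper that fails fast can make that case obvious. This is a hypothetical sketch: `McpServer` and `authHeader` are illustrative names, and the environment is passed in as a plain object so the function stays easy to test.

```typescript
type McpServer = {
  name: string
  type: "http"
  url: string
  auth?: { type: "bearer"; envVar: string }
}

// Build the Authorization header for a server, throwing when the key is missing.
const authHeader = (
  server: McpServer,
  env: Record<string, string | undefined>,
): string | undefined => {
  if (!server.auth) return undefined
  const apiKey = env[server.auth.envVar]
  if (!apiKey) throw new Error(`Missing ${server.auth.envVar} for ${server.name}`)
  return `Authorization: Bearer ${apiKey}`
}

const exa: McpServer = {
  name: "exa-server",
  type: "http",
  url: "https://api.exa.ai/mcp",
  auth: { type: "bearer", envVar: "EXA_API_KEY" },
}

console.log(authHeader(exa, { EXA_API_KEY: "test-key" })) // Authorization: Bearer test-key
```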
2. Update docker/entrypoint
Add the new tool case to configureMcp() for each agent:
case 'claude-code': {
  await $`claude mcp add --transport http ${server.name} ${server.url} --header "Authorization: Bearer ${apiKey}"`.quiet()
  break
}
case 'gemini': {
  await $`gemini mcp add --transport http --header "Authorization: Bearer ${apiKey}" ${server.name} ${server.url}`.quiet()
  break
}
case 'droid': {
  await $`droid mcp add ${server.name} ${server.url} --type http --header "Authorization: Bearer ${apiKey}"`.quiet()
  break
}
case 'codex': {
  const configDir = `${process.env.HOME}/.codex`
  await $`mkdir -p ${configDir}`.quiet()
  const config = `[mcp_servers.${server.name}]
url = "${server.url}"
bearer_token_env_var = "${server.auth?.envVar}"
`
  await Bun.write(`${configDir}/config.toml`, config)
  break
}
3. Update Environment Files
Add to .env and .env.example:
EXA_API_KEY=your_api_key_here
4. Generate MCP Prompt Sets
Use the generate-mcp-prompts script to create MCP variant files with proper metadata:
bun scripts/generate-mcp-prompts.ts --mcp-key exa
The script prepends "Use {server-name} and answer\n" to each query and adds MCP metadata (server name and expected tools).
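The transformation described above can be sketched as a pure function. The record shape is an assumption: id and input follow the TrialResult example, mcpServer and expectedTools follow the prompts table, and nesting them under a metadata field, like the `toMcpPrompt` name itself, is a guess rather than the script's confirmed output.

```typescript
type Prompt = { id: string; input: string; metadata?: Record<string, unknown> }

// Prefix the query with the server instruction and attach MCP metadata.
const toMcpPrompt = (
  prompt: Prompt,
  serverName: string,
  expectedTools: string[],
): Prompt => ({
  ...prompt,
  input: `Use ${serverName} and answer\n${prompt.input}`,
  metadata: { ...prompt.metadata, mcpServer: serverName, expectedTools },
})

const builtin: Prompt = {
  id: "websearch-001",
  input: "What are the best free icon libraries...",
}
console.log(JSON.stringify(toMcpPrompt(builtin, "ydc-server", ["you-search"])))
```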
The entrypoint automatically handles provider-specific prompt files:
const promptFile = SEARCH_PROVIDER === "builtin"
? `/eval/data/prompts/prompts.jsonl`
: `/eval/data/prompts/prompts-${SEARCH_PROVIDER}.jsonl`
Note: scripts/run-trials.ts automatically picks up new MCP servers from mcp-servers.ts, so no manual updates are needed.
5. Test
docker compose build
docker compose run --rm -e SEARCH_PROVIDER=exa claude-code
bun run trials -- --search-provider exa --count 5 -k 1
Schema Format Reference
Current agent schemas:
| Schema | Agent | Mode | Status |
|---|---|---|---|
| claude-code.json | Claude Code | stream | ✅ Tested |
| gemini.json | Gemini CLI | iterative | ✅ Tested |
| droid.json | Droid CLI | stream | ✅ Tested |
| codex.json | Codex CLI | stream | ✅ Tested |
Session Modes:
stream - Process stays alive, multi-turn conversations via stdin
iterative - New process per turn, history passed as context
Related Skills