web-search-agent-evals
| Field | Value |
|---|---|
| name | web-search-agent-evals |
| description | Development assistant for web search agent evaluations across multiple CLI agents |
| compatibility | Bun >= 1.2.9 |
Development assistant for running and comparing web search capabilities across CLI agents.
Evaluates 4 agents (Claude Code, Gemini, Droid, Codex) against 2 search providers (builtin, You.com MCP), for 8 agent×provider pairings.
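The scenario matrix is simply the cross product of agents and providers. A minimal sketch of how the 8 pairings enumerate (the agent and provider names come from this repo; the helper itself is illustrative):

```typescript
// Illustrative only: enumerate the 8 agent×provider scenarios.
const AGENTS = ['claude-code', 'gemini', 'droid', 'codex'] as const
const PROVIDERS = ['builtin', 'you'] as const

const scenarios = AGENTS.flatMap((agent) =>
  PROVIDERS.map((provider) => ({ agent, provider })),
)
console.log(scenarios.length) // 8
```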
Key Features:
Architecture:

- agent-schemas/ - Headless adapter JSON schemas
- mcp-servers.ts - TypeScript MCP server constants
- docker/entrypoint - Bun shell script for runtime config
- scripts/ - Type-safe execution and comparison CLI tools
- docker/ - Container infrastructure

# Full dataset (151 prompts), k=5 — all 8 agent×provider scenarios
bun run trials
# Quick smoke test (5 random prompts, single trial)
bun run trials -- --count 5 -k 1
# Specific agent or provider
bun run trials -- --agent claude-code --search-provider builtin
bun run trials -- --agent gemini --search-provider you
# Trial type presets
bun run trials -- --trial-type capability # k=10, deep exploration
bun run trials -- --trial-type regression # k=3, fast regression check
# Custom k value
bun run trials -- -k 7
# Control parallelism
bun run trials -- -j 4 # Limit to 4 containers
bun run trials -- --prompt-concurrency 4 # 4 prompts per container
# Direct Docker (manual testing)
docker compose run --rm -e SEARCH_PROVIDER=builtin claude-code
docker compose run --rm -e SEARCH_PROVIDER=you -e PROMPT_COUNT=5 gemini
Comparisons are written to data/comparisons/YYYY-MM-DD/.
# Latest date auto-detected
bun run compare
# Statistical analysis with bootstrap confidence intervals
bun run compare:stat
# Specific date or filter
bun run compare -- --run-date 2026-02-18
bun run compare -- --agent droid
bun run compare -- --search-provider builtin
bun run compare -- --trial-type capability
# View results
cat data/comparisons/2026-02-18/all-builtin-weighted.json | jq '.capability'
cat data/comparisons/2026-02-18/builtin-vs-you-weighted.json | jq '.headToHead.capability'
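Latest-date auto-detection presumably reduces to sorting the dated directory names, since YYYY-MM-DD sorts lexicographically. A hypothetical sketch:

```typescript
import { readdirSync } from 'node:fs'

// Hypothetical: pick the most recent YYYY-MM-DD directory under data/comparisons/.
const latestRunDate = (root = 'data/comparisons'): string | undefined =>
  readdirSync(root)
    .filter((name) => /^\d{4}-\d{2}-\d{2}$/.test(name))
    .sort()
    .at(-1)
```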
Comparison strategies:

- weighted (default) - Capability, reliability, and consistency weighted scoring
- statistical - Bootstrap sampling with 95% confidence intervals (sketched after the report section below)

Generate a comprehensive REPORT.md from comparison results:
# Latest date auto-detected
bun run report
# Specific date
bun run report -- --run-date 2026-02-18
# Preview without writing
bun run report -- --dry-run
Report includes:
Output: data/comparisons/YYYY-MM-DD/REPORT.md
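The statistical strategy listed above resamples per-prompt pass rates with replacement. A self-contained sketch of a bootstrap 95% confidence interval, not the harness's actual implementation:

```typescript
// Illustrative bootstrap 95% CI over per-prompt pass rates.
const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length

const bootstrapCI = (passRates: number[], iterations = 10_000) => {
  const means: number[] = []
  for (let i = 0; i < iterations; i++) {
    // Resample with replacement, same size as the original dataset.
    const sample = Array.from(
      { length: passRates.length },
      () => passRates[Math.floor(Math.random() * passRates.length)],
    )
    means.push(mean(sample))
  }
  means.sort((a, b) => a - b)
  return {
    mean: mean(passRates),
    lower: means[Math.floor(iterations * 0.025)],
    upper: means[Math.floor(iterations * 0.975)],
  }
}
```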
Interactive wizard to sample failures and review grader accuracy. Helps distinguish between agent failures (agent got it wrong) and grader bugs (agent was correct, grader too strict).
# Interactive calibration (recommended)
bun run calibrate
Interactive prompts:
Output: data/calibration/{date}-{agent}-{provider}.md
What calibration reveals:
See @agent-eval-harness calibration docs for grader calibration concepts.
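A rough sketch of what failure sampling could look like (hypothetical helper; the actual wizard is interactive), assuming the TrialResult shape shown in the Results section below:

```typescript
// Hypothetical: sample failed trial results for manual grader review.
type TrialResult = { id: string; input: string; passRate: number }

const sampleFailures = async (path: string, n = 10): Promise<TrialResult[]> => {
  const lines = (await Bun.file(path).text()).trim().split('\n')
  const failures = lines
    .map((line) => JSON.parse(line) as TrialResult)
    .filter((r) => r.passRate < 1)
  // Crude shuffle; fine for picking a handful to review by hand.
  return failures.sort(() => Math.random() - 0.5).slice(0, n)
}
```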
The evaluation harness supports two-level parallelization for optimal performance:
Container-level concurrency (-j, --concurrency) controls how many Docker containers (agent×provider scenarios) run simultaneously.
bun run trials # Unlimited (default, all 8 scenarios at once)
bun run trials -- -j 4 # Limit to 4 containers
bun run trials -- -j 1 # Sequential (debugging)
Use cases:
- -j 4 - Limit concurrency if hitting API rate limits
- -j 2 - Conservative, for low-resource machines
- -j 1 - Sequential execution for debugging

Prompt-level concurrency (--prompt-concurrency) controls how many prompts run in parallel within each container.
bun run trials -- --prompt-concurrency 4 # 4 prompts (moderate parallelism)
bun run trials -- --prompt-concurrency 1 # Sequential (default, safest)
bun run trials -- --prompt-concurrency 8 # 8 prompts (high memory, CI only)
How it works:

- -j flag with --workspace-dir for isolation

Performance comparison:
| Config | Containers | Prompts/Container | Full (151 prompts, k=5) |
|---|---|---|---|
| Default | unlimited | 1 | ~2.5 hrs |
| Faster | unlimited | 4 | ~40 min |
| CI (high memory) | unlimited | 8 | ~20 min |
Warning: Stream-mode agents (claude-code, droid) use ~400-500MB RSS per prompt process. With --prompt-concurrency 8, that's 3-4GB per container — OOM kills are likely in Docker (see issue #45).
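Conceptually, both levels are the same mechanism: bounded concurrency over async tasks. A generic limiter sketch (illustrative, not the harness's actual scheduler):

```typescript
// Illustrative concurrency limiter: run tasks with at most `limit` in flight.
const runLimited = async <T>(
  tasks: (() => Promise<T>)[],
  limit: number,
): Promise<T[]> => {
  const results: T[] = new Array(tasks.length)
  let next = 0
  // Each worker pulls the next unclaimed task until none remain.
  const worker = async () => {
    while (next < tasks.length) {
      const i = next++
      results[i] = await tasks[i]()
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker))
  return results
}
```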
Prompts live in a flat data/prompts/ directory. The format differs by search provider:
"Use {server-name} and answer\n{query}" with MCP metadata| File | Prompts | Metadata | Use With |
|---|---|---|---|
| prompts.jsonl | 151 | No MCP | SEARCH_PROVIDER=builtin |
| prompts-you.jsonl | 151 | mcpServer="ydc-server", expectedTools=["you-search"] | SEARCH_PROVIDER=you |
The entrypoint automatically selects the correct prompt file based on SEARCH_PROVIDER. To run a random subset, pass PROMPT_COUNT (or --count N via CLI):
bun run trials -- --count 5 # 5 random prompts from full dataset
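Random subset selection can be sketched as a shuffle-and-slice over the prompt file (hypothetical helper; the real selection happens in the entrypoint):

```typescript
// Hypothetical sketch: select `count` random prompts from a .jsonl dataset.
const samplePrompts = async (path: string, count: number): Promise<string[]> => {
  const lines = (await Bun.file(path).text()).trim().split('\n')
  // Fisher-Yates shuffle, then take the first `count` lines.
  for (let i = lines.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1))
    ;[lines[i], lines[j]] = [lines[j], lines[i]]
  }
  return lines.slice(0, count)
}
```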
All trial results are written to flat dated directories:
data/results/YYYY-MM-DD/
├── claude-code/
│ ├── builtin.jsonl
│ └── you.jsonl
├── gemini/
├── droid/
└── codex/
Each .jsonl line is a TrialResult:
{"id":"websearch-001","input":"...","k":5,"passRate":0.8,"passAtK":0.999,"passExpK":0.328,"trials":[...]}
Versioning:
git add data/results/ && git commit -m "feat: trial run YYYY-MM-DD"
Compare runs:
bun run compare # Latest date auto-detected
bun run compare -- --run-date 2026-02-18 # Specific date
Create agent-schemas/<agent>.json:
{
"command": ["<agent-cli>", "--flag", "{input}"],
"outputEvents": {
"match": { "path": "$.type" },
"patterns": {
"text": { "value": "text" },
"tool_call": { "value": "tool_call" },
"tool_result": { "value": "tool_result" }
}
},
"result": {
"contentPath": "$.output",
"errorPath": "$.error"
},
"mode": "stream",
"env": ["AGENT_API_KEY"]
}
Key fields:

- command - CLI invocation with {input} placeholder
- outputEvents.match.path - JSONPath to the event type field
- patterns - Map event types to standard names
- result.contentPath - JSONPath to extract the final output
- mode - "stream" (persistent) or "iterative" (new process per turn)
- env - Required environment variables

Test schema:
bunx @plaited/agent-eval-harness adapter:check -- \
bunx @plaited/agent-eval-harness headless --schema agent-schemas/<agent>.json
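To make the schema fields concrete, here is a rough illustration of how an adapter might classify events, with the JSONPath match simplified to a dotted-path lookup (not the harness's actual implementation):

```typescript
// Simplified event classification driven by an adapter schema.
type Patterns = Record<string, { value: string }>

const classify = (
  event: Record<string, unknown>,
  matchPath: string,
  patterns: Patterns,
): string => {
  // Resolve a "$.type"-style path by walking dotted segments.
  const value = matchPath
    .replace(/^\$\.?/, '')
    .split('.')
    .reduce<unknown>((obj, key) => (obj as Record<string, unknown> | undefined)?.[key], event)
  for (const [name, { value: expected }] of Object.entries(patterns)) {
    if (value === expected) return name // e.g. "text", "tool_call", "tool_result"
  }
  return 'unknown'
}

classify({ type: 'tool_call' }, '$.type', {
  text: { value: 'text' },
  tool_call: { value: 'tool_call' },
  tool_result: { value: 'tool_result' },
}) // => "tool_call"
```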
Create docker/<agent>.Dockerfile:
FROM base
# Install agent CLI
RUN npm install -g <agent-cli>
# Copy entrypoint and MCP config
COPY docker/entrypoint /entrypoint
COPY mcp-servers.ts /eval/mcp-servers.ts
RUN chmod +x /entrypoint
CMD ["/entrypoint"]
Verify installation:
docker build -t test-<agent> -f docker/<agent>.Dockerfile .
docker run --rm test-<agent> <agent> --version
Add to docker-compose.yml:
<agent>:
build:
context: .
dockerfile: docker/<agent>.Dockerfile
volumes:
- ./agent-schemas:/eval/agent-schemas:ro
- ./data:/eval/data
- ./scripts:/eval/scripts:ro
working_dir: /workspace
env_file: .env
environment:
- AGENT=<agent>
- SEARCH_PROVIDER=${SEARCH_PROVIDER:-builtin}
Edit docker/entrypoint to add the agent to the configureMcp() function:
const configureMcp = async (agent: string, tool: McpServerKey): Promise<void> => {
const server = MCP_SERVERS[tool]
const apiKey = server.auth ? process.env[server.auth.envVar] : undefined
switch (agent) {
// ... existing cases ...
case '<agent>': {
await $`<agent> mcp add ${server.name} ${server.url} --header "Authorization: Bearer ${apiKey}"`.quiet()
console.log('✓ Agent MCP server added')
break
}
}
}
Add a timeout if needed in buildTrialsCommand():
switch (AGENT) {
case '<agent>':
cmd.push('--timeout', '120000') // 2 minutes
break
}
Edit scripts/shared/shared.constants.ts to add agent to ALL_AGENTS:
export const ALL_AGENTS: Agent[] = ["claude-code", "gemini", "droid", "codex", "<agent>"]
Also update the Agent type in scripts/shared/shared.types.ts:
type Agent = "claude-code" | "gemini" | "droid" | "codex" | "<agent>"
docker compose build <agent>
docker compose run --rm -e SEARCH_PROVIDER=builtin <agent>
docker compose run --rm -e SEARCH_PROVIDER=you <agent>
Add the new server to mcp-servers.ts:

export const MCP_SERVERS = {
you: { /* ... existing */ },
exa: {
name: 'exa-server',
type: 'http' as const,
url: 'https://api.exa.ai/mcp',
auth: {
type: 'bearer' as const,
envVar: 'EXA_API_KEY',
},
},
} as const
export type McpServerKey = keyof typeof MCP_SERVERS
Add the new tool case to configureMcp() for each agent:
case 'claude-code': {
await $`claude mcp add --transport http ${server.name} ${server.url} --header "Authorization: Bearer ${apiKey}"`.quiet()
break
}
case 'gemini': {
await $`gemini mcp add --transport http --header "Authorization: Bearer ${apiKey}" ${server.name} ${server.url}`.quiet()
break
}
case 'droid': {
await $`droid mcp add ${server.name} ${server.url} --type http --header "Authorization: Bearer ${apiKey}"`.quiet()
break
}
case 'codex': {
const configDir = `${process.env.HOME}/.codex`
await $`mkdir -p ${configDir}`.quiet()
const config = `[mcp_servers.${server.name}]
url = "${server.url}"
bearer_token_env_var = "${server.auth?.envVar}"
`
await Bun.write(`${configDir}/config.toml`, config)
break
}
Add to .env and .env.example:
EXA_API_KEY=your_api_key_here
Use the generate-mcp-prompts script to create MCP variant files with proper metadata:
# Generate variants for new MCP server
bun scripts/generate-mcp-prompts.ts --mcp-key exa
# Creates:
# - data/prompts/prompts-exa.jsonl
The script prepends "Use {server-name} and answer\n" to each query and adds MCP metadata (server name and expected tools).
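The per-line transformation is roughly the following (illustrative; the Prompt field names are inferred from the TrialResult sample above, and the exa tool name is an assumption):

```typescript
// Illustrative per-line rewrite: prompts.jsonl -> prompts-exa.jsonl.
type Prompt = { id: string; input: string }

const toMcpVariant = (p: Prompt, serverName: string, expectedTools: string[]) => ({
  ...p,
  // Prepend the MCP instruction to the original query.
  input: `Use ${serverName} and answer\n${p.input}`,
  mcpServer: serverName, // e.g. "exa-server"
  expectedTools, // e.g. ["exa-search"] (assumed tool name)
})
```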
The entrypoint automatically handles provider-specific prompt files:
const promptFile = SEARCH_PROVIDER === "builtin"
? `/eval/data/prompts/prompts.jsonl`
: `/eval/data/prompts/prompts-${SEARCH_PROVIDER}.jsonl` // e.g., prompts-exa.jsonl
Note: scripts/run-trials.ts automatically picks up new MCP servers from mcp-servers.ts, so no manual updates are needed.
docker compose build
docker compose run --rm -e SEARCH_PROVIDER=exa claude-code
bun run trials -- --search-provider exa --count 5 -k 1
Current agent schemas:
| Schema | Agent | Mode | Status |
|---|---|---|---|
| claude-code.json | Claude Code | stream | ✅ Tested |
| gemini.json | Gemini CLI | iterative | ✅ Tested |
| droid.json | Droid CLI | stream | ✅ Tested |
| codex.json | Codex CLI | stream | ✅ Tested |
Session Modes:
- stream - Process stays alive, multi-turn conversations via stdin
- iterative - New process per turn, history passed as context
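Conceptually, the two modes differ only in process lifetime. An illustrative contrast (the --prompt flag is a hypothetical stand-in for each agent's real CLI interface):

```typescript
// Iterative mode (illustrative): a fresh process per turn, prior turns passed as context.
const iterativeTurn = async (cli: string, history: string[], input: string) => {
  const proc = Bun.spawn([cli, '--prompt', [...history, input].join('\n')])
  return (await new Response(proc.stdout).text()).trim()
}
// Stream mode instead keeps one long-lived process and writes each turn to its stdin.
```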