web-search-agent-evals
Development assistant for web search agent evaluations across multiple CLI agents
| Field | Value |
|---|---|
| name | web-search-agent-evals |
| description | Development assistant for web search agent evaluations across multiple CLI agents |
| compatibility | Bun >= 1.2.9 |
Development assistant for running and comparing web search capabilities across CLI agents.
Evaluate 4 agents (Claude Code, Gemini, Droid, Codex) with 2 tools (builtin, You.com MCP) = 8 pairings.
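As a minimal sketch of how the 8 pairings arise, the snippet below builds the cross product of agents and providers. The agent and provider identifiers are the ones used by the CLI flags in this document; the `Scenario` type and the `buildScenarios` helper are hypothetical names used only for illustration, not the project's actual code.

```typescript
// Identifiers taken from the CLI examples in this document.
const AGENT_NAMES = ["claude-code", "gemini", "droid", "codex"] as const;
const PROVIDER_NAMES = ["builtin", "you"] as const;

// Hypothetical shape for one agent×provider scenario.
type Scenario = {
  agent: (typeof AGENT_NAMES)[number];
  provider: (typeof PROVIDER_NAMES)[number];
};

// Build every agent×provider pairing (4 × 2 = 8).
function buildScenarios(): Scenario[] {
  const result: Scenario[] = [];
  for (const agent of AGENT_NAMES) {
    for (const provider of PROVIDER_NAMES) {
      result.push({ agent, provider });
    }
  }
  return result;
}
```

Each scenario corresponds to one container run, for example `claude-code` with `builtin`.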
Key Features:
Architecture:
- agent-schemas/ - Headless adapter JSON schemas
- mcp-servers.ts - TypeScript MCP server constants
- docker/entrypoint - Bun shell script for runtime config
- scripts/ - Type-safe execution and comparison CLI tools
- docker/ - Container infrastructure

# Full dataset (151 prompts), k=5, all 8 agent×provider scenarios
bun run trials
# Quick smoke test (5 random prompts, single trial)
bun run trials -- --count 5 -k 1
# Specific agent or provider
bun run trials -- --agent claude-code --search-provider builtin
bun run trials -- --agent gemini --search-provider you
# Trial type presets
bun run trials -- --trial-type capability # k=10, deep exploration
bun run trials -- --trial-type regression # k=3, fast regression check
# Custom k value
bun run trials -- -k 7
# Control parallelism
bun run trials -- -j 4 # Limit to 4 containers
bun run trials -- --prompt-concurrency 4 # 4 prompts per container
# Direct Docker (manual testing)
docker compose run --rm -e SEARCH_PROVIDER=builtin claude-code
docker compose run --rm -e SEARCH_PROVIDER=you -e PROMPT_COUNT=5 gemini
Comparisons are written to data/comparisons/YYYY-MM-DD/.
# Latest date auto-detected
bun run compare
# Statistical analysis with bootstrap confidence intervals
bun run compare:stat
# Specific date or filter
bun run compare -- --run-date 2026-02-18
bun run compare -- --agent droid
bun run compare -- --search-provider builtin
bun run compare -- --trial-type capability
# View results
cat data/comparisons/2026-02-18/all-builtin-weighted.json | jq '.capability'
cat data/comparisons/2026-02-18/builtin-vs-you-weighted.json | jq '.headToHead.capability'
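The `compare:stat` command above reports bootstrap confidence intervals. The sketch below shows the general percentile-bootstrap technique for the mean of a list of per-prompt pass rates. It is an illustration under assumptions, not the harness's implementation: the function names, the default of 1000 resamples, and the seeded generator are all hypothetical.

```typescript
// Small deterministic PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap: resample with replacement, take the mean of each
// resample, then read the 2.5th and 97.5th percentiles of those means.
function bootstrapMeanCI(
  values: number[],
  resamples: number = 1000,
  rng: () => number = Math.random,
): { lower: number; upper: number } {
  if (values.length === 0) throw new Error("need at least one value");
  const means: number[] = [];
  for (let i = 0; i < resamples; i++) {
    let sum = 0;
    for (let j = 0; j < values.length; j++) {
      sum += values[Math.floor(rng() * values.length)];
    }
    means.push(sum / values.length);
  }
  means.sort((a, b) => a - b);
  return {
    lower: means[Math.floor(0.025 * (resamples - 1))],
    upper: means[Math.ceil(0.975 * (resamples - 1))],
  };
}

// Usage: a 95% interval for the mean pass rate of seven prompts.
const exampleInterval = bootstrapMeanCI([0.8, 0.6, 1, 0.4, 0.2, 1, 0.8], 1000, mulberry32(42));
```

Two agents whose intervals overlap heavily are hard to tell apart at this sample size, which is the main reason to prefer the statistical strategy for close comparisons.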
Comparison strategies:
- weighted (default) - Capability, reliability, and consistency weighted scoring
- statistical - Bootstrap sampling with 95% confidence intervals

Generate a comprehensive REPORT.md from comparison results:
# Latest date auto-detected
bun run report
# Specific date
bun run report -- --run-date 2026-02-18
# Preview without writing
bun run report -- --dry-run
Report includes:
Output: data/comparisons/YYYY-MM-DD/REPORT.md
Interactive wizard to sample failures and review grader accuracy. Helps distinguish between agent failures (agent got it wrong) and grader bugs (agent was correct, grader too strict).
# Interactive calibration (recommended)
bun run calibrate
Interactive prompts:
Output: data/calibration/{date}-{agent}-{provider}.md
What calibration reveals:
See @agent-eval-harness calibration docs for grader calibration concepts.
The evaluation harness supports two-level parallelization for optimal performance:
Container-level parallelism (-j, --concurrency)
Controls how many Docker containers (agent×provider scenarios) run simultaneously.
bun run trials # Unlimited (default, all 8 scenarios at once)
bun run trials -- -j 4 # Limit to 4 containers
bun run trials -- -j 1 # Sequential (debugging)
Use cases:
- -j 4 - Limit concurrency if hitting API rate limits
- -j 2 - Conservative, for low-resource machines
- -j 1 - Sequential execution for debugging

Prompt-level parallelism (--prompt-concurrency)
Controls how many prompts run in parallel within each container.
bun run trials -- --prompt-concurrency 4 # 4 prompts (moderate parallelism)
bun run trials -- --prompt-concurrency 1 # Sequential (default, safest)
bun run trials -- --prompt-concurrency 8 # 8 prompts (high memory, CI only)
How it works:
-j flag with --workspace-dir for isolation

Performance comparison:
| Config | Containers | Prompts/Container | Full (151 prompts, k=5) |
|---|---|---|---|
| Default | unlimited | 1 | ~2.5 hrs |
| Faster | unlimited | 4 | ~40 min |
| CI (high memory) | unlimited | 8 | ~20 min |
Warning: Stream-mode agents (claude-code, droid) use ~400-500MB RSS per prompt process. With --prompt-concurrency 8, that is roughly 3-4GB per container, so OOM kills are likely in Docker (see issue #45).
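The arithmetic behind the warning can be sketched as a one-line estimate: per-container memory is roughly the prompt concurrency times the RSS of one prompt process. The helper name below is hypothetical, and the figures are the approximate ones quoted above, not measurements.

```typescript
// Rough per-container memory estimate in MB.
// rssPerPromptMb is the approximate resident set size of one prompt process.
function estimateContainerMemoryMb(promptConcurrency: number, rssPerPromptMb: number): number {
  return promptConcurrency * rssPerPromptMb;
}

// With --prompt-concurrency 8 and 400-500MB per process:
const lowEstimateMb = estimateContainerMemoryMb(8, 400); // 3200
const highEstimateMb = estimateContainerMemoryMb(8, 500); // 4000
```

With unlimited container concurrency, this per-container figure applies to every container running at the same time, so host memory needs scale with the number of simultaneous scenarios as well.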
Prompts live in a flat data/prompts/ directory. The format differs by search provider:

- Builtin: the plain query, with no MCP metadata
- MCP providers: "Use {server-name} and answer\n{query}" with MCP metadata

| File | Prompts | Metadata | Use With |
|---|---|---|---|
| prompts.jsonl | 151 | No MCP | SEARCH_PROVIDER=builtin |
| prompts-you.jsonl | 151 | mcpServer="ydc-server", expectedTools=["you-search"] | SEARCH_PROVIDER=you |
The entrypoint automatically selects the correct prompt file based on SEARCH_PROVIDER. To run a random subset, pass PROMPT_COUNT (or --count N via CLI):
bun run trials -- --count 5 # 5 random prompts from full dataset
All trial results are written to flat dated directories:
data/results/YYYY-MM-DD/
├── claude-code/
│ ├── builtin.jsonl
│ └── you.jsonl
├── gemini/
├── droid/
└── codex/
Each .jsonl line is a TrialResult:
{"id":"websearch-001","input":"...","k":5,"passRate":0.8,"passAtK":0.999,"passExpK":0.328,"trials":[...]}
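The sample values above are consistent with the common definitions pass@k = 1 - (1 - p)^k and pass^k = p^k, where p is the per-trial pass rate. The sketch below assumes those definitions; the harness may compute these fields differently, and the function names are hypothetical.

```typescript
// Fraction of trials that passed.
function passRateOf(trials: boolean[]): number {
  if (trials.length === 0) return 0;
  return trials.filter((passed) => passed).length / trials.length;
}

// Probability that at least one of k independent trials passes (capability).
function passAtK(passRate: number, k: number): number {
  return 1 - Math.pow(1 - passRate, k);
}

// Probability that all k independent trials pass (reliability).
function passExpK(passRate: number, k: number): number {
  return Math.pow(passRate, k);
}

// Four passes out of five trials, as in the sample line above.
const sampleRate = passRateOf([true, true, false, true, true]); // 0.8
const sampleAtK = passAtK(sampleRate, 5); // about 0.9997
const sampleExpK = passExpK(sampleRate, 5); // about 0.328
```

The gap between the two numbers is the point of reporting both: an agent can almost always succeed at least once in five tries while succeeding on all five only about a third of the time.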
Versioning:
git add data/results/ && git commit -m "feat: trial run YYYY-MM-DD"
Compare runs:
bun run compare # Latest date auto-detected
bun run compare -- --run-date 2026-02-18 # Specific date
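Because result directories are named YYYY-MM-DD, the latest run can be found with a plain string sort. The sketch below illustrates that idea only; how the comparison script actually auto-detects the date is not shown in this document, and `latestRunDate` is a hypothetical helper.

```typescript
// Pick the most recent YYYY-MM-DD directory name, ignoring anything else.
// ISO-style dates sort chronologically when sorted as strings.
function latestRunDate(dirNames: string[]): string | undefined {
  const dated = dirNames.filter((name) => /^\d{4}-\d{2}-\d{2}$/.test(name)).sort();
  return dated.length > 0 ? dated[dated.length - 1] : undefined;
}

// Usage with a hypothetical listing of data/results/.
const detectedDate = latestRunDate(["2026-01-30", "2026-02-18", "notes"]);
```

Passing `--run-date` explicitly bypasses any such detection, which is useful when re-running a comparison against an older run.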
Create agent-schemas/<agent>.json:
{
  "command": ["<agent-cli>", "--flag", "{input}"],
  "outputEvents": {
    "match": { "path": "$.type" },
    "patterns": {
      "text": { "value": "text" },
      "tool_call": { "value": "tool_call" },
      "tool_result": { "value": "tool_result" }
    }
  },
  "result": {
    "contentPath": "$.output",
    "errorPath": "$.error"
  },
  "mode": "stream",
  "env": ["AGENT_API_KEY"]
}
Key fields:
- command - CLI invocation with {input} placeholder
- outputEvents.match.path - JSONPath to event type field
- patterns - Map event types to standard names
- result.contentPath - JSONPath to extract final output
- mode - "stream" (persistent) or "iterative" (new process per turn)
- env - Required environment variables

Test schema:
bunx @plaited/agent-eval-harness adapter:check -- \
bunx @plaited/agent-eval-harness headless --schema agent-schemas/<agent>.json
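To make the `outputEvents` fields in the schema concrete, here is a minimal sketch of how a raw agent event could be mapped to a standard event name: read the value at `match.path`, then find the pattern whose `value` equals it. This is an illustration only. The harness's real matching logic is not shown in this document, `getByPath` and `classifyEvent` are hypothetical helpers, and the path resolver handles only simple dotted paths such as `$.type`, not full JSONPath.

```typescript
type EventPatterns = Record<string, { value: string }>;

// Resolve a simple dotted path like "$.type" or "$.a.b" (not full JSONPath).
function getByPath(obj: unknown, path: string): unknown {
  const keys = path.replace(/^\$\.?/, "").split(".").filter((key) => key.length > 0);
  let current: unknown = obj;
  for (const key of keys) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Return the standard event name whose pattern value matches the raw event.
function classifyEvent(event: unknown, matchPath: string, patterns: EventPatterns): string | undefined {
  const raw = getByPath(event, matchPath);
  for (const name of Object.keys(patterns)) {
    if (patterns[name].value === raw) return name;
  }
  return undefined;
}

// Usage: a hypothetical agent that emits "message" for text output.
const examplePatterns: EventPatterns = {
  text: { value: "message" },
  tool_call: { value: "tool_call" },
};
const exampleName = classifyEvent({ type: "message", output: "hi" }, "$.type", examplePatterns);
```

The pattern keys are the standard names, so an agent with different event labels only needs different `value` entries in its schema.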
Create docker/<agent>.Dockerfile:
FROM base
# Install agent CLI
RUN npm install -g <agent-cli>
# Copy entrypoint and MCP config
COPY docker/entrypoint /entrypoint
COPY mcp-servers.ts /eval/mcp-servers.ts
RUN chmod +x /entrypoint
CMD ["/entrypoint"]
Verify installation:
docker build -t test-<agent> -f docker/<agent>.Dockerfile .
docker run --rm test-<agent> <agent> --version
Add to docker-compose.yml:
  <agent>:
    build:
      context: .
      dockerfile: docker/<agent>.Dockerfile
    volumes:
      - ./agent-schemas:/eval/agent-schemas:ro
      - ./data:/eval/data
      - ./scripts:/eval/scripts:ro
    working_dir: /workspace
    env_file: .env
    environment:
      - AGENT=<agent>
      - SEARCH_PROVIDER=${SEARCH_PROVIDER:-builtin}
Edit docker/entrypoint to add agent to configureMcp() function:
const configureMcp = async (agent: string, tool: McpServerKey): Promise<void> => {
  const server = MCP_SERVERS[tool]
  const apiKey = server.auth ? process.env[server.auth.envVar] : undefined
  switch (agent) {
    // ... existing cases ...
    case '<agent>': {
      await $`<agent> mcp add ${server.name} ${server.url} --header "Authorization: Bearer ${apiKey}"`.quiet()
      console.log('✓ Agent MCP server added')
      break
    }
  }
}
Add timeout if needed in buildTrialsCommand():
switch (AGENT) {
  case '<agent>':
    cmd.push('--timeout', '120000') // 2 minutes
    break
}
Edit scripts/shared/shared.constants.ts to add agent to ALL_AGENTS:
export const ALL_AGENTS: Agent[] = ["claude-code", "gemini", "droid", "codex", "<agent>"]
Also update the Agent type in scripts/shared/shared.types.ts:
type Agent = "claude-code" | "gemini" | "droid" | "codex" | "<agent>"
docker compose build <agent>
docker compose run --rm -e SEARCH_PROVIDER=builtin <agent>
docker compose run --rm -e SEARCH_PROVIDER=you <agent>
Add the new server to MCP_SERVERS in mcp-servers.ts:
export const MCP_SERVERS = {
  you: { /* ... existing */ },
  exa: {
    name: 'exa-server',
    type: 'http' as const,
    url: 'https://api.exa.ai/mcp',
    auth: {
      type: 'bearer' as const,
      envVar: 'EXA_API_KEY',
    },
  },
} as const
export type McpServerKey = keyof typeof MCP_SERVERS
Add the new tool case to configureMcp() for each agent:
case 'claude-code': {
  await $`claude mcp add --transport http ${server.name} ${server.url} --header "Authorization: Bearer ${apiKey}"`.quiet()
  break
}
case 'gemini': {
  await $`gemini mcp add --transport http --header "Authorization: Bearer ${apiKey}" ${server.name} ${server.url}`.quiet()
  break
}
case 'droid': {
  await $`droid mcp add ${server.name} ${server.url} --type http --header "Authorization: Bearer ${apiKey}"`.quiet()
  break
}
case 'codex': {
  // Codex reads MCP servers from a TOML config file rather than a CLI command
  const configDir = `${process.env.HOME}/.codex`
  await $`mkdir -p ${configDir}`.quiet()
  const config = `[mcp_servers.${server.name}]
url = "${server.url}"
bearer_token_env_var = "${server.auth?.envVar}"
`
  await Bun.write(`${configDir}/config.toml`, config)
  break
}
Add to .env and .env.example:
EXA_API_KEY=your_api_key_here
Use the generate-mcp-prompts script to create MCP variant files with proper metadata:
# Generate variants for new MCP server
bun scripts/generate-mcp-prompts.ts --mcp-key exa
# Creates:
# - data/prompts/prompts-exa.jsonl
The script prepends "Use {server-name} and answer\n" to each query and adds MCP metadata (server name and expected tools).
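The transformation described above can be sketched as a small pure function. The `mcpServer` and `expectedTools` field names come from the prompt table earlier in this document; the `PromptRecord` shape and the `toMcpPrompt` name are assumptions for illustration, and the real script may structure its records differently.

```typescript
// Hypothetical shape of one line in a prompts JSONL file.
type PromptRecord = {
  id: string;
  input: string;
  metadata?: Record<string, unknown>;
};

// Prefix the query with the server instruction and attach MCP metadata.
function toMcpPrompt(prompt: PromptRecord, serverName: string, expectedTools: string[]): PromptRecord {
  return {
    ...prompt,
    input: `Use ${serverName} and answer\n${prompt.input}`,
    metadata: { ...prompt.metadata, mcpServer: serverName, expectedTools },
  };
}

// Usage with the You.com values from the prompt table.
const mcpVariant = toMcpPrompt(
  { id: "websearch-001", input: "What is the latest Bun release?" },
  "ydc-server",
  ["you-search"],
);
```

Keeping the builtin file as the single source and deriving each MCP variant from it means the query text stays identical across providers, so differences in results reflect the search tool rather than the prompt.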
The entrypoint automatically handles provider-specific prompt files:
const promptFile = SEARCH_PROVIDER === "builtin"
? `/eval/data/prompts/prompts.jsonl`
: `/eval/data/prompts/prompts-${SEARCH_PROVIDER}.jsonl` // e.g., prompts-exa.jsonl
Note: scripts/run-trials.ts automatically picks up new MCP servers from mcp-servers.ts, so no manual updates are needed.
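One way such automatic pickup can work is to derive the provider list from the keys of `MCP_SERVERS`, as sketched below. This is an assumption about the mechanism, not the contents of scripts/run-trials.ts; the trimmed-down server entries and the `SEARCH_PROVIDERS` name are hypothetical.

```typescript
// Trimmed-down stand-in for the constants in mcp-servers.ts.
const MCP_SERVERS = {
  you: { name: "ydc-server" },
  exa: { name: "exa-server" },
} as const;

type McpServerKey = keyof typeof MCP_SERVERS;
type SearchProvider = "builtin" | McpServerKey;

// "builtin" plus one provider per configured MCP server.
const SEARCH_PROVIDERS: SearchProvider[] = [
  "builtin",
  ...(Object.keys(MCP_SERVERS) as McpServerKey[]),
];
```

Under this design, adding an entry to `MCP_SERVERS` extends both the `SearchProvider` type and the runtime list in one place.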
docker compose build
docker compose run --rm -e SEARCH_PROVIDER=exa claude-code
bun run trials -- --search-provider exa --count 5 -k 1
Current agent schemas:
| Schema | Agent | Mode | Status |
|---|---|---|---|
| claude-code.json | Claude Code | stream | ✅ Tested |
| gemini.json | Gemini CLI | iterative | ✅ Tested |
| droid.json | Droid CLI | stream | ✅ Tested |
| codex.json | Codex CLI | stream | ✅ Tested |
Session Modes:
- stream - Process stays alive, multi-turn conversations via stdin
- iterative - New process per turn, history passed as context