web-search-agent-evals
| Field | Value |
|---|---|
| name | web-search-agent-evals |
| description | Development assistant for web search agent evaluations across multiple CLI agents |
| compatibility | Bun >= 1.2.9 |
Development assistant for running and comparing web search capabilities across CLI agents.
Evaluates 4 agents (Claude Code, Gemini, Droid, Codex) against 2 search providers (builtin, You.com MCP), for 8 agent×provider pairings.
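The scenario matrix is simply the cross product of agents and providers. A minimal sketch of how the 8 pairings enumerate (the agent and provider names come from this repo; the helper itself is illustrative):

```typescript
// Illustrative only: enumerate the 8 agent×provider scenarios.
const AGENTS = ['claude-code', 'gemini', 'droid', 'codex'] as const
const PROVIDERS = ['builtin', 'you'] as const

const scenarios = AGENTS.flatMap((agent) =>
  PROVIDERS.map((provider) => ({ agent, provider })),
)
console.log(scenarios.length) // 8
```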
Key Features:
Architecture:

- agent-schemas/ - Headless adapter JSON schemas
- mcp-servers.ts - TypeScript MCP server constants
- docker/entrypoint - Bun shell script for runtime config
- scripts/ - Type-safe execution and comparison CLI tools
- docker/ - Container infrastructure

# Full dataset (151 prompts), k=5 — all 8 agent×provider scenarios
bun run trials
# Quick smoke test (5 random prompts, single trial)
bun run trials -- --count 5 -k 1
# Specific agent or provider
bun run trials -- --agent claude-code --search-provider builtin
bun run trials -- --agent gemini --search-provider you
# Trial type presets
bun run trials -- --trial-type capability # k=10, deep exploration
bun run trials -- --trial-type regression # k=3, fast regression check
# Custom k value
bun run trials -- -k 7
# Control parallelism
bun run trials -- -j 4 # Limit to 4 containers
bun run trials -- --prompt-concurrency 4 # 4 prompts per container
# Direct Docker (manual testing)
docker compose run --rm -e SEARCH_PROVIDER=builtin claude-code
docker compose run --rm -e SEARCH_PROVIDER=you -e PROMPT_COUNT=5 gemini
Comparisons are written to data/comparisons/YYYY-MM-DD/.
# Latest date auto-detected
bun run compare
# Statistical analysis with bootstrap confidence intervals
bun run compare:stat
# Specific date or filter
bun run compare -- --run-date 2026-02-18
bun run compare -- --agent droid
bun run compare -- --search-provider builtin
bun run compare -- --trial-type capability
# View results
cat data/comparisons/2026-02-18/all-builtin-weighted.json | jq '.capability'
cat data/comparisons/2026-02-18/builtin-vs-you-weighted.json | jq '.headToHead.capability'
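Latest-date auto-detection presumably reduces to sorting the dated directory names, since YYYY-MM-DD sorts lexicographically. A hypothetical sketch:

```typescript
import { readdirSync } from 'node:fs'

// Hypothetical: pick the most recent YYYY-MM-DD directory under data/comparisons/.
const latestRunDate = (root = 'data/comparisons'): string | undefined =>
  readdirSync(root)
    .filter((name) => /^\d{4}-\d{2}-\d{2}$/.test(name))
    .sort()
    .at(-1)
```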
Comparison strategies:

- weighted (default) - Capability, reliability, and consistency weighted scoring
- statistical - Bootstrap sampling with 95% confidence intervals (sketched after the report section below)

Generate a comprehensive REPORT.md from comparison results:
# Latest date auto-detected
bun run report
# Specific date
bun run report -- --run-date 2026-02-18
# Preview without writing
bun run report -- --dry-run
Report includes:
Output: data/comparisons/YYYY-MM-DD/REPORT.md
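The statistical strategy listed above resamples per-prompt pass rates with replacement. A self-contained sketch of a bootstrap 95% confidence interval, not the harness's actual implementation:

```typescript
// Illustrative bootstrap 95% CI over per-prompt pass rates.
const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length

const bootstrapCI = (passRates: number[], iterations = 10_000) => {
  const means: number[] = []
  for (let i = 0; i < iterations; i++) {
    // Resample with replacement, same size as the original dataset.
    const sample = Array.from(
      { length: passRates.length },
      () => passRates[Math.floor(Math.random() * passRates.length)],
    )
    means.push(mean(sample))
  }
  means.sort((a, b) => a - b)
  return {
    mean: mean(passRates),
    lower: means[Math.floor(iterations * 0.025)],
    upper: means[Math.floor(iterations * 0.975)],
  }
}
```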
Interactive wizard to sample failures and review grader accuracy. Helps distinguish between agent failures (agent got it wrong) and grader bugs (agent was correct, grader too strict).
# Interactive calibration (recommended)
bun run calibrate
Interactive prompts:
Output: data/calibration/{date}-{agent}-{provider}.md
What calibration reveals:
See @agent-eval-harness calibration docs for grader calibration concepts.
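A rough sketch of what failure sampling could look like (hypothetical helper; the actual wizard is interactive), assuming the TrialResult shape shown in the Results section below:

```typescript
// Hypothetical: sample failed trial results for manual grader review.
type TrialResult = { id: string; input: string; passRate: number }

const sampleFailures = async (path: string, n = 10): Promise<TrialResult[]> => {
  const lines = (await Bun.file(path).text()).trim().split('\n')
  const failures = lines
    .map((line) => JSON.parse(line) as TrialResult)
    .filter((r) => r.passRate < 1)
  // Crude shuffle; fine for picking a handful to review by hand.
  return failures.sort(() => Math.random() - 0.5).slice(0, n)
}
```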
The evaluation harness supports two-level parallelization for optimal performance:
Container-level concurrency (-j, --concurrency) controls how many Docker containers (agent×provider scenarios) run simultaneously.
bun run trials # Unlimited (default, all 8 scenarios at once)
bun run trials -- -j 4 # Limit to 4 containers
bun run trials -- -j 1 # Sequential (debugging)
Use cases:
- -j 4 - Limit concurrency if hitting API rate limits
- -j 2 - Conservative, for low-resource machines
- -j 1 - Sequential execution for debugging

Prompt-level concurrency (--prompt-concurrency) controls how many prompts run in parallel within each container.
bun run trials -- --prompt-concurrency 4 # 4 prompts (moderate parallelism)
bun run trials -- --prompt-concurrency 1 # Sequential (default, safest)
bun run trials -- --prompt-concurrency 8 # 8 prompts (high memory, CI only)
How it works:

- -j flag with --workspace-dir for isolation

Performance comparison:
| Config | Containers | Prompts/Container | Full (151 prompts, k=5) |
|---|---|---|---|
| Default | unlimited | 1 | ~2.5 hrs |
| Faster | unlimited | 4 | ~40 min |
| CI (high memory) | unlimited | 8 | ~20 min |
Warning: Stream-mode agents (claude-code, droid) use ~400-500MB RSS per prompt process. With --prompt-concurrency 8, that's 3-4GB per container — OOM kills are likely in Docker (see issue #45).
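Conceptually, both levels are the same mechanism: bounded concurrency over async tasks. A generic limiter sketch (illustrative, not the harness's actual scheduler):

```typescript
// Illustrative concurrency limiter: run tasks with at most `limit` in flight.
const runLimited = async <T>(
  tasks: (() => Promise<T>)[],
  limit: number,
): Promise<T[]> => {
  const results: T[] = new Array(tasks.length)
  let next = 0
  // Each worker pulls the next unclaimed task until none remain.
  const worker = async () => {
    while (next < tasks.length) {
      const i = next++
      results[i] = await tasks[i]()
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker))
  return results
}
```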
Prompts live in a flat data/prompts/ directory. The format differs by search provider:
"Use {server-name} and answer\n{query}" with MCP metadata| File | Prompts | Metadata | Use With |
|---|---|---|---|
| prompts.jsonl | 151 | No MCP | SEARCH_PROVIDER=builtin |
| prompts-you.jsonl | 151 | mcpServer="ydc-server", expectedTools=["you-search"] | SEARCH_PROVIDER=you |
The entrypoint automatically selects the correct prompt file based on SEARCH_PROVIDER. To run a random subset, pass PROMPT_COUNT (or --count N via CLI):
bun run trials -- --count 5 # 5 random prompts from full dataset
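Random subset selection can be sketched as a shuffle-and-slice over the prompt file (hypothetical helper; the real selection happens in the entrypoint):

```typescript
// Hypothetical sketch: select `count` random prompts from a .jsonl dataset.
const samplePrompts = async (path: string, count: number): Promise<string[]> => {
  const lines = (await Bun.file(path).text()).trim().split('\n')
  // Fisher-Yates shuffle, then take the first `count` lines.
  for (let i = lines.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1))
    ;[lines[i], lines[j]] = [lines[j], lines[i]]
  }
  return lines.slice(0, count)
}
```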
All trial results are written to flat dated directories:
data/results/YYYY-MM-DD/
├── claude-code/
│ ├── builtin.jsonl
│ └── you.jsonl
├── gemini/
├── droid/
└── codex/
Each .jsonl line is a TrialResult:
{"id":"websearch-001","input":"...","k":5,"passRate":0.8,"passAtK":0.999,"passExpK":0.328,"trials":[...]}
Versioning:
git add data/results/ && git commit -m "feat: trial run YYYY-MM-DD"
Compare runs:
bun run compare # Latest date auto-detected
bun run compare -- --run-date 2026-02-18 # Specific date
Create agent-schemas/<agent>.json:
{
"command": ["<agent-cli>", "--flag", "{input}"],
"outputEvents": {
"match": { "path": "$.type" },
"patterns": {
"text": { "value": "text" },
"tool_call": { "value": "tool_call" },
"tool_result": { "value": "tool_result" }
}
},
"result": {
"contentPath": "$.output",
"errorPath": "$.error"
},
"mode": "stream",
"env": ["AGENT_API_KEY"]
}
Key fields:

- command - CLI invocation with {input} placeholder
- outputEvents.match.path - JSONPath to the event type field
- patterns - Map event types to standard names
- result.contentPath - JSONPath to extract the final output
- mode - "stream" (persistent) or "iterative" (new process per turn)
- env - Required environment variables

Test schema:
bunx @plaited/agent-eval-harness adapter:check -- \
bunx @plaited/agent-eval-harness headless --schema agent-schemas/<agent>.json
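To make the schema fields concrete, here is a rough illustration of how an adapter might classify events, with the JSONPath match simplified to a dotted-path lookup (not the harness's actual implementation):

```typescript
// Simplified event classification driven by an adapter schema.
type Patterns = Record<string, { value: string }>

const classify = (
  event: Record<string, unknown>,
  matchPath: string,
  patterns: Patterns,
): string => {
  // Resolve a "$.type"-style path by walking dotted segments.
  const value = matchPath
    .replace(/^\$\.?/, '')
    .split('.')
    .reduce<unknown>((obj, key) => (obj as Record<string, unknown> | undefined)?.[key], event)
  for (const [name, { value: expected }] of Object.entries(patterns)) {
    if (value === expected) return name // e.g. "text", "tool_call", "tool_result"
  }
  return 'unknown'
}

classify({ type: 'tool_call' }, '$.type', {
  text: { value: 'text' },
  tool_call: { value: 'tool_call' },
  tool_result: { value: 'tool_result' },
}) // => "tool_call"
```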
Create docker/<agent>.Dockerfile:
FROM base
# Install agent CLI
RUN npm install -g <agent-cli>
# Copy entrypoint and MCP config
COPY docker/entrypoint /entrypoint
COPY mcp-servers.ts /eval/mcp-servers.ts
RUN chmod +x /entrypoint
CMD ["/entrypoint"]
Verify installation:
docker build -t test-<agent> -f docker/<agent>.Dockerfile .
docker run --rm test-<agent> <agent> --version
Add to docker-compose.yml:
<agent>:
build:
context: .
dockerfile: docker/<agent>.Dockerfile
volumes:
- ./agent-schemas:/eval/agent-schemas:ro
- ./data:/eval/data
- ./scripts:/eval/scripts:ro
working_dir: /workspace
env_file: .env
environment:
- AGENT=<agent>
- SEARCH_PROVIDER=${SEARCH_PROVIDER:-builtin}
Edit docker/entrypoint to add the agent to the configureMcp() function:
const configureMcp = async (agent: string, tool: McpServerKey): Promise<void> => {
const server = MCP_SERVERS[tool]
const apiKey = server.auth ? process.env[server.auth.envVar] : undefined
switch (agent) {
// ... existing cases ...
case '<agent>': {
await $`<agent> mcp add ${server.name} ${server.url} --header "Authorization: Bearer ${apiKey}"`.quiet()
console.log('✓ Agent MCP server added')
break
}
}
}
Add a timeout if needed in buildTrialsCommand():
switch (AGENT) {
case '<agent>':
cmd.push('--timeout', '120000') // 2 minutes
break
}
Edit scripts/shared/shared.constants.ts to add agent to ALL_AGENTS:
export const ALL_AGENTS: Agent[] = ["claude-code", "gemini", "droid", "codex", "<agent>"]
Also update the Agent type in scripts/shared/shared.types.ts:
type Agent = "claude-code" | "gemini" | "droid" | "codex" | "<agent>"
docker compose build <agent>
docker compose run --rm -e SEARCH_PROVIDER=builtin <agent>
docker compose run --rm -e SEARCH_PROVIDER=you <agent>
Add the new server to mcp-servers.ts:

export const MCP_SERVERS = {
you: { /* ... existing */ },
exa: {
name: 'exa-server',
type: 'http' as const,
url: 'https://api.exa.ai/mcp',
auth: {
type: 'bearer' as const,
envVar: 'EXA_API_KEY',
},
},
} as const
export type McpServerKey = keyof typeof MCP_SERVERS
Add the new tool case to configureMcp() for each agent:
case 'claude-code': {
await $`claude mcp add --transport http ${server.name} ${server.url} --header "Authorization: Bearer ${apiKey}"`.quiet()
break
}
case 'gemini': {
await $`gemini mcp add --transport http --header "Authorization: Bearer ${apiKey}" ${server.name} ${server.url}`.quiet()
break
}
case 'droid': {
await $`droid mcp add ${server.name} ${server.url} --type http --header "Authorization: Bearer ${apiKey}"`.quiet()
break
}
case 'codex': {
const configDir = `${process.env.HOME}/.codex`
await $`mkdir -p ${configDir}`.quiet()
const config = `[mcp_servers.${server.name}]
url = "${server.url}"
bearer_token_env_var = "${server.auth?.envVar}"
`
await Bun.write(`${configDir}/config.toml`, config)
break
}
Add to .env and .env.example:
EXA_API_KEY=your_api_key_here
Use the generate-mcp-prompts script to create MCP variant files with proper metadata:
# Generate variants for new MCP server
bun scripts/generate-mcp-prompts.ts --mcp-key exa
# Creates:
# - data/prompts/prompts-exa.jsonl
The script prepends "Use {server-name} and answer\n" to each query and adds MCP metadata (server name and expected tools).
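The per-line transformation is roughly the following (illustrative; the Prompt field names are inferred from the TrialResult sample above, and the exa tool name is an assumption):

```typescript
// Illustrative per-line rewrite: prompts.jsonl -> prompts-exa.jsonl.
type Prompt = { id: string; input: string }

const toMcpVariant = (p: Prompt, serverName: string, expectedTools: string[]) => ({
  ...p,
  // Prepend the MCP instruction to the original query.
  input: `Use ${serverName} and answer\n${p.input}`,
  mcpServer: serverName, // e.g. "exa-server"
  expectedTools, // e.g. ["exa-search"] (assumed tool name)
})
```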
The entrypoint automatically handles provider-specific prompt files:
const promptFile = SEARCH_PROVIDER === "builtin"
? `/eval/data/prompts/prompts.jsonl`
: `/eval/data/prompts/prompts-${SEARCH_PROVIDER}.jsonl` // e.g., prompts-exa.jsonl
Note: scripts/run-trials.ts automatically picks up new MCP servers from mcp-servers.ts, so no manual updates are needed.
docker compose build
docker compose run --rm -e SEARCH_PROVIDER=exa claude-code
bun run trials -- --search-provider exa --count 5 -k 1
Current agent schemas:
| Schema | Agent | Mode | Status |
|---|---|---|---|
| claude-code.json | Claude Code | stream | ✅ Tested |
| gemini.json | Gemini CLI | iterative | ✅ Tested |
| droid.json | Droid CLI | stream | ✅ Tested |
| codex.json | Codex CLI | stream | ✅ Tested |
Session Modes:
- stream - Process stays alive, multi-turn conversations via stdin
- iterative - New process per turn, history passed as context
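Conceptually, the two modes differ only in process lifetime. An illustrative contrast (the --prompt flag is a hypothetical stand-in for each agent's real CLI interface):

```typescript
// Iterative mode (illustrative): a fresh process per turn, prior turns passed as context.
const iterativeTurn = async (cli: string, history: string[], input: string) => {
  const proc = Bun.spawn([cli, '--prompt', [...history, input].join('\n')])
  return (await new Response(proc.stdout).text()).trim()
}
// Stream mode instead keeps one long-lived process and writes each turn to its stdin.
```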