mcp-audit
Analyze MCP tool usage patterns, reward/time deltas conditioned on MCP adoption, and zero-MCP investigation. Triggers on mcp audit, mcp analysis, mcp impact, tool usage analysis, did mcp help.
| name | mcp-audit |
| description | Analyze MCP tool usage patterns, reward/time deltas conditioned on MCP adoption, and zero-MCP investigation. Triggers on mcp audit, mcp analysis, mcp impact, tool usage analysis, did mcp help. |
| user-invocable | true |
Analyze MCP (Sourcegraph) tool usage across benchmark runs to understand where MCP helps, hurts, or goes unused.
Runs scripts/mcp_audit.py which:
- Reads task_metrics.json from paired_rerun batches (BL + SF on same VM)

cd ~/CodeScaleBench && python3 scripts/mcp_audit.py --json --verbose 2>/dev/null
Save the JSON output for analysis. The script prints progress to stderr and results to stdout.
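One way to capture that output from Python is sketched below. It relies only on what is stated above (progress on stderr, JSON on stdout). The helper name `run_audit` is hypothetical, and because the schema of the JSON is not specified here, the sketch returns the parsed object untouched.

```python
import json
import subprocess


def run_audit(cmd, cwd=None):
    """Run an audit command and parse its stdout as JSON.

    Hypothetical helper: stderr (progress messages) is discarded,
    mirroring the `2>/dev/null` in the shell invocation.
    """
    proc = subprocess.run(
        cmd,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit
    )
    return json.loads(proc.stdout)


# Example call (not executed here; assumes the repo lives at this path):
# audit = run_audit(
#     ["python3", "scripts/mcp_audit.py", "--json", "--verbose"],
#     cwd="/home/you/CodeScaleBench",
# )
```

Using `check=True` means a crashed audit surfaces as an exception rather than as a confusing JSON parse error on empty output.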
From the JSON output, present these tables:
Overview:
| Metric | Value |
|---|---|
| Total unique tasks | N |
| Complete BL+SF pairs | N |
| Used-MCP tasks | N |
| Zero-MCP tasks | N |
| Total MCP calls | N |
Per-benchmark MCP adoption:
| Benchmark | Total | Used MCP | Zero MCP | Zero % |
|---|---|---|---|---|
Reward deltas (the used-MCP rows give the cleaner signal):
| Group | N | BL Mean | SF Mean | Delta | p-value |
|---|---|---|---|---|---|
| Used-MCP | N | X | Y | +Z% | p |
| Zero-MCP | N | X | Y | -Z% | p |
| Light (1-5 calls) | N | X | Y | Z% | |
| Moderate (6-20) | N | X | Y | Z% | |
| Heavy (20+) | N | X | Y | Z% | |
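The grouping behind this table can be sketched as follows. This is an illustration, not the logic of scripts/mcp_audit.py: the record keys `bl_reward`, `sf_reward`, and `mcp_calls` are hypothetical, the delta is computed as a plain difference of means (how it is turned into a percentage is left open), and a count of exactly 20 is assumed to fall in the Moderate bucket. The p-value column is omitted because the statistical test is not named here.

```python
from statistics import mean


def intensity_bucket(mcp_calls):
    """Map an MCP call count to the labels used in the reward table."""
    if mcp_calls == 0:
        return "Zero-MCP"
    if mcp_calls <= 5:
        return "Light (1-5 calls)"
    if mcp_calls <= 20:
        return "Moderate (6-20)"
    return "Heavy (20+)"  # assumption: exactly 20 calls counts as Moderate


def reward_deltas(pairs):
    """Group paired BL/SF results by bucket and compute mean rewards.

    `pairs` is a list of dicts with the hypothetical keys
    `bl_reward`, `sf_reward`, and `mcp_calls`.
    """
    groups = {}
    for p in pairs:
        groups.setdefault(intensity_bucket(p["mcp_calls"]), []).append(p)

    rows = {}
    for label, members in groups.items():
        bl = mean(m["bl_reward"] for m in members)
        sf = mean(m["sf_reward"] for m in members)
        rows[label] = {
            "n": len(members),
            "bl_mean": bl,
            "sf_mean": sf,
            "delta": sf - bl,  # absolute difference of means
        }
    return rows
```

A combined Used-MCP row can be produced the same way by grouping on `mcp_calls > 0` instead of on the intensity bucket.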
Timing deltas:
| Group | BL Mean (s) | SF Mean (s) | Delta |
|---|---|---|---|
For each zero-MCP task, classify the reason:
For unexplained zero-MCP cases, offer to read the transcript:
# Find the task's transcript (null-delimited so paths with spaces survive)
find "$(readlink -f runs/official)" -path "*sourcegraph_full*" -name "claude-code.txt" -print0 | xargs -0 grep -l "TASK_ID_HERE" 2>/dev/null
List any tasks where baseline passes but SG_full fails (reward regression):
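The regression check can be sketched as a simple filter over paired results. The record keys (`task_id`, `bl_reward`, `sf_reward`) and the pass threshold are assumptions for illustration; the audit script may define a pass differently.

```python
def reward_regressions(pairs, pass_threshold=1.0):
    """Return task IDs where baseline passes and SG_full does not.

    Assumption: a task "passes" when its reward is at least
    `pass_threshold`; adjust this to match the benchmark's own rule.
    """
    return sorted(
        p["task_id"]
        for p in pairs
        if p["bl_reward"] >= pass_threshold and p["sf_reward"] < pass_threshold
    )
```

Sorting the IDs keeps the list stable between runs, which makes it easier to diff one audit against the next.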
Show which MCP tools are most/least used:
| Tool | Calls | Tasks | Avg/Task |
|---|---|---|---|
| keyword_search | N | N | X |
| nls_search | N | N | X |
| read_file | N | N | X |
| ... |
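The columns of this table can be derived from raw tool names roughly as shown below. This is a sketch under stated assumptions, not the script's code: the input shape (task ID mapped to a list of raw tool names) is hypothetical, and the unprefixed form `mcp__sourcegraph__keyword_search` is inferred from the note that some names carry an `sg_` prefix and others do not.

```python
from collections import Counter

SERVER_PREFIX = "mcp__sourcegraph__"


def normalize_tool(name):
    """Strip the MCP server prefix and an optional `sg_` prefix."""
    if name.startswith(SERVER_PREFIX):
        name = name[len(SERVER_PREFIX):]
    if name.startswith("sg_"):
        name = name[len("sg_"):]
    return name


def tool_usage(calls_by_task):
    """Build rows of (tool, calls, tasks, avg per task), most-used first.

    `calls_by_task` maps a task ID to the list of raw tool names it
    called (a hypothetical input shape).
    """
    calls = Counter()
    tasks = Counter()
    for tool_names in calls_by_task.values():
        normalized = [normalize_tool(n) for n in tool_names]
        calls.update(normalized)
        tasks.update(set(normalized))  # count each task once per tool
    return [
        (tool, n, tasks[tool], n / tasks[tool])
        for tool, n in calls.most_common()
    ]
```

Normalizing before counting means both naming variants land in the same row, so a tool is not split across two lines of the table.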
Synthesize findings into:
python3 scripts/mcp_audit.py --all-runs --json --verbose
python3 scripts/mcp_audit.py --verbose
python3 scripts/mcp_audit.py --json --verbose --output docs/MCP_AUDIT_$(date +%Y-%m-%d).md
- Tool calls are read from claude-code.txt (includes Task subagent MCP calls), NOT trajectory.json (main-agent only). This was fixed in commit 59cdf7db.
- Paired batches are the paired_rerun_* directories in runs/official/.
- Some tool names carry the sg_ prefix (mcp__sourcegraph__sg_keyword_search), others don't. The script handles both.

- /compare-configs: Binary pass/fail divergence (simpler, doesn't condition on MCP usage)
- /evaluate-traces: Comprehensive trace audit (broader scope, includes data integrity)
- /cost-report: Token and cost analysis (doesn't pair tasks or condition on MCP)