| name | ir-analysis |
| description | Compute information retrieval quality metrics (precision, recall, MRR, nDCG, MAP) comparing file retrieval across baseline and MCP configs against ground truth. Triggers on ir analysis, retrieval metrics, file recall, ground truth, search quality. |
| user-invocable | true |
IR Analysis
Measure how well agents find the right files, comparing baseline (local tools) vs MCP (Sourcegraph) retrieval against per-task ground truth.
What This Does
Runs scripts/ir_analysis.py which:
- Loads ground truth files per task from
configs/ground_truth_files.json (or builds it from benchmark task dirs)
- Parses agent transcripts (
agent/claude-code.txt) to extract which files were accessed via tool calls
- Computes IR metrics: Precision@K, Recall@K, F1@K, MRR, nDCG@K, MAP, file-level recall, context efficiency
- Aggregates by benchmark and config, with statistical significance tests
Steps
1. Ensure ground truth is built
If configs/ground_truth_files.json doesn't exist or needs refreshing:
cd ~/CodeScaleBench && python3 scripts/ir_analysis.py --build-ground-truth
This extracts ground truth files from each benchmark's task structure (patches, diffs, ground_truth dirs, test scripts, instructions). Reports per-benchmark counts and confidence levels.
2. Run the IR analysis
cd ~/CodeScaleBench && python3 scripts/ir_analysis.py --json 2>/dev/null
Or for human-readable table output:
cd ~/CodeScaleBench && python3 scripts/ir_analysis.py 2>/dev/null
3. Parse and present key findings
Per-benchmark IR scores:
| Benchmark | Config | N | File Recall | MRR | P@5 | R@5 | nDCG@5 | MAP | Ctx Eff |
|---|
Overall aggregates:
| Config | File Recall | MRR | MAP | Context Efficiency |
|---|
| baseline | X | X | X | X |
| sourcegraph_full | X | X | X | X |
Statistical tests (baseline vs SG_full):
| Metric | Welch's t | p-value | Cohen's d | Bootstrap 95% CI |
|---|
4. Interpret results
Key metrics to focus on:
- File recall: Fraction of ground truth files the agent accessed (most important)
- MRR: How quickly the agent found the first relevant file (1.0 = first file accessed was relevant)
- Context efficiency: Relevant files / total files accessed (higher = less noise)
- P@K: Precision at top-K accessed files (were early accesses relevant?)
5. Per-task drill-down (optional)
For detailed per-task scores:
python3 scripts/ir_analysis.py --per-task --json 2>/dev/null
Filter to a specific benchmark:
python3 scripts/ir_analysis.py --suite csb_sdlc_swebenchpro 2>/dev/null
Variants
Build/refresh ground truth only
python3 scripts/ir_analysis.py --build-ground-truth
JSON output for programmatic use
python3 scripts/ir_analysis.py --json > /tmp/ir_results.json
Filter to one benchmark
python3 scripts/ir_analysis.py --suite csb_sdlc_pytorch
Per-task detail
python3 scripts/ir_analysis.py --per-task
Key Technical Notes
- Ground truth confidence levels: "high" (from patches/diffs — SWE-bench Pro, PyTorch, K8s Docs), "medium" (from test scripts), "low" (regex from instructions). High-confidence tasks give the most reliable IR metrics.
- Transcript parsing: Reads Harbor's nested JSONL format from
agent/claude-code.txt. Extracts file paths from Read, Grep, Glob, Write, Edit tool inputs and MCP tool results.
- Path normalization: Strips
/workspace/ prefix, a//b/ diff notation, lowercases for comparison.
- Baseline retrieval: For runs without MCP, "retrieved files" come from local Read/Grep/Glob calls. This measures manual navigation quality vs MCP search quality.
- Deduplication: When multiple batches exist for the same task+config, the latest (by
started_at timestamp) wins.
- Statistical tests: Uses pure-stdlib implementations from
csb_metrics/statistics.py — Welch's t-test, Cohen's d, bootstrap CI. No scipy dependency.
Ground Truth Sources
| Benchmark | Strategy | Source File | Confidence |
|---|
| SWE-bench Pro | Patch headers | solve.sh / solution/solve.sh | high |
| PyTorch | Diff headers | tests/expected.diff / instruction.md | high |
| K8s Docs | Directory listing | ground_truth/ | high |
| Governance | Test script paths | tests/test.sh | medium |
| Enterprise | Test script paths | tests/test.sh | medium |
| Others | Instruction regex | instruction.md | low |
Related Skills
/mcp-audit — MCP usage patterns and adoption rates (complements IR quality metrics)
/compare-configs — Binary pass/fail divergence with optional statistical tests
/evaluate-traces — Comprehensive trace audit (broader scope, data integrity focus)