ir-analysis

Name: Ir Analysis
Author: sourcegraph

Compute information retrieval quality metrics (precision, recall, MRR, nDCG, MAP) comparing file retrieval across baseline and MCP configs against ground truth. Triggers on ir analysis, retrieval metrics, file recall, ground truth, search quality.

stars:25

forks:3

updated: 2026-03-17 01:39

SKILL.md

name	ir-analysis
description	Compute information retrieval quality metrics (precision, recall, MRR, nDCG, MAP) comparing file retrieval across baseline and MCP configs against ground truth. Triggers on ir analysis, retrieval metrics, file recall, ground truth, search quality.
user-invocable	true

IR Analysis

Measure how well agents find the right files, comparing baseline (local tools) vs MCP (Sourcegraph) retrieval against per-task ground truth.

What This Does

Runs scripts/ir_analysis.py, which:

Loads ground truth files per task from configs/ground_truth_files.json (or builds it from benchmark task dirs)
Parses agent transcripts (agent/claude-code.txt) to extract which files were accessed via tool calls
Computes IR metrics: Precision@K, Recall@K, F1@K, MRR, nDCG@K, MAP, file-level recall, context efficiency
Aggregates by benchmark and config, with statistical significance tests
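
As a rough illustration of two of the ranked metrics listed above, here is a minimal sketch of binary-relevance nDCG@K and average precision (MAP is the mean of average precision over tasks). The function names and the assumption that the retrieved list is ordered by first access and contains no duplicates are illustrative; they are not taken from scripts/ir_analysis.py, which may define these metrics differently.

```python
import math
from typing import Sequence, Set


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG@K: DCG of the first k retrieved files over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, path in enumerate(retrieved[:k], start=1)
        if path in relevant
    )
    # The ideal ranking places every relevant file first, capped at k positions.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision values taken at each rank where a relevant file appears."""
    hits = 0
    total = 0.0
    for rank, path in enumerate(retrieved, start=1):
        if path in relevant:
            hits += 1
            total += hits / rank
    # Dividing by the number of relevant files penalizes files never retrieved.
    return total / len(relevant) if relevant else 0.0
```

For example, with retrieved files a.py, x.py, b.py and ground truth {a.py, b.py}, average precision is (1/1 + 2/3) / 2.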

Steps

1. Ensure ground truth is built

If configs/ground_truth_files.json doesn't exist or needs refreshing:

cd ~/CodeScaleBench && python3 scripts/ir_analysis.py --build-ground-truth

This extracts ground truth files from each benchmark's task structure (patches, diffs, ground_truth dirs, test scripts, instructions) and reports per-benchmark counts and confidence levels.

2. Run the IR analysis

cd ~/CodeScaleBench && python3 scripts/ir_analysis.py --json 2>/dev/null

Or for human-readable table output:

cd ~/CodeScaleBench && python3 scripts/ir_analysis.py 2>/dev/null

3. Parse and present key findings

Per-benchmark IR scores:

Benchmark	Config	N	File Recall	MRR	P@5	R@5	nDCG@5	MAP	Ctx Eff

Overall aggregates:

Config	File Recall	MRR	MAP	Context Efficiency
baseline	X	X	X	X
sourcegraph_full	X	X	X	X

Statistical tests (baseline vs SG_full):

Metric	Welch's t	p-value	Cohen's d	Bootstrap 95% CI
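
To make the columns of the statistical table concrete, the following is a minimal stdlib-only sketch of Welch's t statistic, Cohen's d, and a percentile bootstrap confidence interval for the difference in means. It is not the code in csb_metrics/statistics.py; the function names, the resample count, and the fixed seed are assumptions. The p-value is left out because it needs the Student's t CDF, which Python's statistics module does not provide.

```python
import math
import random
import statistics
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)


def bootstrap_ci(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean(a) - mean(b), resampling each group."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(statistics.fmean(resample_a) - statistics.fmean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Each function takes two lists of per-task scores, for example the file recall values for one config and for the other.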

4. Interpret results

Key metrics to focus on:

File recall: Fraction of ground truth files the agent accessed (most important)
MRR: How quickly the agent found the first relevant file (1.0 = first file accessed was relevant)
Context efficiency: Relevant files / total files accessed (higher = less noise)
P@K: Precision at top-K accessed files (were early accesses relevant?)
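
The four metrics above can be sketched per task as follows. This is a minimal illustration based only on the definitions given here: the function names are hypothetical, the retrieved list is assumed to be ordered by first access, and dividing P@K by K (rather than by the number of files actually accessed) is an assumption about the convention. MRR is then the mean of the per-task reciprocal rank.

```python
from typing import Sequence, Set


def file_recall(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of ground truth files that the agent accessed at least once."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant file; 0.0 if none was accessed."""
    for rank, path in enumerate(retrieved, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the first k accessed files that are in the ground truth."""
    if k <= 0:
        return 0.0
    return sum(1 for path in retrieved[:k] if path in relevant) / k


def context_efficiency(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Relevant files accessed divided by all distinct files accessed."""
    accessed = set(retrieved)
    if not accessed:
        return 0.0
    return len(accessed & relevant) / len(accessed)
```

A run that reads two of three ground truth files among four distinct files has file recall 2/3 and context efficiency 0.5.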

5. Per-task drill-down (optional)

For detailed per-task scores:

python3 scripts/ir_analysis.py --per-task --json 2>/dev/null

Filter to a specific benchmark:

python3 scripts/ir_analysis.py --suite csb_sdlc_swebenchpro 2>/dev/null

Variants

Build/refresh ground truth only

python3 scripts/ir_analysis.py --build-ground-truth

JSON output for programmatic use

python3 scripts/ir_analysis.py --json > /tmp/ir_results.json

Filter to one benchmark

python3 scripts/ir_analysis.py --suite csb_sdlc_pytorch

Per-task detail

python3 scripts/ir_analysis.py --per-task

Key Technical Notes

Ground truth confidence levels: "high" (from patches/diffs — SWE-bench Pro, PyTorch, K8s Docs), "medium" (from test scripts), "low" (regex from instructions). High-confidence tasks give the most reliable IR metrics.
Transcript parsing: Reads Harbor's nested JSONL format from agent/claude-code.txt. Extracts file paths from Read, Grep, Glob, Write, Edit tool inputs and MCP tool results.
Path normalization: Strips the /workspace/ prefix and the a/ and b/ diff-notation prefixes, then lowercases paths for comparison.
Baseline retrieval: For runs without MCP, "retrieved files" come from local Read/Grep/Glob calls. This measures manual navigation quality vs MCP search quality.
Deduplication: When multiple batches exist for the same task+config, the latest (by started_at timestamp) wins.
Statistical tests: Uses pure-stdlib implementations from csb_metrics/statistics.py — Welch's t-test, Cohen's d, bootstrap CI. No scipy dependency.
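
Two of the notes above, path normalization and deduplication, can be sketched as below. This is an assumption-laden illustration rather than the script's actual code: the function names and the `task` and `config` dictionary keys are hypothetical (only `started_at` comes from the notes), and timestamps are assumed to be ISO 8601 strings in a single format so that string comparison matches chronological order.

```python
from typing import Dict, Iterable, List, Tuple


def normalize_path(path: str) -> str:
    """Strip the /workspace/ prefix and a/ or b/ diff prefixes, then lowercase."""
    path = path.strip()
    if path.startswith("/workspace/"):
        path = path[len("/workspace/"):]
    # Caveat: this also strips a genuine top-level directory named "a" or "b".
    if path.startswith(("a/", "b/")):
        path = path[2:]
    return path.lower()


def latest_runs(runs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep one run per (task, config): the one with the greatest started_at."""
    latest: Dict[Tuple[str, str], Dict[str, str]] = {}
    for run in runs:
        key = (run["task"], run["config"])  # hypothetical field names
        if key not in latest or run["started_at"] > latest[key]["started_at"]:
            latest[key] = run
    return list(latest.values())
```

Normalizing both the ground truth paths and the transcript paths with the same function is what makes the set comparisons in the metrics meaningful.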

Ground Truth Sources

Benchmark	Strategy	Source File	Confidence
SWE-bench Pro	Patch headers	`solve.sh` / `solution/solve.sh`	high
PyTorch	Diff headers	`tests/expected.diff` / `instruction.md`	high
K8s Docs	Directory listing	`ground_truth/`	high
Governance	Test script paths	`tests/test.sh`	medium
Enterprise	Test script paths	`tests/test.sh`	medium
Others	Instruction regex	`instruction.md`	low

Related Skills

/mcp-audit — MCP usage patterns and adoption rates (complements IR quality metrics)
/compare-configs — Binary pass/fail divergence with optional statistical tests
/evaluate-traces — Comprehensive trace audit (broader scope, data integrity focus)

package.json

"author": "sourcegraph"

"repository": "sourcegraph/CodeScaleBench"
