mcp-audit

Name: Mcp Audit
Author: sourcegraph

// Analyze MCP tool usage patterns, reward/time deltas conditioned on MCP adoption, and zero-MCP investigation. Triggers on mcp audit, mcp analysis, mcp impact, tool usage analysis, did mcp help.

stars:25

forks:3

updated:March 17, 2026 at 01:39

SKILL.md

name	mcp-audit
description	Analyze MCP tool usage patterns, reward/time deltas conditioned on MCP adoption, and zero-MCP investigation. Triggers on mcp audit, mcp analysis, mcp impact, tool usage analysis, did mcp help.
user-invocable	true

MCP Audit

Analyze MCP (Sourcegraph) tool usage across benchmark runs to understand where MCP helps, hurts, or goes unused.

What This Does

Runs scripts/mcp_audit.py, which:

Collects task_metrics.json from paired_rerun batches (baseline, abbreviated BL, and sourcegraph_full, abbreviated SF, run on the same VM)
Pairs each baseline task with its sourcegraph_full counterpart for a fair comparison
Classifies tasks by MCP usage: zero-MCP vs used-MCP (light/moderate/heavy)
Computes reward and time deltas conditioned on actual MCP usage
Identifies negative flips (baseline pass → MCP fail)
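The pairing and classification steps above can be sketched as follows. This is a minimal illustration, not the code in scripts/mcp_audit.py: the `TaskRun` record, its field names, and both helper functions are hypothetical, and the bucket thresholds follow the Light/Moderate/Heavy bands used in the reward table, treating exactly 20 calls as moderate (an assumption).

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """Hypothetical record for one task run; not the real task_metrics.json schema."""
    task_id: str
    config: str          # "baseline" or "sourcegraph_full"
    reward: float
    mcp_calls: int = 0


def usage_bucket(mcp_calls: int) -> str:
    """Bucket a task by MCP call count (assumed bands: 1-5, 6-20, above 20)."""
    if mcp_calls == 0:
        return "zero"
    if mcp_calls <= 5:
        return "light"
    if mcp_calls <= 20:
        return "moderate"
    return "heavy"


def pair_runs(runs):
    """Return {task_id: (baseline_run, sf_run)} for tasks present in both configs."""
    by_task = {}
    for run in runs:
        by_task.setdefault(run.task_id, {})[run.config] = run
    return {
        task_id: (configs["baseline"], configs["sourcegraph_full"])
        for task_id, configs in by_task.items()
        if "baseline" in configs and "sourcegraph_full" in configs
    }
```

Tasks that appear under only one config are dropped, which is what makes the later deltas a like-for-like comparison.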

Steps

1. Run the MCP audit

cd ~/CodeScaleBench && python3 scripts/mcp_audit.py --json --verbose 2>/dev/null

Save the JSON output for analysis. The script prints progress to stderr and results to stdout.

2. Parse and present key findings

From the JSON output, present these tables:

Overview:

Metric	Value
Total unique tasks	N
Complete BL+SF pairs	N
Used-MCP tasks	N
Zero-MCP tasks	N
Total MCP calls	N

Per-benchmark MCP adoption:

Benchmark	Total	Used MCP	Zero MCP	Zero %
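The rows of this adoption table could be derived roughly as below. The input shape, a list of `(benchmark, mcp_calls)` tuples taken from the sourcegraph_full side of each pair, is an assumption for illustration; the real JSON output may already contain these counts.

```python
from collections import defaultdict


def adoption_by_benchmark(tasks):
    """Build (benchmark, total, used_mcp, zero_mcp, zero_pct) rows.

    tasks: iterable of (benchmark, mcp_calls) tuples (hypothetical input shape).
    """
    counts = defaultdict(lambda: {"total": 0, "used": 0, "zero": 0})
    for benchmark, mcp_calls in tasks:
        row = counts[benchmark]
        row["total"] += 1
        row["used" if mcp_calls > 0 else "zero"] += 1

    rows = []
    for benchmark in sorted(counts):
        row = counts[benchmark]
        zero_pct = round(100.0 * row["zero"] / row["total"], 1)
        rows.append((benchmark, row["total"], row["used"], row["zero"], zero_pct))
    return rows
```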

Reward deltas (conditioned on MCP usage; the used-MCP rows give the cleaner signal):

Group	N	BL Mean	SF Mean	Delta	p-value
Used-MCP	N	X	Y	+Z%	p
Zero-MCP	N	X	Y	-Z%	p
Light (1-5 calls)	N	X	Y	Z%
Moderate (6-20 calls)	N	X	Y	Z%
Heavy (>20 calls)	N	X	Y	Z%
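For reference, one way to get a mean delta and a p-value for a group of paired tasks is sketched below. The source does not say which statistical test the script uses, so the paired sign-flip permutation test here is a stand-in, and the function name and parameters are hypothetical. The delta is returned as an absolute difference in mean reward; converting it to the percentage shown in the table is left to the caller.

```python
import random
from statistics import mean


def paired_delta(bl_rewards, sf_rewards, n_perm=10000, seed=0):
    """Return (bl_mean, sf_mean, delta, p_value) for paired reward lists.

    The p-value comes from a two-sided sign-flip permutation test on the
    per-task differences (an assumed test, not necessarily the script's).
    """
    if len(bl_rewards) != len(sf_rewards) or not bl_rewards:
        raise ValueError("need two non-empty reward lists of equal length")

    diffs = [sf - bl for bl, sf in zip(bl_rewards, sf_rewards)]
    observed = mean(diffs)

    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis each paired difference is equally likely
        # to have either sign, so flip signs at random.
        flipped = mean([d if rng.random() < 0.5 else -d for d in diffs])
        if abs(flipped) >= abs(observed) - 1e-12:
            extreme += 1

    # Add-one correction keeps the estimate strictly positive.
    p_value = (extreme + 1) / (n_perm + 1)
    return mean(bl_rewards), mean(sf_rewards), observed, p_value
```

With small groups (the light, moderate, and heavy buckets can be small) the p-value has coarse resolution, so it is best read alongside N rather than on its own.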

Timing deltas:

Group	BL Mean (s)	SF Mean (s)	Delta

3. Investigate zero-MCP tasks

For each zero-MCP task, classify the reason:

Trivially local: Task requires only local file operations (e.g., DependEval dependency_recognition)
Explicit file list: Instructions specify exact files to examine (e.g., CodeReview)
Full local codebase: Complete codebase available in container (e.g., SWE-Perf)
Both configs failed: Neither baseline nor SG_full succeeded
Agent confusion: MCP available but agent didn't discover/use it (investigate transcript)

For unexplained zero-MCP cases, offer to read the transcript:

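# Find the sourcegraph_full transcript(s) that mention the task ID.
# -print0 / xargs -0 keeps paths containing spaces intact.
find "$(readlink -f runs/official)" -path "*sourcegraph_full*" -name "claude-code.txt" -print0 | xargs -0 grep -l "TASK_ID_HERE" 2>/dev/null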

4. Check for negative flips

List any tasks where baseline passes but SG_full fails (reward regression):

In used-MCP group: Indicates MCP is actively harmful on these tasks
In zero-MCP group: Indicates preamble overhead or non-determinism
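A negative-flip check along these lines might look like the sketch below. The input shape and the pass criterion are assumptions: a task counts as passing when its reward is strictly above `pass_threshold`, which defaults to 0.0 here and may not match how the script defines a pass.

```python
def negative_flips(pairs, pass_threshold=0.0):
    """Group tasks where baseline passed but sourcegraph_full did not.

    pairs: {task_id: (bl_reward, sf_reward, mcp_calls)} (hypothetical shape).
    Returns {"used_mcp": [...], "zero_mcp": [...]} with sorted task IDs.
    """
    flips = {"used_mcp": [], "zero_mcp": []}
    for task_id, (bl_reward, sf_reward, mcp_calls) in sorted(pairs.items()):
        bl_pass = bl_reward > pass_threshold
        sf_pass = sf_reward > pass_threshold
        if bl_pass and not sf_pass:
            group = "used_mcp" if mcp_calls > 0 else "zero_mcp"
            flips[group].append(task_id)
    return flips
```

Splitting the result by group matters because the two groups call for different follow-up: used-MCP flips point at the tools themselves, while zero-MCP flips point at overhead or run-to-run noise.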

5. MCP tool distribution

Show which MCP tools are most/least used:

Tool	Calls	Tasks	Avg/Task
keyword_search	N	N	X
nls_search	N	N	X
read_file	N	N	X
...
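The aggregation behind this table can be sketched as below. The per-task input shape is hypothetical, and the unprefixed tool-name form (`mcp__sourcegraph__keyword_search`) is inferred from the note that some batches omit the `sg_` prefix; only the `sg_` form is spelled out in the source.

```python
from collections import Counter

MCP_PREFIX = "mcp__sourcegraph__"


def normalize_tool(name):
    """Return the short MCP tool name, or None for non-MCP tools."""
    if not name.startswith(MCP_PREFIX):
        return None
    short = name[len(MCP_PREFIX):]
    # Some batches add an sg_ prefix; fold both variants together.
    return short[3:] if short.startswith("sg_") else short


def tool_distribution(per_task_calls):
    """Build (tool, calls, tasks, avg_per_task) rows, most-called tool first.

    per_task_calls: {task_id: {raw_tool_name: call_count}} (hypothetical shape).
    """
    calls, tasks = Counter(), Counter()
    for tool_counts in per_task_calls.values():
        merged = Counter()
        for raw_name, count in tool_counts.items():
            short = normalize_tool(raw_name)
            if short is not None:
                merged[short] += count
        for short, count in merged.items():
            if count > 0:
                calls[short] += count
                tasks[short] += 1
    return [
        (tool, total, tasks[tool], round(total / tasks[tool], 2))
        for tool, total in calls.most_common()
    ]
```

Avg/Task here is averaged over the tasks that used the tool at least once, not over all tasks; that choice is also an assumption.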

6. Summary and recommendations

Synthesize findings into:

MCP value: Where it demonstrably helps (search-heavy benchmarks)
MCP risk: Where it hurts (implementation-heavy, preamble overhead)
Optimization opportunities: Zero-MCP tasks that SHOULD use MCP but don't
Cost-benefit: Is the token/time overhead justified by reward improvement?

Variants

All runs (not just paired reruns)

python3 scripts/mcp_audit.py --all-runs --json --verbose

Text output (human-readable)

python3 scripts/mcp_audit.py --verbose

Save to file

python3 scripts/mcp_audit.py --json --verbose --output docs/MCP_AUDIT_$(date +%Y-%m-%d).md

Key Technical Notes

Transcript-first extraction: Tool counts come from claude-code.txt (includes Task subagent MCP calls), NOT trajectory.json (main-agent only). This was fixed in commit 59cdf7db.
Paired reruns: BL and SF run concurrently on the same VM, eliminating load confounds. These batch directories are prefixed paired_rerun_ under runs/official/.
Valid task filter: Tasks with <10s agent time or 0 output tokens are excluded, since these typically indicate auth failures rather than real attempts.
MCP tool name variants: Some batches use sg_ prefix (mcp__sourcegraph__sg_keyword_search), others don't. The script handles both.
Zero-MCP != MCP failure: Most zero-MCP tasks rationally chose local tools. Only investigate if the task type suggests MCP should help.
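The valid task filter described above amounts to a two-condition check. A minimal sketch, with hypothetical parameter names rather than the actual task_metrics.json fields:

```python
def is_valid_task(agent_time_s, output_tokens, min_agent_time_s=10.0):
    """Keep a task only if the agent ran at least 10s and produced output tokens.

    Parameter names are illustrative; the real metric keys may differ.
    """
    return agent_time_s >= min_agent_time_s and output_tokens > 0
```

Applying the filter to both sides of a pair before computing deltas keeps an auth failure on one config from showing up as a reward regression.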

Related Skills

/compare-configs — Binary pass/fail divergence (simpler, doesn't condition on MCP usage)
/evaluate-traces — Comprehensive trace audit (broader scope, includes data integrity)
/cost-report — Token and cost analysis (doesn't pair tasks or condition on MCP)

related-skills.json

same repository

archive-run.md

from "sourcegraph/CodeScaleBench"

Archive old completed benchmark runs to save disk space and speed up scans. Triggers on archive runs, clean up runs, disk space, old runs.

2026-03-17, stars: 25

benchmark-audit.md

from "sourcegraph/CodeScaleBench"

Audit benchmark suites against ABC framework (Task/Outcome/Reporting validity). Checks instruction quality, verifier correctness, reproducibility. Triggers on benchmark audit, audit benchmark, abc audit, task validity.

2026-03-17, stars: 25

check-infra.md

from "sourcegraph/CodeScaleBench"

Verify infrastructure readiness before launching benchmark runs — tokens, Docker, disk, credentials. Triggers on check infra, infrastructure check, ready to run, pre-run check.

2026-03-17, stars: 25

compare-configs.md

from "sourcegraph/CodeScaleBench"

Compare benchmark results across agent configurations (baseline, SG_full). Show where configs diverge. Triggers on compare configs, config comparison, which config wins, MCP impact.

2026-03-17, stars: 25

cost-report.md

from "sourcegraph/CodeScaleBench"

Token and cost analysis per run, suite, and config. Shows most expensive tasks and config cost comparison. Triggers on cost report, how much did it cost, token usage, spending.

2026-03-17, stars: 25

generate-report.md

from "sourcegraph/CodeScaleBench"

Generate the aggregate CSB evaluation report from completed Harbor runs. Triggers on generate report, eval report, ccb report, benchmark report.

2026-03-17, stars: 25

package.json

"author": "sourcegraph"

"repository": "sourcegraph/CodeScaleBench"
