compare-configs

Name: Compare Configs
Author: sourcegraph

Description: Compare benchmark results across agent configurations (baseline, SG_full). Show where configs diverge. Triggers on compare configs, config comparison, which config wins, MCP impact.

stars:25

forks:3

updated: March 17, 2026 at 01:39

SKILL.md

name	compare-configs
description	Compare benchmark results across agent configurations (baseline, SG_full). Show where configs diverge. Triggers on compare configs, config comparison, which config wins, MCP impact.
user-invocable	true

Compare Configs

Compare results between agent configurations to find signal about MCP tool impact.

Steps

1. Run the comparison script

cd ~/CodeScaleBench && python3 scripts/compare_configs.py --format json
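When the comparison is driven from Python rather than a shell, the call can be wrapped roughly as follows. This is a minimal sketch: the helper name `run_json` is hypothetical, and it assumes only that the script prints a single JSON document to stdout, which `--format json` suggests but the skill does not spell out.

```python
import json
import subprocess
import sys


def run_json(cmd, cwd=None):
    """Run a command and parse its stdout as one JSON document (assumed output shape)."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)


# Stand-in command so the sketch runs anywhere. In a CodeScaleBench checkout the
# command would be ["python3", "scripts/compare_configs.py", "--format", "json"]
# with cwd pointing at the repository root.
demo_cmd = [sys.executable, "-c", "import json; print(json.dumps({'ok': True}))"]
print(run_json(demo_cmd))
```

`check=True` makes a non-zero exit raise `subprocess.CalledProcessError` instead of handing an empty string to the JSON parser.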

2. Parse and present the results

Present the JSON output as markdown tables covering:

Overall pass rates:

Config	Pass	Total	Rate
baseline	X	Y	Z%
SG_full	X	Y	Z%
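The pass-rate table above could be rendered along these lines. This is a sketch only: the input shape `{config: {task_id: passed}}` is an assumption, not the script's documented JSON schema, and the task IDs and outcomes are made up for illustration.

```python
def pass_rate_table(results):
    """Render a markdown pass-rate table.

    `results` maps config name -> {task_id: bool}. This shape is hypothetical;
    adapt the loop to whatever compare_configs.py actually emits.
    """
    lines = ["| Config | Pass | Total | Rate |", "|---|---|---|---|"]
    for config, tasks in results.items():
        total = len(tasks)
        passed = sum(1 for ok in tasks.values() if ok)
        # Guard against an empty config so the sketch never divides by zero.
        rate = f"{100 * passed / total:.1f}%" if total else "n/a"
        lines.append(f"| {config} | {passed} | {total} | {rate} |")
    return "\n".join(lines)


# Illustrative data only, not real benchmark results.
example = {
    "baseline": {"task-a": True, "task-b": False},
    "SG_full": {"task-a": True, "task-b": True},
}
print(pass_rate_table(example))
```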

Divergence analysis:

Tasks where all configs pass (stable)
Tasks where all configs fail (likely a task or adapter issue)
Divergent tasks (some configs pass, some fail): this is the interesting signal
Divergent tasks where baseline fails but MCP passes: MCP tools are helping
Divergent tasks where MCP fails but baseline passes: MCP tools are hurting
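The categories in the list above can be expressed as a small classifier. This is a sketch under assumptions: the function name, the label strings, and the `{config: passed}` input are all hypothetical, and `"baseline"` is taken to be the only non-MCP config.

```python
def classify_task(outcomes, baseline="baseline"):
    """Classify one task from a {config: passed} mapping (hypothetical shape)."""
    passed = {config for config, ok in outcomes.items() if ok}
    if len(passed) == len(outcomes):
        return "stable"      # every config passes
    if not passed:
        return "all-fail"    # every config fails: likely a task or adapter issue
    # Divergent from here on: at least one pass and at least one fail.
    # If baseline passed, some MCP config failed; otherwise some MCP config passed.
    return "mcp-hurts" if baseline in passed else "mcp-helps"


# Illustrative outcomes only.
print(classify_task({"baseline": False, "SG_full": True}))
print(classify_task({"baseline": True, "SG_full": False}))
```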

Divergent task detail table:

Suite	Task	baseline	SG_full	Signal
csb_sdlc_pytorch	sgt-005	PASS	PASS	MCP-Full matches

3. Highlight key findings

Focus the narrative on:

Biggest winner: Which config has the highest pass rate
MCP helps: Tasks where only MCP configs pass
MCP hurts: Tasks where only baseline passes (investigate why)
All-fail: Tasks failing everywhere (fix these first, helps all configs)

4. MCP-conditioned analysis (optional, when user asks about MCP specifically)

The basic divergence analysis treats all SG tasks equally. For deeper MCP insight, run the MCP audit, which conditions on actual tool usage:

python3 scripts/mcp_audit.py --paired-only --json --verbose 2>/dev/null

This separates:

Used-MCP tasks: Agent actually called MCP tools. Reward delta reflects true MCP value.
Zero-MCP tasks: MCP available but unused. Reward delta reflects preamble overhead only.
Intensity buckets: Light (1-5 calls), Moderate (6-20), Heavy (20+). These show dose-response.
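The grouping above can be written as a lookup on the MCP call count. This is a minimal sketch, not the audit script's actual logic: the function name is hypothetical, and because the stated ranges overlap at exactly 20 calls, it assumes that 20 counts as Moderate.

```python
def intensity_bucket(mcp_calls):
    """Map a task's MCP call count to a usage group.

    Assumption: exactly 20 calls is treated as Moderate; the real threshold
    in mcp_audit.py may differ.
    """
    if mcp_calls <= 0:
        return "Zero-MCP"   # MCP available but unused
    if mcp_calls <= 5:
        return "Light"
    if mcp_calls <= 20:
        return "Moderate"
    return "Heavy"


for calls in (0, 3, 12, 45):
    print(calls, intensity_bucket(calls))
```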

Present the MCP-conditioned reward delta table:

Group	N	Baseline (BL) reward	SG_full (SF) reward	Delta
Used-MCP	N	X	Y	+Z%
Zero-MCP	N	X	Y	-Z%
Light	N	X	Y	Z%
Moderate	N	X	Y	Z%
Heavy	N	X	Y	Z%

Key insight: The overall SG_full vs. baseline delta is diluted by zero-MCP tasks. Conditioning on usage isolates the tasks where MCP tools were actually called, which is where the MCP signal lives. For a full analysis, suggest /mcp-audit.
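The dilution effect is plain weighted-average arithmetic: the overall mean delta is the N-weighted mean of the per-group deltas. The sketch below uses made-up numbers, not benchmark results, and a hypothetical helper name.

```python
def overall_delta(groups):
    """N-weighted mean of per-group reward deltas.

    `groups` is a list of (n_tasks, mean_delta) pairs.
    """
    total_n = sum(n for n, _ in groups)
    return sum(n * delta for n, delta in groups) / total_n


# Illustrative inputs: 40 used-MCP tasks at +0.10, 60 zero-MCP tasks at -0.02.
used_mcp = (40, 0.10)
zero_mcp = (60, -0.02)
print(f"overall delta: {overall_delta([used_mcp, zero_mcp]):.3f}")
```

With these inputs the overall delta is 0.028, well below the 0.10 observed on the tasks that actually used MCP tools.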

Variants

Filter to one suite

python3 scripts/compare_configs.py --suite csb_sdlc_pytorch --format json

Show only divergent tasks

python3 scripts/compare_configs.py --divergent-only --format json

Table format (compact)

python3 scripts/compare_configs.py --format table

related-skills.json

same repository

archive-run.md

from "sourcegraph/CodeScaleBench"

Archive old completed benchmark runs to save disk space and speed up scans. Triggers on archive runs, clean up runs, disk space, old runs.

2026-03-17, 25 stars

benchmark-audit.md

from "sourcegraph/CodeScaleBench"

Audit benchmark suites against ABC framework (Task/Outcome/Reporting validity). Checks instruction quality, verifier correctness, reproducibility. Triggers on benchmark audit, audit benchmark, abc audit, task validity.

2026-03-17, 25 stars

check-infra.md

from "sourcegraph/CodeScaleBench"

Verify infrastructure readiness before launching benchmark runs — tokens, Docker, disk, credentials. Triggers on check infra, infrastructure check, ready to run, pre-run check.

2026-03-17, 25 stars

cost-report.md

from "sourcegraph/CodeScaleBench"

Token and cost analysis per run, suite, and config. Shows most expensive tasks and config cost comparison. Triggers on cost report, how much did it cost, token usage, spending.

2026-03-17, 25 stars

generate-report.md

from "sourcegraph/CodeScaleBench"

Generate the aggregate CSB evaluation report from completed Harbor runs. Triggers on generate report, eval report, ccb report, benchmark report.

2026-03-17, 25 stars

ir-analysis.md

from "sourcegraph/CodeScaleBench"

Compute information retrieval quality metrics (precision, recall, MRR, nDCG, MAP) comparing file retrieval across baseline and MCP configs against ground truth. Triggers on ir analysis, retrieval metrics, file recall, ground truth, search quality.

2026-03-17, 25 stars

package.json

"author": "sourcegraph"

"repository": "sourcegraph/CodeScaleBench"
