compare-configs
Compare benchmark results across agent configurations (baseline, SG_full). Show where configs diverge. Triggers on compare configs, config comparison, which config wins, MCP impact.
| name | compare-configs |
| description | Compare benchmark results across agent configurations (baseline, SG_full). Show where configs diverge. Triggers on compare configs, config comparison, which config wins, MCP impact. |
| user-invocable | true |
Compare results between agent configurations to find signal about MCP tool impact.
cd ~/CodeScaleBench && python3 scripts/compare_configs.py --format json
Present the JSON output as markdown tables covering:
Overall pass rates:
| Config | Pass | Total | Rate |
|---|---|---|---|
| baseline | X | Y | Z% |
| SG_full | X | Y | Z% |
Divergence analysis:
Divergent task detail table:
| Suite | Task | baseline | SG_full | Signal |
|---|---|---|---|---|
| csb_sdlc_pytorch | sgt-005 | FAIL | PASS | MCP-Full helps |
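As a rough sketch of how the two tables above could be derived, the snippet below builds the pass-rate table and the divergent-task list from per-task records. The record shape (`suite`, `task`, `config`, `passed`) and the suite and task names are assumptions made for illustration; the actual JSON emitted by `scripts/compare_configs.py` may be structured differently, so adapt the field names after inspecting real output.

```python
from collections import defaultdict

# Hypothetical records: NOT the documented schema of compare_configs.py.
RESULTS = [
    {"suite": "example_suite", "task": "task-a", "config": "baseline", "passed": False},
    {"suite": "example_suite", "task": "task-a", "config": "SG_full", "passed": True},
    {"suite": "example_suite", "task": "task-b", "config": "baseline", "passed": True},
    {"suite": "example_suite", "task": "task-b", "config": "SG_full", "passed": True},
    {"suite": "example_suite", "task": "task-c", "config": "baseline", "passed": True},
    {"suite": "example_suite", "task": "task-c", "config": "SG_full", "passed": False},
]


def pass_rate_table(records):
    """Render a markdown table of pass counts and rates per config."""
    totals = defaultdict(lambda: [0, 0])  # config -> [passed, total]
    for r in records:
        totals[r["config"]][0] += int(r["passed"])
        totals[r["config"]][1] += 1
    lines = ["| Config | Pass | Total | Rate |", "|---|---|---|---|"]
    for config, (passed, total) in totals.items():
        lines.append(f"| {config} | {passed} | {total} | {passed / total:.0%} |")
    return "\n".join(lines)


def divergent_tasks(records, a="baseline", b="SG_full"):
    """Return (suite, task, a_passed, b_passed) for tasks where the configs disagree."""
    outcomes = defaultdict(dict)  # (suite, task) -> {config: passed}
    for r in records:
        outcomes[(r["suite"], r["task"])][r["config"]] = r["passed"]
    rows = []
    for (suite, task), by_config in sorted(outcomes.items()):
        # Only tasks run under both configs can be compared.
        if a in by_config and b in by_config and by_config[a] != by_config[b]:
            rows.append((suite, task, by_config[a], by_config[b]))
    return rows


if __name__ == "__main__":
    print(pass_rate_table(RESULTS))
    for suite, task, bl, sg in divergent_tasks(RESULTS):
        print(f"| {suite} | {task} | {'PASS' if bl else 'FAIL'} | {'PASS' if sg else 'FAIL'} |")
```

In this made-up sample both configs have the same pass rate, yet two of the three tasks diverge in opposite directions, which is why the divergent-task table is worth showing alongside the overall rates.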
Focus the narrative on:
The basic divergence analysis treats all SG_full tasks equally, whether or not the agent used MCP tools. For deeper MCP insight, run the MCP audit, which conditions on actual MCP usage:
python3 scripts/mcp_audit.py --paired-only --json --verbose 2>/dev/null
This separates:
Present the MCP-conditioned reward delta table:
| Group | N | baseline reward | SG_full reward | Delta |
|---|---|---|---|---|
| Used-MCP | N | X | Y | +Z% |
| Zero-MCP | N | X | Y | -Z% |
| Light | N | X | Y | Z% |
| Moderate | N | X | Y | Z% |
| Heavy | N | X | Y | Z% |
Key insight: The overall SG_full vs. baseline delta is diluted by zero-MCP tasks, where the SG_full run made no MCP calls. Conditioning on actual usage isolates the MCP signal. For the full analysis, suggest /mcp-audit.
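The dilution effect can be illustrated with a small sketch. It assumes paired per-task records with hypothetical fields `bl_reward`, `sf_reward`, and `mcp_calls`; these names and the sample values are invented for illustration and are not the output format of `scripts/mcp_audit.py`. Delta here is the plain difference of mean rewards; the Light/Moderate/Heavy buckets are left out because their thresholds are not stated in this document.

```python
from statistics import mean

# Hypothetical paired results: one row per task run under both configs.
PAIRS = [
    {"task": "task-a", "bl_reward": 0.0, "sf_reward": 1.0, "mcp_calls": 6},
    {"task": "task-b", "bl_reward": 1.0, "sf_reward": 1.0, "mcp_calls": 2},
    {"task": "task-c", "bl_reward": 1.0, "sf_reward": 0.0, "mcp_calls": 0},
    {"task": "task-d", "bl_reward": 1.0, "sf_reward": 1.0, "mcp_calls": 0},
]


def reward_delta(pairs):
    """Return (n, mean baseline reward, mean SG_full reward, delta)."""
    bl = mean(p["bl_reward"] for p in pairs)
    sf = mean(p["sf_reward"] for p in pairs)
    return len(pairs), bl, sf, sf - bl


def conditioned_deltas(pairs):
    """Split tasks by whether the SG_full run made any MCP calls."""
    groups = {
        "Overall": pairs,
        "Used-MCP": [p for p in pairs if p["mcp_calls"] > 0],
        "Zero-MCP": [p for p in pairs if p["mcp_calls"] == 0],
    }
    # Skip empty groups so mean() never sees an empty sequence.
    return {name: reward_delta(members) for name, members in groups.items() if members}


if __name__ == "__main__":
    print("| Group | N | baseline reward | SG_full reward | Delta |")
    print("|---|---|---|---|---|")
    for name, (n, bl, sf, delta) in conditioned_deltas(PAIRS).items():
        print(f"| {name} | {n} | {bl:.2f} | {sf:.2f} | {delta:+.2f} |")
```

With these invented numbers the overall delta is zero even though the used-MCP group improves, because the zero-MCP group offsets it. Real runs will of course look different.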
python3 scripts/compare_configs.py --suite csb_sdlc_pytorch --format json
python3 scripts/compare_configs.py --divergent-only --format json
python3 scripts/compare_configs.py --format table