benchmark-audit
Audit benchmark suites against ABC framework (Task/Outcome/Reporting validity). Checks instruction quality, verifier correctness, reproducibility. Triggers on benchmark audit, audit benchmark, abc audit, task validity.
| name | benchmark-audit |
| description | Audit benchmark suites against ABC framework (Task/Outcome/Reporting validity). Checks instruction quality, verifier correctness, reproducibility. Triggers on benchmark audit, audit benchmark, abc audit, task validity. |
| user-invocable | true |
Audit benchmark suites against the ABC (Agent Benchmark Criteria) framework across three dimensions: Task Validity, Outcome Validity, and Reporting.
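Runs `scripts/abc_audit.py`, which checks instruction quality, verifier correctness, and reproducibility for the selected suites. Example invocations: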
```bash
# Audit a single suite and print a table
cd ~/CodeScaleBench && python3 scripts/abc_audit.py --suite csb_sdlc_pytorch --format table

# Audit all suites
python3 scripts/abc_audit.py --all --format table

# Report only critical issues for one suite
python3 scripts/abc_audit.py --suite csb_sdlc_swebenchpro --critical-only

# Emit JSON for all suites
python3 scripts/abc_audit.py --all --format json

# Restrict the audit to one dimension
python3 scripts/abc_audit.py --suite csb_sdlc_pytorch --dimension task_validity
python3 scripts/abc_audit.py --suite csb_sdlc_pytorch --dimension outcome_validity
```
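If the invocations above need to be assembled programmatically, a small helper can keep the flag combinations consistent. This is a minimal sketch: `build_audit_command` is a hypothetical name, it uses only the flags that appear in the commands above, and it assumes (without confirmation from the script itself) that those flags can be freely combined.

```python
from typing import List, Optional


def build_audit_command(
    suite: Optional[str] = None,
    *,
    all_suites: bool = False,
    fmt: Optional[str] = None,
    dimension: Optional[str] = None,
    critical_only: bool = False,
) -> List[str]:
    """Build an argv list for scripts/abc_audit.py (hypothetical helper).

    Only flags shown in the skill's example commands are used. Whether the
    script accepts every combination is an assumption.
    """
    # Exactly one of `suite` or `all_suites` must be chosen.
    if all_suites == (suite is not None):
        raise ValueError("pass exactly one of suite=... or all_suites=True")

    cmd = ["python3", "scripts/abc_audit.py"]
    cmd += ["--all"] if all_suites else ["--suite", suite]
    if fmt is not None:
        cmd += ["--format", fmt]
    if dimension is not None:
        cmd += ["--dimension", dimension]
    if critical_only:
        cmd.append("--critical-only")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, cwd=repo_root)`, where `repo_root` would be the CodeScaleBench checkout.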
For each suite, present:
| Dimension | Grade | Score | Critical Issues | Warnings |
|---|---|---|---|---|
| Task Validity | B+ | 0.85 | 0 | 2 |
| Outcome Validity | A- | 0.90 | 0 | 1 |
| Reporting | B | 0.80 | 1 | 3 |
| Overall | B+ | 0.85 | 1 | 6 |
List any critical issues with their criterion ID and recommended fix.
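The presentation format above can be sketched as a small renderer. The input shape here is an assumption, not the documented schema of `--format json`: the keys `dimension`, `grade`, `score`, `critical`, `warnings`, `criterion_id`, `message`, and `fix` are hypothetical names chosen for illustration, and the Overall row is passed in rather than recomputed.

```python
from typing import Dict, List

HEADER = "| Dimension | Grade | Score | Critical Issues | Warnings |"
DIVIDER = "|---|---|---|---|---|"


def render_summary(rows: List[Dict]) -> str:
    """Render per-dimension rows as a markdown table.

    Each row is a dict with hypothetical keys: dimension, grade, score,
    critical (count), warnings (count).
    """
    lines = [HEADER, DIVIDER]
    for row in rows:
        lines.append(
            f"| {row['dimension']} | {row['grade']} | {row['score']:.2f} "
            f"| {row['critical']} | {row['warnings']} |"
        )
    return "\n".join(lines)


def render_critical(issues: List[Dict]) -> str:
    """List critical issues with criterion ID and recommended fix.

    Each issue is a dict with hypothetical keys: criterion_id, message, fix.
    """
    if not issues:
        return "No critical issues."
    return "\n".join(
        f"- {issue['criterion_id']}: {issue['message']} Fix: {issue['fix']}"
        for issue in issues
    )
```

Calling `render_summary` with one dict per dimension plus an Overall dict reproduces the table layout shown above, and `render_critical` produces the follow-up list.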
- `/score-tasks`: Score individual task quality (instruction clarity, verifier quality, reproducibility)
- `/validate-tasks`: Pre-flight validation (lighter, focused on "will this run?")