بنقرة واحدة
generate-report
// Generate the aggregate CSB evaluation report from completed Harbor runs. Triggers on generate report, eval report, ccb report, benchmark report.
// Generate the aggregate CSB evaluation report from completed Harbor runs. Triggers on generate report, eval report, ccb report, benchmark report.
Archive old completed benchmark runs to save disk space and speed up scans. Triggers on archive runs, clean up runs, disk space, old runs.
Audit benchmark suites against ABC framework (Task/Outcome/Reporting validity). Checks instruction quality, verifier correctness, reproducibility. Triggers on benchmark audit, audit benchmark, abc audit, task validity.
Verify infrastructure readiness before launching benchmark runs — tokens, Docker, disk, credentials. Triggers on check infra, infrastructure check, ready to run, pre-run check.
Compare benchmark results across agent configurations (baseline, SG_full). Show where configs diverge. Triggers on compare configs, config comparison, which config wins, MCP impact.
Token and cost analysis per run, suite, and config. Shows most expensive tasks and config cost comparison. Triggers on cost report, how much did it cost, token usage, spending.
Compute information retrieval quality metrics (precision, recall, MRR, nDCG, MAP) comparing file retrieval across baseline and MCP configs against ground truth. Triggers on ir analysis, retrieval metrics, file recall, ground truth, search quality.
| name | generate-report |
| description | Generate the aggregate CSB evaluation report from completed Harbor runs. Triggers on generate report, eval report, ccb report, benchmark report. |
| user-invocable | true |
Generate the aggregate CodeScaleBench evaluation report from completed Harbor runs in runs/official/.
Runs scripts/generate_eval_report.py which:
result.json (and fallback sources)./eval_reports/)eval_report.json — Full structured data (all tasks, metrics, configs)REPORT.md — Human-readable markdown with tables:
echo "=== Completed runs ===" && \
ls runs/official/ 2>/dev/null && \
echo "" && \
echo "=== Task counts per run ===" && \
for run in runs/official/*/; do \
count=$(find "$run" -name "result.json" -path "*/instance_*" -o -name "result.json" -path "*__*" 2>/dev/null | wc -l); \
echo " $(basename $run): $count tasks with results"; \
done
cd ~/CodeScaleBench && \
python3 scripts/generate_eval_report.py \
--runs-dir runs/official/ \
--output-dir ./eval_reports/ \
--selected-tasks ./configs/selected_benchmark_tasks.json
cat ./eval_reports/REPORT.md
Report files written to ./eval_reports/:
- REPORT.md (summary tables)
- eval_report.json (full structured data)
- *.csv (per-table CSV files)
If the user asks for a report on a subset of runs or a specific directory, pass --runs-dir accordingly:
python3 scripts/generate_eval_report.py \
--runs-dir /path/to/specific/runs/ \
--output-dir ./eval_reports/
To skip CSV generation:
python3 scripts/generate_eval_report.py --no-csv
To skip task selection filtering (include ALL discovered tasks, not just canonical):
python3 scripts/generate_eval_report.py --selected-tasks ""