Name: Benchmark Audit
Author: sourcegraph

Audit benchmark suites against the ABC framework (Task/Outcome/Reporting validity). Checks instruction quality, verifier correctness, reproducibility. Triggers on benchmark audit, audit benchmark, abc audit, task validity.

stars:25

forks:3

updated: March 17, 2026, 01:39

SKILL.md


name: benchmark-audit
description: Audit benchmark suites against the ABC framework (Task/Outcome/Reporting validity). Checks instruction quality, verifier correctness, reproducibility. Triggers on benchmark audit, audit benchmark, abc audit, task validity.
user-invocable: true

Benchmark Audit

Audit benchmark suites against the ABC (Agent Benchmark Criteria) framework across three dimensions: Task Validity, Outcome Validity, and Reporting.

What This Does

Runs scripts/abc_audit.py, which:

Checks each benchmark task against ABC criteria
Evaluates Task Validity (instructions, metadata, Docker setup)
Evaluates Outcome Validity (verifier quality, determinism, scoring)
Evaluates Reporting (metrics completeness, error handling)
Produces letter grades (A-F) per dimension and overall
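
The grade cutoffs used by scripts/abc_audit.py are not stated here. As a rough illustration of how a 0-1 score could map to a letter grade, the sketch below uses hypothetical cutoffs, chosen only so that they agree with the sample rows shown under "Present findings" (0.90 gives A-, 0.85 gives B+, 0.80 gives B). The names `GRADE_CUTOFFS` and `letter_grade` are invented for this example.

```python
# Hypothetical cutoffs, highest first; abc_audit.py may use different ones.
GRADE_CUTOFFS = [
    (0.95, "A"), (0.90, "A-"),
    (0.85, "B+"), (0.80, "B"), (0.75, "B-"),
    (0.70, "C+"), (0.65, "C"), (0.60, "C-"),
    (0.50, "D"),
]


def letter_grade(score: float) -> str:
    """Return the first grade whose cutoff the score meets, else F."""
    for cutoff, grade in GRADE_CUTOFFS:
        if score >= cutoff:
            return grade
    return "F"
```

How the overall grade is derived from the three dimension scores is left out on purpose, since the source does not describe it.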

Steps

1. Audit a specific suite

cd ~/CodeScaleBench && python3 scripts/abc_audit.py --suite csb_sdlc_pytorch --format table

2. Audit all suites

python3 scripts/abc_audit.py --all --format table

3. Show only critical issues

python3 scripts/abc_audit.py --suite csb_sdlc_swebenchpro --critical-only

4. JSON output for detailed analysis

python3 scripts/abc_audit.py --all --format json
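
The schema of the JSON output is not documented here, so a reasonable first step is to look at its shape before writing any analysis against it. The helper below is a schema-agnostic sketch: `outline` is a hypothetical name, and it only assumes that the output is valid JSON.

```python
import json


def outline(value, depth=2, indent=0):
    """Return lines listing keys and value types, down to `depth` levels."""
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            lines.append(f"{pad}{key}: {type(child).__name__}")
            if depth > 1:
                lines.extend(outline(child, depth - 1, indent + 1))
    elif isinstance(value, list) and value:
        # Describe only the first element, assuming the list is homogeneous.
        lines.append(f"{pad}[0]: {type(value[0]).__name__} (of {len(value)})")
        if depth > 1:
            lines.extend(outline(value[0], depth - 1, indent + 1))
    return lines


# Stand-in document; replace with the text produced by --format json.
sample = json.loads('{"a": {"b": 1}, "c": [1, 2]}')
print("\n".join(outline(sample)))
```

Once the real keys are known, the outline can be replaced by direct lookups.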

5. Filter by dimension

python3 scripts/abc_audit.py --suite csb_sdlc_pytorch --dimension task_validity
python3 scripts/abc_audit.py --suite csb_sdlc_pytorch --dimension outcome_validity
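
When these invocations are driven from another script, the command line can be assembled programmatically. The sketch below uses only the flags shown in steps 1 to 5; the function name `build_audit_command` is hypothetical, and whether the script accepts every combination of flags (for example `--all` together with `--dimension`) is an assumption.

```python
import shlex


def build_audit_command(suite=None, all_suites=False, fmt=None,
                        critical_only=False, dimension=None):
    """Build an argv list for scripts/abc_audit.py from the documented flags."""
    # Require exactly one of: a named suite, or all suites.
    if all_suites == (suite is not None):
        raise ValueError("pass either suite=... or all_suites=True")
    cmd = ["python3", "scripts/abc_audit.py"]
    cmd += ["--all"] if all_suites else ["--suite", suite]
    if fmt:
        cmd += ["--format", fmt]
    if critical_only:
        cmd.append("--critical-only")
    if dimension:
        cmd += ["--dimension", dimension]
    return cmd


print(shlex.join(build_audit_command(suite="csb_sdlc_pytorch", fmt="table")))
# python3 scripts/abc_audit.py --suite csb_sdlc_pytorch --format table
```

The returned list can be passed to `subprocess.run(cmd, cwd=...)` from the repository root, which avoids shell quoting issues.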

6. Present findings

For each suite, present:

| Dimension | Grade | Score | Critical Issues | Warnings |
|---|---|---|---|---|
| Task Validity | B+ | 0.85 | 0 | 2 |
| Outcome Validity | A- | 0.90 | 0 | 1 |
| Reporting | B | 0.80 | 1 | 3 |
| Overall | B+ | 0.85 | 1 | 6 |

List any critical issues with their criterion ID and recommended fix.
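
A summary in the layout above can be rendered from per-dimension results with a few lines of Python. This is a minimal sketch: the dictionary keys (`dimension`, `grade`, `score`, `critical`, `warnings`) and the function name `render_findings` are assumptions made for illustration and are not taken from the script's real output.

```python
HEADERS = ["Dimension", "Grade", "Score", "Critical Issues", "Warnings"]


def render_findings(rows):
    """Render result rows as a column-aligned plain-text table."""
    table = [HEADERS] + [
        [r["dimension"], r["grade"], f"{r['score']:.2f}",
         str(r["critical"]), str(r["warnings"])]
        for r in rows
    ]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(HEADERS))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    )


rows = [
    {"dimension": "Task Validity", "grade": "B+", "score": 0.85,
     "critical": 0, "warnings": 2},
    {"dimension": "Reporting", "grade": "B", "score": 0.80,
     "critical": 1, "warnings": 3},
]
print(render_findings(rows))
```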

ABC Criteria Reference

T1: Instructions present and non-empty
T2: task.toml has required fields (time_limit_sec, language, difficulty)
T3: Dockerfile builds successfully
T4: No template placeholders in instructions
T5: Instructions don't leak evaluation methodology
O1: test.sh exists and is executable
O2: Verifier has meaningful assertions (not just exit 0)
O3: Scoring is deterministic (same input → same score)
O4: Partial credit where appropriate
R1: Metrics extraction works (result.json → task_metrics.json)
R2: Error handling doesn't mask failures
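
To make a few of these criteria concrete, here is a minimal sketch of what checks for T1, T4, and O1 could look like. It is not the implementation in scripts/abc_audit.py: the function names and the placeholder patterns are hypothetical, and the caller is assumed to supply the instruction text and the path to test.sh, since the task directory layout is not described here.

```python
import os
import re
from pathlib import Path

# Hypothetical placeholder patterns; the real audit may look for others.
PLACEHOLDER_RE = re.compile(r"\{\{.*?\}\}|<[A-Z_]{3,}>|\bTODO\b|\bTBD\b")


def check_t1(instructions: str) -> bool:
    """T1: instructions are present and non-empty."""
    return bool(instructions.strip())


def check_t4(instructions: str) -> list:
    """T4: return leftover template placeholders (empty list means pass)."""
    return PLACEHOLDER_RE.findall(instructions)


def check_o1(test_script: Path) -> bool:
    """O1: the verifier script exists and is executable."""
    return test_script.is_file() and os.access(test_script, os.X_OK)
```

Returning the matched placeholders from `check_t4`, instead of a bare boolean, makes it easier to report a critical issue alongside its criterion ID and a suggested fix.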

Related Skills

/score-tasks — Score individual task quality (instruction clarity, verifier quality, reproducibility)
/validate-tasks — Pre-flight validation (lighter, focused on "will this run?")

related-skills.json

same repository

archive-run.md

from "sourcegraph/CodeScaleBench"

Archive old completed benchmark runs to save disk space and speed up scans. Triggers on archive runs, clean up runs, disk space, old runs.

2026-03-17, 25 stars

check-infra.md

from "sourcegraph/CodeScaleBench"

Verify infrastructure readiness before launching benchmark runs — tokens, Docker, disk, credentials. Triggers on check infra, infrastructure check, ready to run, pre-run check.

2026-03-17, 25 stars

compare-configs.md

from "sourcegraph/CodeScaleBench"

Compare benchmark results across agent configurations (baseline, SG_full). Show where configs diverge. Triggers on compare configs, config comparison, which config wins, MCP impact.

2026-03-17, 25 stars

cost-report.md

from "sourcegraph/CodeScaleBench"

Token and cost analysis per run, suite, and config. Shows most expensive tasks and config cost comparison. Triggers on cost report, how much did it cost, token usage, spending.

2026-03-17, 25 stars

generate-report.md

from "sourcegraph/CodeScaleBench"

Generate the aggregate CSB evaluation report from completed Harbor runs. Triggers on generate report, eval report, ccb report, benchmark report.

2026-03-17, 25 stars

ir-analysis.md

from "sourcegraph/CodeScaleBench"

Compute information retrieval quality metrics (precision, recall, MRR, nDCG, MAP) comparing file retrieval across baseline and MCP configs against ground truth. Triggers on ir analysis, retrieval metrics, file recall, ground truth, search quality.

2026-03-17, 25 stars

package.json

"author": "sourcegraph"

"repository": "sourcegraph/CodeScaleBench"

