sv-report
// Generate SV-Bench metrics reports (summary.json + report.md) for E1/E2 runs, validate metrics contracts, and produce comparison-friendly artifacts from outputs/evals/.
| name | sv-report |
| description | Generate SV-Bench metrics reports (summary.json + report.md) for E1/E2 runs, validate metrics contracts, and produce comparison-friendly artifacts from outputs/evals/. |
| metadata | {"author":"security-verifiers","version":"1.0"} |
Generate the WP1 report artifacts for evaluation runs:
- summary.json (schema: bench/schemas/summary.schema.json)
- report.md (human-readable)

This skill is for report generation/validation, not running new evals (use sv-eval for that).
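The summary.json contract can be spot-checked from Python without extra dependencies. The following is a minimal sketch, not the repository's validator: `missing_required_keys` is a hypothetical helper, and it assumes the schema file is a standard JSON Schema document with a top-level `required` array. It only checks that those top-level keys are present; it does not perform full schema validation.

```python
import json
from pathlib import Path


def missing_required_keys(summary_path: str, schema_path: str) -> list[str]:
    """Return top-level keys named in the schema's "required" array but absent from the summary.

    Shallow spot check only (hypothetical helper); nested objects and types are not validated.
    """
    summary = json.loads(Path(summary_path).read_text(encoding="utf-8"))
    schema = json.loads(Path(schema_path).read_text(encoding="utf-8"))
    # Assumption: the schema exposes a top-level "required" list; default to empty if not.
    required = schema.get("required", [])
    return [key for key in required if key not in summary]
```

An empty list means every required top-level key is present; anything else names the keys to investigate.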
.venv/) or your preferred runner.
- WEAVE_DISABLED=true

WEAVE_DISABLED=true .venv/bin/svbench_report --env e1 --input outputs/evals/sv-env-network-logs--gpt-5-mini/<run_id> --strict
WEAVE_DISABLED=true .venv/bin/svbench_report --env e2 --input outputs/evals/sv-env-config-verification--gpt-5-mini/<run_id> --strict
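For scripted use, the two commands above can be wrapped in a small helper. This is a minimal sketch and not part of the repository: `build_report_command` and `run_report` are hypothetical names. Only the flags shown above (`--env`, `--input`, `--strict`) and the `WEAVE_DISABLED` variable come from this document.

```python
import os
import subprocess

# CLI path as used in the commands above.
SVBENCH_REPORT = ".venv/bin/svbench_report"


def build_report_command(env_name: str, run_dir: str, strict: bool = True) -> list[str]:
    """Build the svbench_report argument list for one run directory (hypothetical helper)."""
    if env_name not in ("e1", "e2"):
        raise ValueError(f"env must be 'e1' or 'e2', got {env_name!r}")
    cmd = [SVBENCH_REPORT, "--env", env_name, "--input", run_dir]
    if strict:
        cmd.append("--strict")
    return cmd


def run_report(env_name: str, run_dir: str, strict: bool = True) -> int:
    """Run the CLI with WEAVE_DISABLED=true set in the child environment; return its exit code."""
    child_env = dict(os.environ, WEAVE_DISABLED="true")
    cmd = build_report_command(env_name, run_dir, strict)
    return subprocess.run(cmd, env=child_env).returncode
```

Keeping command construction separate from execution makes the argument list easy to inspect or log before anything runs.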
Outputs are written into the same run directory:
- outputs/evals/.../<run_id>/summary.json
- outputs/evals/.../<run_id>/report.md

Generate reports for all non-archived runs under outputs/evals/:
.venv/bin/python scripts/generate_svbench_reports.py
Only E1:
.venv/bin/python scripts/generate_svbench_reports.py --env e1 --strict
Only E2:
.venv/bin/python scripts/generate_svbench_reports.py --env e2 --strict
Specific run ids:
.venv/bin/python scripts/generate_svbench_reports.py --run-ids d4e7f897 cb97305e
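After a batch run, it can help to confirm which run directories still lack their report artifacts. The sketch below is an illustration under stated assumptions: `missing_artifacts` is a hypothetical helper, the two-level layout `outputs/evals/<env--model>/<run_id>/` is inferred from the example paths in this document, and the function does not know which runs count as archived.

```python
from pathlib import Path

# The two per-run artifacts named in this document.
ARTIFACTS = ("summary.json", "report.md")


def missing_artifacts(evals_root: str) -> dict[str, list[str]]:
    """Map each run directory (relative to evals_root) to the artifacts it lacks.

    Assumes run directories sit exactly two levels below evals_root.
    Run directories with both artifacts present are omitted from the result.
    """
    root = Path(evals_root)
    missing: dict[str, list[str]] = {}
    for run_dir in sorted(p for p in root.glob("*/*") if p.is_dir()):
        absent = [name for name in ARTIFACTS if not (run_dir / name).is_file()]
        if absent:
            missing[run_dir.relative_to(root).as_posix()] = absent
    return missing
```

Calling `missing_artifacts("outputs/evals")` returns an empty dict when every discovered run directory has both files.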
The Make targets produce comparison-friendly JSON across runs:
make report-network-logs
make report-config-verification
These are intended for quick comparisons and dashboards. The contract-grade per-run artifacts are generated via bench.report / svbench_report.
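The per-run summary.json files can also be lined up side by side in Python. The following is a rough sketch that is independent of the Make targets: `numeric_fields` and `compare_runs` are hypothetical helpers, and because the summary schema is not reproduced here, the code makes no assumption about field names and simply collects every numeric value it finds.

```python
import json
from pathlib import Path


def numeric_fields(summary: dict, prefix: str = "") -> dict[str, float]:
    """Flatten nested dicts into dotted keys, keeping only numeric leaves (booleans excluded)."""
    flat: dict[str, float] = {}
    for key, value in summary.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(numeric_fields(value, prefix=f"{name}."))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = float(value)
    return flat


def compare_runs(run_dirs: list[str]) -> dict[str, dict[str, float]]:
    """Return {metric: {run_id: value}} built from each run directory's summary.json."""
    table: dict[str, dict[str, float]] = {}
    for run_dir in run_dirs:
        run_path = Path(run_dir)
        summary = json.loads((run_path / "summary.json").read_text(encoding="utf-8"))
        for metric, value in numeric_fields(summary).items():
            # The run id is taken to be the directory name.
            table.setdefault(metric, {})[run_path.name] = value
    return table
```

Metrics that appear in only one run show up with a single entry, which makes gaps between runs easy to spot.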