sv-eval
// Run and analyze Security Verifiers evaluations. Use when asked to evaluate models on E1 (network-logs) or E2 (config-verification), generate metrics reports, compare model performance, or analyze eval results.
| name | sv-eval |
| description | Run and analyze Security Verifiers evaluations. Use when asked to evaluate models on E1 (network-logs) or E2 (config-verification), generate metrics reports, compare model performance, or analyze eval results. |
| metadata | {"author":"security-verifiers","version":"1.0"} |
Run reproducible evaluations on E1 (network-logs) and E2 (config-verification) environments, generate metrics reports, and analyze results.
Ensure environment variables are set (check .env file):
- OPENAI_API_KEY - For OpenAI models (gpt-*)
- OPENROUTER_API_KEY - For non-OpenAI models via OpenRouter
- WANDB_API_KEY - For Weave logging (optional)

E1 (network-logs): Single-turn anomaly detection with calibration and asymmetric costs.
# Basic eval (10 examples)
make eval-e1 MODELS="gpt-5-mini" N=10
# Multiple models
make eval-e1 MODELS="gpt-5-mini,gpt-4.1-mini,qwen3-14b" N=100
# Full production dataset
make eval-e1 MODELS="gpt-5-mini" N=1800 DATASET="iot23-train-dev-test-v1.jsonl"
# OOD datasets
make eval-e1 MODELS="gpt-5-mini" N=600 DATASET="cic-ids-2017-ood-v1.jsonl"
make eval-e1 MODELS="gpt-5-mini" N=600 DATASET="unsw-nb15-ood-v1.jsonl"
E2 (config-verification): Multi-turn tool-grounded auditing with KubeLinter, Semgrep, and OPA.
# Basic eval with tools (2 examples)
make eval-e2 MODELS="gpt-5-mini" N=2 INCLUDE_TOOLS=true
# Without tools (single-turn)
make eval-e2 MODELS="gpt-5-mini" N=10 INCLUDE_TOOLS=false
# Dataset options
make eval-e2 MODELS="gpt-5-mini" N=50 DATASET="k8s-labeled-v1.jsonl"
make eval-e2 MODELS="gpt-5-mini" N=50 DATASET="terraform-labeled-v1.jsonl"
make eval-e2 MODELS="gpt-5-mini" N=100 DATASET="combined" # default
Early stopping prevents wasted API costs:
# Stop after 5 consecutive errors (default: 3)
make eval-e1 MODELS="gpt-5-mini" N=100 MAX_CONSECUTIVE_ERRORS=5
# Disable early stopping
make eval-e1 MODELS="gpt-5-mini" N=100 MAX_CONSECUTIVE_ERRORS=0
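The early-stopping behavior described above can be illustrated with a minimal sketch. This is not the project's implementation: the function name `run_with_early_stop` and its `evaluate` callback are hypothetical, and the only behavior taken from this document is that a run stops after a configurable number of consecutive errors and that a value of 0 disables the check.

```python
from typing import Callable, Iterable, List, Optional


def run_with_early_stop(
    examples: Iterable[str],
    evaluate: Callable[[str], str],
    max_consecutive_errors: int = 3,
) -> List[Optional[str]]:
    """Evaluate examples, aborting after too many consecutive failures.

    Hypothetical helper: a limit of 0 disables early stopping,
    mirroring MAX_CONSECUTIVE_ERRORS=0.
    """
    results: List[Optional[str]] = []
    streak = 0  # failures since the last success
    for example in examples:
        try:
            results.append(evaluate(example))
            streak = 0  # any success resets the counter
        except Exception:
            results.append(None)  # record the failure
            streak += 1
            if max_consecutive_errors and streak >= max_consecutive_errors:
                break  # stop before spending more API calls
    return results
```

Counting consecutive rather than total errors means an occasional transient failure does not end a long run, while a persistent problem such as a bad API key stops it quickly.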
SV-Bench supports two kinds of reporting:
- summary.json (schema-stable) + report.md (human readable)
- report-*.json aggregations for quick comparisons

# E1
WEAVE_DISABLED=true .venv/bin/svbench_report --env e1 --input outputs/evals/sv-env-network-logs--gpt-5-mini/<run_id> --strict
# E2
WEAVE_DISABLED=true .venv/bin/svbench_report --env e2 --input outputs/evals/sv-env-config-verification--gpt-5-mini/<run_id> --strict
# All non-archived runs
make report-network-logs
# Specific runs
make report-network-logs RUN_IDS="run_abc123 run_def456"
# Custom output path
make report-network-logs OUTPUT="reports/e1-comparison.json"
make report-config-verification
make report-config-verification RUN_IDS="run_abc123"
outputs/evals/sv-env-{name}--{model}/{run_id}/
├── metadata.json # Run config, versions, git SHA
├── results.jsonl # Per-example results
├── summary.json # SV-Bench schema summary (contract-grade)
└── report.md # Human-readable report
Note: OpenRouter models create nested directories:
outputs/evals/sv-env-network-logs--qwen/qwen-2.5-7b-instruct/{run_id}/
outputs/evals/sv-env-network-logs--meta-llama/llama-3.1-8b-instruct/{run_id}/
The report scripts use a recursive glob to find all runs, regardless of nesting depth.
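As a rough illustration of how a recursive glob copes with both flat and nested model directories, the sketch below finds run directories by looking for metadata.json. The helper name `find_run_dirs` is hypothetical, and treating metadata.json as the marker of a run directory is an assumption based on the layout shown above. The actual report scripts may work differently.

```python
from pathlib import Path
from typing import List


def find_run_dirs(root: Path) -> List[Path]:
    """Return every directory under root that contains a metadata.json file.

    Assumption: a run directory is identified by its metadata.json.
    """
    # rglob descends to any depth, so nested OpenRouter paths are found too
    return sorted(p.parent for p in root.rglob("metadata.json"))
```

For example, `find_run_dirs(Path("outputs/evals"))` would return both `sv-env-network-logs--gpt-5-mini/<run_id>` and `sv-env-network-logs--qwen/qwen-2.5-7b-instruct/<run_id>` style directories if they exist.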
| Metric | Description | Target |
|---|---|---|
| Accuracy | Overall classification accuracy | Higher is better |
| ECE | Expected Calibration Error | Lower is better |
| FN% | False negative rate (missed threats) | Minimize |
| FP% | False positive rate | Minimize |
| Abstain% | Abstention rate | Context-dependent |
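For readers unfamiliar with ECE, the sketch below computes the standard equal-width binned Expected Calibration Error. The function name, the choice of 10 bins, and the input shape (one confidence and one correctness flag per example) are assumptions for illustration. The binning used by the SV-Bench report may differ.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: sum over bins of (bin size / n) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A model that reports 0.9 confidence on four examples and gets three right has an accuracy of 0.75 in that bin, so its ECE is about 0.15. A perfectly calibrated model scores 0.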
| Metric | Description | Target |
|---|---|---|
| MeanReward | Average episode reward | Higher is better |
| FormatSuccess% | Valid JSON output rate | 100% |
| AvgTools | Tool calls per episode | Lower is more efficient |
| AvgTurns | Turns per episode | Lower is more efficient |
Model names are auto-resolved via scripts/model_router.py:
- OpenAI: gpt-5-mini, gpt-4.1-mini, o1-mini
- OpenRouter: qwen3-14b → qwen/qwen3-14b, llama-3.1-8b → meta-llama/llama-3.1-8b-instruct

make eval-e1 MODELS="gpt-5-mini,gpt-4.1-mini,qwen3-14b" N=500
make report-network-logs
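The short model names used in commands like the ones above are resolved to a provider and a full model ID. The sketch below shows one way such routing could work. It is not the contents of scripts/model_router.py: the function name `resolve_model`, the alias table (built only from the two examples in this document), and the `o1-` prefix rule are all assumptions.

```python
from typing import Dict, Tuple

# Hypothetical alias table, built only from the examples in this document.
OPENROUTER_ALIASES: Dict[str, str] = {
    "qwen3-14b": "qwen/qwen3-14b",
    "llama-3.1-8b": "meta-llama/llama-3.1-8b-instruct",
}
# Assumption: names with these prefixes go straight to OpenAI.
OPENAI_PREFIXES: Tuple[str, ...] = ("gpt-", "o1-")


def resolve_model(name: str) -> Tuple[str, str]:
    """Return (provider, model_id) for a short or fully qualified model name."""
    if "/" in name:
        return ("openrouter", name)  # already a full OpenRouter path
    if name.startswith(OPENAI_PREFIXES):
        return ("openai", name)
    if name in OPENROUTER_ALIASES:
        return ("openrouter", OPENROUTER_ALIASES[name])
    raise ValueError(f"Unknown model {name!r}; try the full OpenRouter path")
```

Under these assumptions, `resolve_model("qwen3-14b")` yields `("openrouter", "qwen/qwen3-14b")`, and an unrecognized short name raises an error that points to the full-path workaround.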
Rate limits: Reduce N or use MAX_CONSECUTIVE_ERRORS.
Missing API key: Check .env has correct key for model provider.
Model not found: Use full OpenRouter path (e.g., openai/gpt-5-mini).
If a model shows 0% format success, common causes:
Markdown code blocks: Many models (gpt-4o-mini, qwen, llama) wrap JSON in ```json...``` blocks. The parsers handle this automatically via extract_json_from_markdown() in sv_shared/parsers.py.
Model outputs prose instead of JSON: Some models explain what they would do rather than returning the required JSON format. This is model behavior, not a parsing bug. Check results.jsonl completions.
Empty prompts: If the prompt field in results.jsonl is empty, the dataset loading may have failed. Check that:
- … question field)
- … environments/sv-env-*/data/
- HF_TOKEN is set for Hub access

# Check a specific result file
cat outputs/evals/sv-env-*--{model}/{run_id}/results.jsonl | head -1 | python -m json.tool
# Look for empty prompts (indicates dataset loading issue)
grep '"prompt": ""' outputs/evals/sv-env-*--{model}/{run_id}/results.jsonl
# Check completion format
cat outputs/evals/sv-env-*--{model}/{run_id}/results.jsonl | python -c "
import json, sys
for line in sys.stdin:
    r = json.loads(line)
    print(f'Index {r[\"index\"]}: format_reward={r[\"rewards\"][\"format_reward\"]}')
    print(f'  completion: {repr(r[\"completion\"][:100])}...')
"
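To make the Markdown-wrapping failure mode concrete, here is a minimal sketch of unwrapping a fenced JSON completion before parsing. It is not the code of extract_json_from_markdown() in sv_shared/parsers.py. The function name `extract_json_sketch` and the regular expression are assumptions, written only to show the idea.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # a Markdown code fence, built here to avoid a literal fence
FENCE_RE = re.compile(
    FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE
)


def extract_json_sketch(text: str) -> Optional[Any]:
    """Parse JSON from a completion, unwrapping a Markdown fence if present.

    Hypothetical helper: returns None when no valid JSON can be recovered.
    """
    match = FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None  # prose or malformed output counts as a format failure
```

A completion that explains itself in prose returns None here, which matches the second cause listed above: that case is model behavior, not a parsing bug.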