with one click
cost-benchmark
// Run the corpus benchmark — booster locally, optional Gemini/Sonnet/Opus baselines — and persist a verifiable measured-vs-claimed table
// Run the corpus benchmark — booster locally, optional Gemini/Sonnet/Opus baselines — and persist a verifiable measured-vs-claimed table
Comprehensive GitHub project management with swarm-coordinated issue tracking, project board automation, and sprint planning
Comprehensive GitHub code review with AI-powered swarm coordination
Multi-repository coordination, synchronization, and architecture management with AI swarm orchestration
Comprehensive GitHub release orchestration with AI swarm coordination for automated versioning, testing, deployment, and rollback management
Advanced GitHub Actions workflow automation with AI swarm coordination, intelligent CI/CD pipelines, and comprehensive repository management
Comprehensive truth scoring, code quality verification, and automatic rollback system with 0.95 accuracy threshold for ensuring high-quality agent outputs and codebase reliability.
| name | cost-benchmark |
| description | Run the corpus benchmark — booster locally, optional Gemini/Sonnet/Opus baselines — and persist a verifiable measured-vs-claimed table |
| argument-hint | [--llm] [--anthropic] |
| allowed-tools | Bash |
Runs scripts/bench.mjs against the structural+adversarial corpus and writes per-case + summary results to docs/benchmarks/runs/. This is the verification gate that backs every measurable claim in cost-booster-edit / cost-booster-route.
bench/booster-corpus.json — confirm new cases route correctly.BENCH_ANTHROPIC=1.Run the bench from v3/ (where agent-booster resolves):
( cd v3 && node ../plugins/ruflo-cost-tracker/scripts/bench.mjs ) # booster only — free, ~85 ms
( cd v3 && BENCH_LLM_BASELINE=1 node ../plugins/ruflo-cost-tracker/scripts/bench.mjs ) # + Gemini 2.0 Flash (cheap)
( cd v3 && BENCH_LLM_BASELINE=1 BENCH_ANTHROPIC=1 \
node ../plugins/ruflo-cost-tracker/scripts/bench.mjs ) # + Sonnet 4.6 + Opus 4.7
Inspect the markdown summary printed to stdout. The gate metric is winRate (Tier 1 cases). Adversarial cases are tracked separately as escalationRate.
Persisted output lands at:
docs/benchmarks/runs/latest.json — pointer to the most recent rundocs/benchmarks/runs/<ISO-timestamp>.json — historical recordRead it back in subsequent skills (e.g. cost-report step 2 reads latest.json for live tier-spend numbers).
winRate ≥ 0.80 on Tier 1 cases (smoke step 23). Lower the threshold by editing scripts/smoke.sh.escalationRate is reported but ungated — adversarial cases are diagnostic.| Env var | Default | Purpose |
|---|---|---|
BENCH_LLM_BASELINE | unset | =1 runs the OpenAI-compat baseline |
BENCH_LLM_MODEL | models/gemini-2.0-flash | Override the OpenAI-compat model |
BENCH_LLM_BASE_URL | Gemini OpenAI shim | Override endpoint |
BENCH_ANTHROPIC | unset | =1 runs Anthropic baseline (Sonnet 4.6 + Opus 4.7) |
BENCH_ANTHROPIC_MODELS | claude-sonnet-4-6,claude-opus-4-7 | Comma-separated Claude IDs |
BENCH_OUT | timestamped file | Override output path |
BENCH_QUIET=1 | unset | Suppress markdown summary |
API keys auto-pulled from gcloud secrets (GOOGLE_AI_API_KEY, ANTHROPIC_API_KEY); override with BENCH_LLM_API_KEY / BENCH_ANTHROPIC_API_KEY.
ADR-0002 §"Decision 1" / §"Riskiest assumption" · cost-booster-edit/SKILL.md (verification table consumes this skill's output) · cost-report/SKILL.md step 2 (reads runs/latest.json).