cost-benchmark

Name: Cost Benchmark
Author: ruvnet

// Run the corpus benchmark — booster locally, optional Gemini/Sonnet/Opus baselines — and persist a verifiable measured-vs-claimed table

stars:54,444

forks:6,175

updated:May 5, 2026 at 03:35

SKILL.md

name	cost-benchmark
description	Run the corpus benchmark — booster locally, optional Gemini/Sonnet/Opus baselines — and persist a verifiable measured-vs-claimed table
argument-hint	[--llm] [--anthropic]
allowed-tools	Bash

Cost Benchmark

Runs scripts/bench.mjs against the structural and adversarial corpus and writes per-case and summary results to docs/benchmarks/runs/. This is the verification gate behind every measurable claim in cost-booster-edit and cost-booster-route.

When to use

- Before publishing a release, to verify that the booster win rate did not regress.
- After expanding bench/booster-corpus.json, to confirm that new cases route correctly.
- When auditing a "claimed upstream" tag: flip it to "verified" once the bench supports it.
- On a cost question ("is Sonnet 4.6 cheaper than Opus 4.7 for these tasks?"): re-run with BENCH_ANTHROPIC=1.

Steps

Run the bench from v3/ (where agent-booster resolves):

( cd v3 && node ../plugins/ruflo-cost-tracker/scripts/bench.mjs )                  # booster only — free, ~85 ms
( cd v3 && BENCH_LLM_BASELINE=1 node ../plugins/ruflo-cost-tracker/scripts/bench.mjs ) # + Gemini 2.0 Flash (cheap)
( cd v3 && BENCH_LLM_BASELINE=1 BENCH_ANTHROPIC=1 \
     node ../plugins/ruflo-cost-tracker/scripts/bench.mjs )                          # + Sonnet 4.6 + Opus 4.7

Inspect the markdown summary printed to stdout. The gate metric is winRate (Tier 1 cases). Adversarial cases are tracked separately as escalationRate.
Persisted output lands at:
- docs/benchmarks/runs/latest.json — pointer to the most recent run
- docs/benchmarks/runs/<ISO-timestamp>.json — historical record
Subsequent skills read this output back (e.g. cost-report step 2 reads latest.json for live tier-spend numbers).
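
As a rough illustration of how a consuming skill might read the persisted run, here is a minimal sketch. The JSON schema is not documented in this skill, so the `summary`, `winRate`, and `escalationRate` field names are assumptions, and `latestRunPath` and `headlineRates` are hypothetical helpers rather than part of bench.mjs.

```javascript
// Sketch only: the field names below (summary, winRate, escalationRate) are
// assumptions, not the confirmed schema written by bench.mjs.
const RUNS_DIR = "docs/benchmarks/runs";

// Path of the pointer file named in the step above.
function latestRunPath(runsDir = RUNS_DIR) {
  return `${runsDir}/latest.json`;
}

// Pull the two headline rates out of an already-parsed run object.
// A rate that is missing or not a finite number comes back as null,
// so callers can tell "not measured" apart from a measured 0.
function headlineRates(run) {
  const summary = (run && run.summary) || {};
  const pick = (v) => (typeof v === "number" && Number.isFinite(v) ? v : null);
  return {
    winRate: pick(summary.winRate),
    escalationRate: pick(summary.escalationRate),
  };
}
```

A caller would parse the file first, for example with `JSON.parse(fs.readFileSync(latestRunPath(), "utf8"))`, and pass the result to `headlineRates`. Returning null instead of 0 for absent fields keeps a schema mismatch from looking like a failed benchmark.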

Smoke gates

winRate ≥ 0.80 on Tier 1 cases (smoke step 23). Lower the threshold by editing scripts/smoke.sh.
escalationRate is reported but ungated — adversarial cases are diagnostic.
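
The gate logic described above can be sketched as follows. This is an illustration under stated assumptions, not the actual check in scripts/smoke.sh: the per-case fields `kind`, `won`, and `escalated` and the function `evaluateGate` are hypothetical; only the 0.80 threshold and the gated/ungated split come from this document.

```javascript
// Sketch only: per-case fields (kind, won, escalated) are hypothetical.
const WIN_RATE_THRESHOLD = 0.8; // threshold stated in the smoke gate above

// Fraction of cases matching the predicate, or null when there are no cases.
function rate(cases, predicate) {
  if (cases.length === 0) return null;
  return cases.filter(predicate).length / cases.length;
}

// winRate is computed over Tier 1 cases and gated.
// escalationRate is computed over adversarial cases and only reported.
function evaluateGate(cases, threshold = WIN_RATE_THRESHOLD) {
  const tier1 = cases.filter((c) => c.kind === "tier1");
  const adversarial = cases.filter((c) => c.kind === "adversarial");
  const winRate = rate(tier1, (c) => c.won === true);
  const escalationRate = rate(adversarial, (c) => c.escalated === true);
  return {
    winRate,
    escalationRate,
    // An empty Tier 1 set fails the gate rather than passing vacuously.
    passed: winRate !== null && winRate >= threshold,
  };
}
```

Note that `escalationRate` never feeds into `passed`, mirroring the statement that adversarial cases are diagnostic only.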

Env overrides

| Env var | Default | Purpose |
|---|---|---|
| `BENCH_LLM_BASELINE` | unset | `=1` runs the OpenAI-compat baseline |
| `BENCH_LLM_MODEL` | `models/gemini-2.0-flash` | Override the OpenAI-compat model |
| `BENCH_LLM_BASE_URL` | Gemini OpenAI shim | Override endpoint |
| `BENCH_ANTHROPIC` | unset | `=1` runs Anthropic baseline (Sonnet 4.6 + Opus 4.7) |
| `BENCH_ANTHROPIC_MODELS` | `claude-sonnet-4-6,claude-opus-4-7` | Comma-separated Claude IDs |
| `BENCH_OUT` | timestamped file | Override output path |
| `BENCH_QUIET` | unset | `=1` suppresses the markdown summary |

API keys are pulled automatically from gcloud secrets (GOOGLE_AI_API_KEY, ANTHROPIC_API_KEY); override them with BENCH_LLM_API_KEY / BENCH_ANTHROPIC_API_KEY.
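
To make the override precedence concrete, here is a minimal sketch of how these variables might be resolved. `resolveBenchConfig`, its return shape, and the `secrets` argument (standing in for values already fetched from gcloud) are hypothetical; treating only the exact value `1` as "on" is also an assumption, and bench.mjs may parse its environment differently.

```javascript
// Sketch only: resolveBenchConfig and its return shape are hypothetical.
// `secrets` stands in for values already fetched from gcloud secrets.
function resolveBenchConfig(env = process.env, secrets = {}) {
  // Assumption: a flag counts as enabled only when set to exactly "1".
  const flag = (name) => env[name] === "1";
  return {
    llmBaseline: flag("BENCH_LLM_BASELINE"),
    llmModel: env.BENCH_LLM_MODEL || "models/gemini-2.0-flash",
    // The default endpoint (the Gemini OpenAI shim) is not spelled out in
    // this document, so null here means "use the script's built-in default".
    llmBaseUrl: env.BENCH_LLM_BASE_URL || null,
    anthropic: flag("BENCH_ANTHROPIC"),
    anthropicModels: (env.BENCH_ANTHROPIC_MODELS || "claude-sonnet-4-6,claude-opus-4-7")
      .split(",")
      .map((id) => id.trim())
      .filter(Boolean),
    // null means "write to the default timestamped file".
    outPath: env.BENCH_OUT || null,
    quiet: flag("BENCH_QUIET"),
    // Explicit BENCH_* keys take precedence over the gcloud-sourced ones.
    llmApiKey: env.BENCH_LLM_API_KEY || secrets.GOOGLE_AI_API_KEY || null,
    anthropicApiKey: env.BENCH_ANTHROPIC_API_KEY || secrets.ANTHROPIC_API_KEY || null,
  };
}
```

Passing `env` and `secrets` as arguments, rather than reading globals inside the function, keeps the precedence rules easy to check without touching real credentials.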

Cross-references

ADR-0002 §"Decision 1" / §"Riskiest assumption" · cost-booster-edit/SKILL.md (verification table consumes this skill's output) · cost-report/SKILL.md step 2 (reads runs/latest.json).

package.json

"author": "ruvnet"

"repository": "ruvnet/ruflo"
