원클릭으로
woz-benchmark
// Compare WOZCODE vs vanilla Claude Code on the user's codebase — real cost, turn, and time savings. TRIGGER on "compare woz", "how much does woz save", "benchmark woz", "woz vs claude", "show me savings", or /woz-benchmark.
// Compare WOZCODE vs vanilla Claude Code on the user's codebase — real cost, turn, and time savings. TRIGGER on "compare woz", "how much does woz save", "benchmark woz", "woz vs claude", "show me savings", or /woz-benchmark.
Report a WOZCODE bug. Same backend as /woz-feedback, tagged for bug triage. Session context (current session id, anonymous id, OS, arch, Node version) is auto-attached.
Share feedback about WOZCODE — feature requests, general thoughts, anything that's working or not. For broken-behavior reports use /woz-bug (same backend, bug-tagged).
Compare WOZCODE vs vanilla Claude Code on the user's codebase — real cost, turn, and time savings. TRIGGER on "compare woz", "how much does woz save", "benchmark woz", "woz vs claude", "show me savings", or /woz-benchmark.
Authenticate with the Woz service. Use when the user needs to log in or when authentication is required.
Clear stored Woz credentials and log out.
Show WOZCODE savings report - calls saved, time saved, tokens saved, and lifetime totals.
| name | woz-benchmark |
| description | Compare WOZCODE vs vanilla Claude Code on the user's codebase — real cost, turn, and time savings. TRIGGER on "compare woz", "how much does woz save", "benchmark woz", "woz vs claude", "show me savings", or /woz-benchmark. |
| allowed-tools | Bash(node *), Bash(git *), Bash(ls *), Bash(test *), Bash(mkdir *), Bash(date *), Write, Read |
Run a side-by-side comparison of WOZCODE vs vanilla Claude Code on the user's own codebase. Each prompt runs twice against a fresh copy of the repo with git reset --hard between runs, so the target MUST be a clean git repo.
TRIGGER: "compare woz", "how much does woz save", "benchmark woz", "woz vs claude", "show me the savings", "is woz worth it", or /woz-benchmark.
/woz-login).Ask for all three in ONE short message (< 10 lines). Do not re-explain what the benchmark does — the user already invoked it.
.env)? Skip if the repo is self-contained."Do NOT ask about the model. Default to opus in the YAML config. Only switch to sonnet or haiku if the user volunteers a different choice in their answer (e.g. "use sonnet" or "try it on haiku").
Keep examples OUT of the user message unless they ask for help picking prompts. The user doesn't need 4 bullet points of good-vs-bad prompt examples.
Before writing any config, verify the target is usable:
test -d <target>
git -C <target> rev-parse --git-dir
git -C <target> status --porcelain
If the directory doesn't exist, isn't a git repo, or has uncommitted changes, STOP and tell the user how to fix it (e.g. "please commit or stash your changes — the benchmark resets the repo between runs and would lose your work").
Use the Write tool to create a YAML file at /tmp/woz-benchmark-<timestamp>.yaml (get the timestamp from date +%s). Format:
model: opus
maxTurns: 15
prompts:
- "first prompt from the user"
- "second prompt from the user"
setup:
commands:
- "curl -L https://example.com/dataset.csv -o data/sample.csv"
- "psql $DATABASE_URL -f seed.sql"
Notes:
model: opus. Only use a different model if the user volunteered one in their answer.\".setup: block if the user didn't give any environment setup commands.maxTurns: 15 as a safety cap so a single prompt can't run away.One-line warning: "This'll take several minutes — each prompt runs twice." Then run:
node --no-warnings=ExperimentalWarning ${CLAUDE_PLUGIN_ROOT}/scripts/benchmark.js --target <target> --config <yaml-path> --user-env
--user-env loads the user's project CLAUDE.md hierarchy on BOTH sides. Do NOT pass --screenshots, --codex, --judge, or --trace.
The benchmark prints a detailed text report at the end. Relay the full report to the user, then add a clear, sales-oriented savings summary at the top. Compute the deltas from the report's totals:
💰 WOZCODE Savings Summary
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Cost saved: $X.XX (Y% cheaper)
Tokens saved: X,XXX (Y% fewer)
Turns saved: N (Y% fewer)
Time saved: X min (Y% faster)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Frame the numbers positively. If WOZCODE was slower or more expensive on a specific prompt, call it out honestly but note the aggregate.
Finally, tell the user where the detailed JSON report was saved (the benchmark prints this path).
/tmp — the OS cleans it up.