bench
// Run OhMyCode benchmarks — score any provider/model with token tracking. Use when user wants to benchmark, evaluate, test performance, or compare models.
Generate high-quality tests for OhMyCode modules. Use when user wants to create, add, or generate tests for a module or file.
Run tests and analyze results for OhMyCode. Use when user wants to run, check, or verify tests — or after any code change.
Guide for debugging OhMyCode issues. Use when user reports errors, unexpected behavior, or connection problems.
Guide for adding a new feature to OhMyCode. Use when user wants to add functionality that goes beyond existing extension points (tools/providers). Always start by reading docs/DEVELOPMENT_GUIDE.md.
Guide for adding a new LLM provider to OhMyCode. Use when user wants to connect a new AI model backend.
Guide for adding a new tool to OhMyCode. Use when user wants to create a custom tool.
| name | bench |
| description | Run OhMyCode benchmarks — score any provider/model with token tracking. Use when user wants to benchmark, evaluate, test performance, or compare models. |
One-command benchmarking: run 8 SWE-bench-style coding tasks through OhMyCode, track token usage (in/out), and produce a scorecard.
Works with any provider and model — uses whatever is configured in ~/.ohmycode/config.json or overridden via CLI args.
$ARGUMENTS — optional filters and overrides.
| Argument | Example | Effect |
|---|---|---|
| (empty) | /bench | Full suite, current config |
| task filter | /bench fib,bug | Only matching tasks |
| --dry-run | /bench --dry-run | Validate task definitions without calling the LLM |
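As a rough illustration of how a task filter such as `fib,bug` could pick tasks, the sketch below assumes comma-separated, case-insensitive substring matching against task names. The function name `select_tasks` and the matching rule are assumptions for illustration and are not taken from benchmarks/run_bench.py. The task names are the ones listed in the suite table in this document.

```python
def select_tasks(names, task_filter=""):
    """Return the task names matching a comma-separated filter.

    Assumption: each comma-separated term is matched as a case-insensitive
    substring of the task name; an empty filter selects every task.
    """
    terms = [t.strip().lower() for t in task_filter.split(",") if t.strip()]
    if not terms:
        return list(names)
    return [n for n in names if any(t in n.lower() for t in terms)]


# Task names as listed in the benchmark suite table.
SUITE_NAMES = [
    "fibonacci", "bug-fix-round", "test-generation", "refactor-preserve",
    "grep-replace", "stack-module", "type-error-fix", "code-comprehension",
]

print(select_tasks(SUITE_NAMES, "fib,bug"))
```

Under this assumed rule, `fib,bug` selects `fibonacci` and `bug-fix-round` only. If the real harness also matches on category, `type-error-fix` would be included as well.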
python3 benchmarks/run_bench.py $ARGUMENTS 2>&1 | tee bench_run.log
# Test with a different model
python3 benchmarks/run_bench.py --provider openai --model gpt-4o-mini 2>&1 | tee bench_run.log
# Test with Anthropic
python3 benchmarks/run_bench.py --provider anthropic --model claude-sonnet-4-20250514 2>&1 | tee bench_run.log
# Test with custom endpoint
python3 benchmarks/run_bench.py --base-url http://localhost:8080/v1 --api-key test 2>&1 | tee bench_run.log
The harness outputs:
- bench_results.json: machine-readable results for comparison

Key metrics to report:
If any tasks failed:
- Check the reason column in the report
- Check bench_results.json for the error field
- Re-run /bench (closed-loop)

| # | Task | Category | What It Tests |
|---|---|---|---|
| 1 | fibonacci | code-gen | Create a function from spec |
| 2 | bug-fix-round | bug-fix | Find and fix an off-by-one error |
| 3 | test-generation | test-gen | Write tests for existing code |
| 4 | refactor-preserve | refactor | Improve code without breaking tests |
| 5 | grep-replace | tool-use | Multi-file search and replace |
| 6 | stack-module | code-gen | Create module + tests from scratch |
| 7 | type-error-fix | bug-fix | Fix a TypeError in existing code |
| 8 | code-comprehension | comprehension | Read code and explain the algorithm |
Edit benchmarks/suite.py:
BenchTask(
    name="your-task",
    category="bug-fix",
    prompt="The task description for the agent...",
    setup=lambda d: (d / "code.py").write_text("..."),  # prepare files
    validate=lambda d: (True, "reason"),  # check result
    max_turns=10,
)
Then append it to the BENCH_SUITE list.
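The template above uses placeholder lambdas. A fuller task might look like the sketch below, which swaps them for named `setup` and `validate` functions that really check the agent's work. The `BenchTask` dataclass here is a stand-in so the example runs on its own; its fields mirror the template, but the real definition lives in benchmarks/suite.py and may differ. The task name `last-element-fix` and the helper functions are hypothetical.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Tuple


@dataclass
class BenchTask:
    """Stand-in for the real class in benchmarks/suite.py (fields assumed)."""
    name: str
    category: str
    prompt: str
    setup: Callable[[Path], object]
    validate: Callable[[Path], Tuple[bool, str]]
    max_turns: int = 10


BUGGY = "def last(items):\n    return items[len(items)]\n"


def setup_last(workdir: Path) -> None:
    # Prepare the buggy file the agent is asked to fix.
    (workdir / "code.py").write_text(BUGGY)


def validate_last(workdir: Path) -> Tuple[bool, str]:
    # Execute the (possibly edited) file and check its behaviour.
    namespace = {}
    try:
        exec((workdir / "code.py").read_text(), namespace)
        result = namespace["last"]([1, 2, 3])
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    if result == 3:
        return True, "last() returns the final element"
    return False, f"expected 3, got {result!r}"


task = BenchTask(
    name="last-element-fix",
    category="bug-fix",
    prompt="Fix the off-by-one error in code.py so last() returns the final element.",
    setup=setup_last,
    validate=validate_last,
)

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    task.setup(workdir)
    print(task.validate(workdir))  # unfixed code: validation fails
    (workdir / "code.py").write_text("def last(items):\n    return items[-1]\n")
    print(task.validate(workdir))  # after a correct fix: validation passes
```

Returning a specific reason string on failure makes the report easier to read when a task does not pass. Note that `exec` runs agent-written code inside the validating process, so a harness concerned with isolation might prefer running the file in a subprocess instead.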
To compare two models:
python3 benchmarks/run_bench.py --model gpt-4o 2>&1 | tee bench_gpt4o.log
cp bench_results.json bench_gpt4o.json
python3 benchmarks/run_bench.py --model gpt-4o-mini 2>&1 | tee bench_mini.log
cp bench_results.json bench_mini.json
Then compare the JSON files for score and token efficiency.
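One way to do that comparison is sketched below. The layout of bench_results.json is not documented here, so the schema used (a `tasks` list whose entries carry `passed`, `tokens_in`, and `tokens_out`) is purely an assumption. Adjust the key names to match what your own results file contains. The function names `summarize` and `compare` are likewise hypothetical.

```python
import json
from pathlib import Path


def summarize(results: dict) -> dict:
    """Collapse one results document into a pass count and token totals.

    Assumed schema (hypothetical; check your own bench_results.json):
    {"tasks": [{"name": str, "passed": bool,
                "tokens_in": int, "tokens_out": int}, ...]}
    """
    tasks = results["tasks"]
    passed = sum(1 for t in tasks if t["passed"])
    tokens = sum(t["tokens_in"] + t["tokens_out"] for t in tasks)
    return {
        "passed": passed,
        "total": len(tasks),
        "tokens": tokens,
        # Tokens spent per passing task; None when nothing passed.
        "tokens_per_pass": tokens / passed if passed else None,
    }


def compare(path_a: str, path_b: str) -> None:
    # Print one summary line per results file for a side-by-side read.
    for path in (path_a, path_b):
        s = summarize(json.loads(Path(path).read_text()))
        print(f"{path}: {s['passed']}/{s['total']} passed, "
              f"{s['tokens']} tokens, tokens/pass={s['tokens_per_pass']}")
```

With the two files saved by the commands above, the call would be `compare("bench_gpt4o.json", "bench_mini.json")`. Tokens per passing task is only one possible efficiency measure; a model that passes more tasks at a higher token cost may still be the better choice.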
- /run-tests: run unit tests (the benchmark runs these too as Phase 1)
- /gen-tests: generate tests (task #3 in the benchmark tests this capability)