benchflow
| name | benchflow |
| description | Run agent benchmarks, create tasks, analyze results, and manage agents using BenchFlow. Use when asked to benchmark an AI coding agent, run a benchmark suite, create tasks, view trajectories, or compare agent performance. |
| user-invocable | true |
| allowed-tools | ["Read","Write","Edit","Bash"] |
BenchFlow runs AI coding agents against tasks in sandboxed environments and scores their output. It combines Harbor (environments, verifier) with ACP (multi-turn agent communication).
Arguments passed: $ARGUMENTS
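status — show current state

- Check that benchflow is installed: uv tool list | grep benchflow
- Check that .env exists with API keys
- List available agents: benchflow agents
- Check for recent runs under jobs/

run <task-path> — run a single task

source .env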
benchflow run --tasks-dir <task-path> --agent claude-agent-acp --sandbox daytona --model claude-haiku-4-5-20251001
Or via SDK:
import asyncio
from benchflow import SDK

async def main():
    sdk = SDK()
    result = await sdk.run(
        task_path="<task-path>",
        agent="claude-agent-acp",
        model="claude-haiku-4-5-20251001",
        environment="daytona",
    )
    print(f"Reward: {result.rewards}, Tools: {result.n_tool_calls}")

asyncio.run(main())
API keys are auto-inherited from os.environ. No need to pass agent_env.
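
Because keys are read from the process environment, it can help to confirm they are set before starting a run. The following is a minimal sketch, not part of BenchFlow: `missing_keys` and `REQUIRED_KEYS` are hypothetical names, and the two key names are the ones this document lists for `.env`.

```python
import os

# Key names taken from the setup notes in this document; adjust for your agent.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "DAYTONA_API_KEY")


def missing_keys(required=REQUIRED_KEYS, environ=None):
    """Return the names in `required` that are unset or empty in `environ`.

    `environ` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]
```

Calling `missing_keys()` before `sdk.run(...)` and aborting when the list is non-empty gives a clearer error than a failed sandbox start.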
job <tasks-dir> — run a benchmark suite

benchflow job --tasks-dir <tasks-dir> --agent claude-agent-acp --sandbox daytona --concurrency 64
Or via YAML config:
benchflow job --config examples/configs/tb2-haiku.yaml
YAML format (benchflow-native):
source:
  repo: harbor-framework/terminal-bench-2
jobs_dir: jobs/tb2-haiku
agent: claude-agent-acp
model: claude-haiku-4-5-20251001
environment: daytona
concurrency: 64
max_retries: 1
Harbor-compatible YAML also works:
jobs_dir: jobs
n_attempts: 2
orchestrator:
  n_concurrent_trials: 8
environment:
  type: daytona
agents:
  - name: claude-agent-acp
    model_name: anthropic/claude-haiku-4-5-20251001
datasets:
  - path: harbor-framework/terminal-bench-2
Multi-turn (adds a recheck prompt):
source:
  repo: harbor-framework/terminal-bench-2
jobs_dir: jobs/tb2-multiturn
agent: claude-agent-acp
model: claude-haiku-4-5-20251001
environment: daytona
concurrency: 64
prompts:
- null # uses instruction.md
- "Review your solution. Check for errors, test it, and fix any issues."
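
The `null` entry in the prompts list stands in for the task's instruction.md content. A minimal sketch of that substitution follows; `resolve_prompts` is a hypothetical helper written for illustration and is not BenchFlow's own code.

```python
from typing import List, Optional


def resolve_prompts(prompts: List[Optional[str]], instruction: str) -> List[str]:
    """Replace each None entry with the task instruction text.

    YAML `null` loads as Python None, so a prompts list of
    [None, "Review ..."] becomes [instruction, "Review ..."].
    """
    return [instruction if prompt is None else prompt for prompt in prompts]
```

With the config above, the first turn would carry the instruction text and the second turn the review prompt.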
metrics <jobs-dir> — analyze results

benchflow metrics jobs/tb2-haiku/
benchflow metrics jobs/tb2-haiku/ --json
SDK:
from benchflow import collect_metrics
metrics = collect_metrics("jobs/tb2-haiku", benchmark="TB2", agent="claude-agent-acp")
print(metrics.summary())
view <trial-dir> — view a trajectory

benchflow view jobs/tb2-haiku/<trial-name>/
Opens an HTML viewer at http://localhost:8888.
create-task — create a new benchmark task

See skills/benchflow/references/create-task.md for the full guide.
Quick structure:
my-task/
├── task.toml # timeouts, resources, metadata
├── instruction.md # what the agent should do
├── environment/
│ └── Dockerfile # sandbox setup
├── tests/
│ └── test.sh # verifier → writes to /logs/verifier/reward.txt
└── solution/ # optional reference solution
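
The layout above can be created with a few lines of Python. This is a sketch only: `scaffold_task` is a hypothetical helper, and it creates empty placeholder files because the required contents of task.toml and test.sh are covered in the full guide, not here.

```python
from pathlib import Path


def scaffold_task(root):
    """Create the empty task skeleton shown above and return its root path."""
    root = Path(root)
    # Directories from the tree: environment/, tests/, and the optional solution/.
    (root / "environment").mkdir(parents=True, exist_ok=True)
    (root / "tests").mkdir(exist_ok=True)
    (root / "solution").mkdir(exist_ok=True)
    # Empty placeholder files; fill them in by following the create-task guide.
    for rel in ("task.toml", "instruction.md", "environment/Dockerfile", "tests/test.sh"):
        (root / rel).touch()
    return root
```

For example, `scaffold_task("my-task")` produces the directory tree shown in the quick structure, ready to be filled in.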
agents — list available agents

benchflow agents
| Agent | Status | Skills / Notes |
|---|---|---|
| claude-agent-acp | Working | ~/.claude/skills/ |
| pi-acp | Working | ~/.claude/skills/ |
| openclaw | Working (via shim) | copies to <workspace>/skills/ |
| codex-acp | Registered | needs OPENAI_API_KEY |
| gemini | Registered | needs GOOGLE_API_KEY |
compare — multi-agent comparison

import asyncio
from benchflow import Job, JobConfig

async def main():
    for agent in ["claude-agent-acp", "pi-acp", "openclaw"]:
        job = Job(
            tasks_dir="path/to/tasks",
            jobs_dir=f"jobs/compare-{agent}",
            config=JobConfig(agent=agent, environment="daytona", concurrency=64),
        )
        result = await job.run()
        print(f"{agent}: {result.passed}/{result.total} ({result.score:.1%})")

asyncio.run(main())
uv tool install benchflow # or: uv tool install -e . (from source)
source .env # ANTHROPIC_API_KEY, DAYTONA_API_KEY
| Environment | Concurrency | Setup |
|---|---|---|
| daytona | 64+ | Set DAYTONA_API_KEY in .env |
| docker | ~4 | Docker must be running locally |
Use daytona for benchmarks. Docker is limited by network exhaustion.
SkillsBench tasks bake skills into Docker images:
COPY skills /root/.claude/skills
- claude-agent-acp / pi-acp: auto-discover ~/.claude/skills/
- openclaw: shim copies from .claude/skills/ → <workspace>/skills/

jobs/{job_name}/{trial_name}/
├── result.json # rewards, agent, timing
├── prompts.json # prompts sent
├── trajectory/
│ └── acp_trajectory.jsonl # tool calls + agent thoughts
└── verifier/
├── reward.txt # reward value
└── ctrf.json # test results
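
Given this layout, per-trial rewards can be gathered with the standard library alone. The sketch below assumes each reward.txt holds a single number, which the source does not state explicitly; `collect_rewards` and `mean_reward` are hypothetical helpers and are not the `collect_metrics` function from the SDK.

```python
from pathlib import Path


def collect_rewards(job_dir):
    """Map trial name -> reward parsed from <trial>/verifier/reward.txt."""
    rewards = {}
    for reward_file in sorted(Path(job_dir).glob("*/verifier/reward.txt")):
        trial_name = reward_file.parent.parent.name
        try:
            rewards[trial_name] = float(reward_file.read_text().strip())
        except ValueError:
            # Assumption: skip files that do not contain a single number.
            continue
    return rewards


def mean_reward(rewards):
    """Average reward across trials, or None when no trial has a reward."""
    return sum(rewards.values()) / len(rewards) if rewards else None
```

This is useful for quick sanity checks; `benchflow metrics` remains the documented way to analyze a job.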
- Use claude-haiku-4-5-20251001 for testing. Use Sonnet for real benchmarks.
- Re-running with the same jobs_dir skips completed tasks.
- None in the prompts list gets replaced with the instruction.md content.
- Rewards can be fractional (e.g. write 0.5 to reward.txt).
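
A fractional value such as 0.5 in reward.txt suggests a verifier can award partial credit. The sketch below shows one possible scoring rule, the fraction of checks passed. That rule is an assumption and `write_partial_reward` is a hypothetical helper; in a real task the verifier is tests/test.sh, which would write to /logs/verifier/reward.txt.

```python
from pathlib import Path


def write_partial_reward(passed, total, reward_path):
    """Write passed/total to reward_path and return the value written.

    Writes 0.0 when there are no checks, to avoid dividing by zero.
    """
    reward = passed / total if total else 0.0
    path = Path(reward_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"{reward}\n")
    return reward
```

Inside a sandbox this could be called as `write_partial_reward(passed, total, "/logs/verifier/reward.txt")` from a script that test.sh invokes.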