benchflow
// Run agent benchmarks, create tasks, analyze results, and manage agents using BenchFlow. Use when asked to benchmark an AI coding agent, run a benchmark suite, create tasks, view trajectories, or compare agent performance.
| name | benchflow |
| description | Run agent benchmarks, create tasks, analyze results, and manage agents using BenchFlow. Use when asked to benchmark an AI coding agent, run a benchmark suite, create tasks, view trajectories, or compare agent performance. |
| user-invocable | true |
| allowed-tools | ["Read","Write","Edit","Bash"] |
BenchFlow runs AI coding agents against tasks in sandboxed environments and scores their output via ACP (Agent Communication Protocol).
Arguments passed: $ARGUMENTS
status — show current state
uv tool list | grep benchflow
bench agent list
evaluations/ or jobs/
run <task-path> — run a single task
bench eval create \
--tasks-dir <task-path> \
--agent gemini \
--model gemini-3.1-flash-lite-preview \
--sandbox daytona
Or via Python SDK:
import asyncio

import benchflow as bf
from benchflow import RolloutConfig, Scene
from benchflow._utils.benchmark_repos import resolve_source


async def main():
    # Resolve the remote task repo, then run one agent/model pair on it.
    config = RolloutConfig(
        task_path=resolve_source("benchflow-ai/skillsbench", path="tasks/edit-pdf"),
        scenes=[Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")],
        environment="daytona",
    )
    result = await bf.run(config)
    print(f"Reward: {result.rewards}, Tools: {result.n_tool_calls}")


asyncio.run(main())
Note: resolve_source() is required for remote repos in the SDK. The CLI
handles this transparently via --source-repo / --source-path.
API keys are auto-inherited from os.environ into the sandbox.
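Because keys are read from the host environment, a quick pre-flight check can catch an unset variable before a run starts. The following is a minimal sketch, not part of BenchFlow: `missing_env_vars` and both lookup tables are hypothetical, and the variable names are taken from the agent table in this document. Agents that also accept a host login may work without the key, so treat the result as a hint rather than a hard failure.

```python
import os

# Assumption: key names mirror the agent table in this document.
# Agents whose auth is inferred from the model (e.g. opencode) are left out.
AGENT_KEYS = {
    "gemini": "GEMINI_API_KEY",
    "claude-agent-acp": "ANTHROPIC_API_KEY",
    "codex-acp": "OPENAI_API_KEY",
    "openhands": "LLM_API_KEY",
}
# Assumption: only the daytona sandbox needs a key from this table.
SANDBOX_KEYS = {"daytona": "DAYTONA_API_KEY"}


def missing_env_vars(agent, sandbox, env=None):
    """Return the expected variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    wanted = [AGENT_KEYS.get(agent), SANDBOX_KEYS.get(sandbox)]
    return [name for name in wanted if name and not env.get(name)]


if __name__ == "__main__":
    # Prints an empty list when everything expected is set.
    print(missing_env_vars("gemini", "daytona"))
```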
eval <tasks-dir> — run a benchmark suite
bench eval create \
--source-repo benchflow-ai/skillsbench \
--source-path tasks \
--agent gemini \
--model gemini-3.1-flash-lite-preview \
--sandbox daytona \
--concurrency 64
Or via YAML config:
bench eval create --config benchmarks/harvey-lab/harvey-lab-gemini-flash-lite.yaml
YAML format:
source:
  repo: benchflow-ai/skillsbench
  path: tasks
agent: gemini
model: gemini-3.1-flash-lite-preview
environment: daytona
concurrency: 64
max_retries: 1
metrics <jobs-dir> — analyze results
bench eval list jobs/
view <rollout-dir> — view a trajectory
Results are in evaluations/<eval-name>/<rollout-name>/ or jobs/<job-name>/<rollout-name>/:
rollout-dir/
├── result.json # rewards, agent, timing
├── prompts.json # prompts sent
├── trajectory/
│ └── acp_trajectory.jsonl # tool calls + agent thoughts
└── verifier/
    ├── reward.txt # reward value
    └── ctrf.json # test results
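The layout above can be read with the standard library alone. This is a minimal sketch under stated assumptions: `read_rollout` is a hypothetical helper, `reward.txt` is assumed to hold a single number, and no particular schema is assumed for `result.json` or for the trajectory events.

```python
import json
from pathlib import Path


def read_rollout(rollout_dir):
    """Collect the reward, result, and trajectory events of one rollout (hypothetical helper)."""
    root = Path(rollout_dir)
    # Assumption: reward.txt contains one numeric value, e.g. "0.5".
    reward = float((root / "verifier" / "reward.txt").read_text().strip())
    result = json.loads((root / "result.json").read_text())
    events = []
    trajectory = root / "trajectory" / "acp_trajectory.jsonl"
    if trajectory.exists():
        with trajectory.open() as fh:
            # JSONL: one JSON object per non-empty line.
            events = [json.loads(line) for line in fh if line.strip()]
    return {"reward": reward, "result": result, "events": events}
```

Calling `read_rollout("jobs/<job-name>/<rollout-name>")` returns a plain dict, so the events can be filtered or counted without any BenchFlow import.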
create-task — create a new benchmark task
bench tasks init my-task
bench tasks init my-task --no-pytest --no-solution
Quick structure:
my-task/
├── task.toml # timeouts, resources, metadata
├── instruction.md # what the agent should do
├── environment/
│ └── Dockerfile # sandbox setup
├── tests/
│ └── test.sh # verifier -> writes to /logs/verifier/reward.txt
└── solution/ # optional reference solution
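Before running a new task, it can help to confirm that the files in this structure exist. The sketch below is illustrative only: `missing_task_files` and `REQUIRED_FILES` are hypothetical names, and treating everything except `solution/` as required is an assumption (a task created with `--no-pytest` may differ).

```python
from pathlib import Path

# Assumption: every entry in the quick structure except solution/ is required.
REQUIRED_FILES = [
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "tests/test.sh",
]


def missing_task_files(task_dir):
    """Return the required files that are absent from task_dir (hypothetical helper)."""
    root = Path(task_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

An empty list means the directory matches the structure shown; it says nothing about whether the contents are valid.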
agents — list available agents
bench agent list
| Agent | Protocol | Auth |
|---|---|---|
| gemini | ACP | GEMINI_API_KEY or host login |
| claude-agent-acp (alias: claude) | ACP | ANTHROPIC_API_KEY or host login |
| codex-acp (alias: codex) | ACP | OPENAI_API_KEY or host login |
| opencode | ACP | inferred from model |
| openhands (alias: oh) | ACP | LLM_API_KEY |
| harvey-lab-harness (alias: harvey-lab) | ACP | Provider key matching model |
Any agent can be prefixed with acpx/ to run via ACPX (https://acpx.sh/):
bench eval create --tasks-dir tasks/edit-pdf --agent acpx/gemini --model gemini-3.1-flash-lite-preview --sandbox daytona
ACPX is a headless ACP client with persistent sessions and crash recovery. The underlying agent's install, env vars, credentials, and skill paths are preserved.
compare — multi-agent comparison
import asyncio

from benchflow.evaluation import Evaluation


async def main():
    for agent_name in ["claude-agent-acp", "gemini", "opencode"]:
        # Note: this config path is fixed, so every iteration runs whichever agent
        # that YAML names. Point each iteration at a config whose agent field
        # is agent_name to get a real comparison.
        eval_obj = Evaluation.from_yaml("benchmarks/harvey-lab/harvey-lab-gemini-flash-lite.yaml")
        result = await eval_obj.run()
        print(f"{agent_name}: {result.passed}/{result.total} ({result.score:.1%})")


asyncio.run(main())
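The loop above iterates over agent names but loads one fixed YAML file, so a comparison needs one config per agent. A minimal sketch of generating them follows. It uses only the fields shown in the YAML format earlier; `write_agent_configs` and `CONFIG_TEMPLATE` are hypothetical names, and nesting `repo` and `path` under `source` is an assumption about the config layout.

```python
from pathlib import Path

# Assumption: repo and path are nested under source, as the YAML format suggests.
CONFIG_TEMPLATE = """\
source:
  repo: {repo}
  path: {path}
agent: {agent}
model: {model}
environment: {environment}
concurrency: {concurrency}
max_retries: {max_retries}
"""


def write_agent_configs(out_dir, models, repo="benchflow-ai/skillsbench", path="tasks",
                        environment="daytona", concurrency=64, max_retries=1):
    """Write one YAML file per agent and return the paths (hypothetical helper).

    models maps an agent name to the model it should run.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for agent, model in models.items():
        # Agent names such as acpx/gemini contain a slash, which is not valid in a file name.
        target = out / f"{agent.replace('/', '-')}.yaml"
        target.write_text(CONFIG_TEMPLATE.format(
            repo=repo, path=path, agent=agent, model=model,
            environment=environment, concurrency=concurrency, max_retries=max_retries,
        ))
        written.append(target)
    return written
```

Each returned path can then be passed to `Evaluation.from_yaml(...)` inside the loop, or to `bench eval create --config`, so that every run uses the agent it is reported under.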
uv tool install benchflow # or: uv sync --extra dev --locked (from source)
export GEMINI_API_KEY=... # or ANTHROPIC_API_KEY, OPENAI_API_KEY, etc.
export DAYTONA_API_KEY=... # for cloud sandboxes
| Sandbox | Flag | Best for |
|---|---|---|
| docker | --sandbox docker | Local dev, small runs (<=10 tasks) |
| daytona | --sandbox daytona | Cloud runs with concurrency (needs DAYTONA_API_KEY) |
| modal | --sandbox modal | Serverless, high concurrency (needs Modal auth) |
Use daytona for benchmarks. Docker is limited by network exhaustion.
Two approaches for deploying skills:
COPY skills /root/.claude/skills
--skills-dir
bench eval create \
--tasks-dir task-dir \
--agent claude-agent-acp \
--sandbox daytona \
--skills-dir skills/ \
--agent-env BENCHFLOW_SKILL_NUDGE=name
Skills are uploaded to /skills/ in the sandbox and symlinked to agent-specific paths.
gemini-3.1-flash-lite-preview for testing. Use Pro/Sonnet for real benchmarks.
jobs_dir skips completed tasks.
None in prompts list gets replaced with instruction.md content.
0.5 to reward.txt).
--agent-env GEMINI_API_KEY=... in CLI; SDK auto-inherits from os.environ.