Name: Benchflow
Author: benchflow-ai

// Run agent benchmarks, create tasks, analyze results, and manage agents using BenchFlow. Use when asked to benchmark an AI coding agent, run a benchmark suite, create tasks, view trajectories, or compare agent performance.


stars:242

forks:29

updated: 2026-05-17 08:54


package.json

"author": "benchflow-ai"

"repository": "benchflow-ai/benchflow"


name	benchflow
description	Run agent benchmarks, create tasks, analyze results, and manage agents using BenchFlow. Use when asked to benchmark an AI coding agent, run a benchmark suite, create tasks, view trajectories, or compare agent performance.
user-invocable	true
allowed-tools	["Read","Write","Edit","Bash"]

BenchFlow — Agent Benchmarking

BenchFlow runs AI coding agents against tasks in sandboxed environments and scores their output. It combines Harbor (environments, verifier) with ACP (multi-turn agent communication).

Arguments passed: $ARGUMENTS

Dispatch on arguments

No args or `status` — show current state

Check if benchflow is installed: uv tool list | grep benchflow
Check if .env exists with API keys
Check available agents: benchflow agents
Show recent job results if any exist in jobs/
Point to next action based on state
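
The status checks above can be approximated in Python. This is a minimal sketch, not BenchFlow code: `summarize_state` and `next_action` are hypothetical helpers, `shutil.which` stands in for the `uv tool list | grep benchflow` check, and the suggested next steps are illustrative wording only.

```python
import shutil
from pathlib import Path


def summarize_state(project_dir: str) -> dict:
    """Collect the facts the status checks look at, without running anything."""
    root = Path(project_dir)
    jobs_dir = root / "jobs"
    # Only the names of job directories are listed; their contents are not inspected.
    jobs = sorted(p.name for p in jobs_dir.iterdir() if p.is_dir()) if jobs_dir.is_dir() else []
    return {
        # Assumption: an installed tool exposes a `benchflow` executable on PATH.
        "benchflow_on_path": shutil.which("benchflow") is not None,
        "env_file_present": (root / ".env").is_file(),
        "jobs": jobs,
    }


def next_action(state: dict) -> str:
    """Hypothetical mapping from the collected state to a suggested next step."""
    if not state["benchflow_on_path"]:
        return "install benchflow"
    if not state["env_file_present"]:
        return "create .env with API keys"
    if not state["jobs"]:
        return "run a task or job"
    return "inspect metrics for existing jobs"
```

Calling `next_action(summarize_state("."))` from the project root gives a one-line suggestion; it does not replace `benchflow agents`, which still has to be run to list agents.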

`run <task-path>` — run a single task

source .env
benchflow run --tasks-dir <task-path> --agent claude-agent-acp --sandbox daytona --model claude-haiku-4-5-20251001

Or via SDK:

import asyncio
from benchflow import SDK

async def main():
    sdk = SDK()
    result = await sdk.run(
        task_path="<task-path>",
        agent="claude-agent-acp",
        model="claude-haiku-4-5-20251001",
        environment="daytona",
    )
    print(f"Reward: {result.rewards}, Tools: {result.n_tool_calls}")

asyncio.run(main())

API keys are auto-inherited from os.environ. No need to pass agent_env.
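
Because keys are read from `os.environ`, a Python script that cannot rely on `source .env` needs to populate the environment itself before calling the SDK. The sketch below is one way to do that under simple assumptions: `load_env_file` is a hypothetical helper, it handles only plain `KEY=VALUE` lines (optionally prefixed with `export`), and it is not part of BenchFlow.

```python
import os
from pathlib import Path


def load_env_file(path: str, override: bool = False) -> dict:
    """Parse simple KEY=VALUE lines and export them into os.environ.

    Returns only the keys that were actually set by this call.
    """
    loaded = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Existing variables win unless override is requested.
        if override or key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Running `load_env_file(".env")` before `SDK()` would then make keys such as `ANTHROPIC_API_KEY` and `DAYTONA_API_KEY` visible to the process.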

`job <tasks-dir>` — run a benchmark suite

benchflow job --tasks-dir <tasks-dir> --agent claude-agent-acp --sandbox daytona --concurrency 64

Or via YAML config:

benchflow job --config examples/configs/tb2-haiku.yaml

YAML format (benchflow-native):

source:
  repo: harbor-framework/terminal-bench-2
jobs_dir: jobs/tb2-haiku
agent: claude-agent-acp
model: claude-haiku-4-5-20251001
environment: daytona
concurrency: 64
max_retries: 1

Harbor-compatible YAML also works:

jobs_dir: jobs
n_attempts: 2
orchestrator:
  n_concurrent_trials: 8
environment:
  type: daytona
agents:
  - name: claude-agent-acp
    model_name: anthropic/claude-haiku-4-5-20251001
datasets:
  - path: harbor-framework/terminal-bench-2

Multi-turn (adds a recheck prompt):

source:
  repo: harbor-framework/terminal-bench-2
jobs_dir: jobs/tb2-multiturn
agent: claude-agent-acp
model: claude-haiku-4-5-20251001
environment: daytona
concurrency: 64
prompts:
  - null  # uses instruction.md
  - "Review your solution. Check for errors, test it, and fix any issues."
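
The `null` entry in the multi-turn config is a placeholder for the task's own instruction. A minimal sketch of that substitution, assuming the prompts have already been parsed into a Python list; `resolve_prompts` is a hypothetical name that illustrates the documented behaviour rather than BenchFlow's implementation:

```python
from pathlib import Path
from typing import Optional


def resolve_prompts(prompts: list[Optional[str]], task_dir: str) -> list[str]:
    """Replace each None entry with the contents of <task_dir>/instruction.md."""
    instruction = Path(task_dir, "instruction.md").read_text()
    return [instruction if p is None else p for p in prompts]
```

With the config above, the first turn would carry the text of `instruction.md` and the second turn the literal recheck prompt.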

`metrics <jobs-dir>` — analyze results

benchflow metrics jobs/tb2-haiku/
benchflow metrics jobs/tb2-haiku/ --json

SDK:

from benchflow import collect_metrics
metrics = collect_metrics("jobs/tb2-haiku", benchmark="TB2", agent="claude-agent-acp")
print(metrics.summary())

`view <trial-dir>` — view a trajectory

benchflow view jobs/tb2-haiku/<trial-name>/

Opens an HTML viewer at http://localhost:8888.

`create-task` — create a new benchmark task

See skills/benchflow/references/create-task.md for the full guide.

Quick structure:

my-task/
├── task.toml          # timeouts, resources, metadata
├── instruction.md     # what the agent should do
├── environment/
│   └── Dockerfile     # sandbox setup
├── tests/
│   └── test.sh        # verifier → writes to /logs/verifier/reward.txt
└── solution/          # optional reference solution
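
The layout above can be scaffolded with a short script. This is a sketch only: `scaffold_task` is a hypothetical helper, `task.toml` and `Dockerfile` are created empty because their fields are covered in the create-task guide rather than here, and the verifier is a placeholder that always writes a reward of 0.

```python
from pathlib import Path

# Placeholder verifier: replace the body with a real check before use.
VERIFIER_SCRIPT = """#!/bin/bash
mkdir -p /logs/verifier
echo 0 > /logs/verifier/reward.txt
"""


def scaffold_task(root: str, name: str) -> Path:
    """Create an empty task skeleton matching the quick structure."""
    task = Path(root) / name
    (task / "environment").mkdir(parents=True, exist_ok=True)
    (task / "tests").mkdir(exist_ok=True)
    (task / "solution").mkdir(exist_ok=True)  # optional reference solution
    (task / "task.toml").touch()  # timeouts, resources, metadata: fill in by hand
    (task / "instruction.md").write_text("Describe what the agent should do.\n")
    (task / "environment" / "Dockerfile").touch()  # sandbox setup: fill in by hand
    test_sh = task / "tests" / "test.sh"
    test_sh.write_text(VERIFIER_SCRIPT)
    test_sh.chmod(0o755)
    return task
```

For example, `scaffold_task(".", "my-task")` creates `my-task/` in the current directory, ready to be edited.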

`agents` — list available agents

benchflow agents

Agent	Status	Skills path / notes
`claude-agent-acp`	Working	`~/.claude/skills/`
`pi-acp`	Working	`~/.claude/skills/`
`openclaw`	Working (via shim)	copies to `<workspace>/skills/`
`codex-acp`	Registered	needs OPENAI_API_KEY
`gemini`	Registered	needs GOOGLE_API_KEY

`compare` — multi-agent comparison

import asyncio
from benchflow import Job, JobConfig

async def main():
    for agent in ["claude-agent-acp", "pi-acp", "openclaw"]:
        job = Job(
            tasks_dir="path/to/tasks",
            jobs_dir=f"jobs/compare-{agent}",
            config=JobConfig(agent=agent, environment="daytona", concurrency=64),
        )
        result = await job.run()
        print(f"{agent}: {result.passed}/{result.total} ({result.score:.1%})")

asyncio.run(main())

Setup

uv tool install benchflow    # or: uv tool install -e . (from source)
source .env              # ANTHROPIC_API_KEY, DAYTONA_API_KEY

Environments

Environment	Concurrency	Setup
`daytona`	64+	Set `DAYTONA_API_KEY` in `.env`
`docker`	~4	Docker must be running locally

Use daytona for benchmarks. Local Docker tops out at roughly 4 concurrent trials because of network exhaustion.
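
To make the concurrency column concrete, the sketch below shows the general pattern of capping the number of trials in flight with `asyncio.Semaphore`. It illustrates the idea only and makes no claim about how BenchFlow schedules trials internally; `run_bounded` and the `worker` callback are hypothetical names.

```python
import asyncio


async def run_bounded(items, worker, concurrency: int):
    """Run worker(item) for every item with at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(item):
        # Each task waits here until one of the `concurrency` slots is free.
        async with sem:
            return await worker(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(i) for i in items))
```

A limit of 4 would mirror the local Docker row of the table, and 64 the daytona row.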

Skills in tasks

SkillsBench tasks bake skills into Docker images:

COPY skills /root/.claude/skills

claude-agent-acp / pi-acp: auto-discover ~/.claude/skills/
openclaw: shim copies from .claude/skills/ → <workspace>/skills/
Skills must be loaded from the environment, never injected into prompts.

Output structure

jobs/{job_name}/{trial_name}/
├── result.json              # rewards, agent, timing
├── prompts.json             # prompts sent
├── trajectory/
│   └── acp_trajectory.jsonl # tool calls + agent thoughts
└── verifier/
    ├── reward.txt           # reward value
    └── ctrf.json            # test results
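
Given this layout, rewards can also be tallied by hand when `benchflow metrics` is not available. The following is a rough sketch under stated assumptions: each `reward.txt` holds a single number, a trial counts as passed when its reward reaches `pass_threshold` (the default of 1.0 is an assumption, not a BenchFlow rule), and `summarize_rewards` is a hypothetical helper.

```python
from pathlib import Path


def summarize_rewards(job_dir: str, pass_threshold: float = 1.0) -> dict:
    """Aggregate verifier/reward.txt across every trial directory of one job."""
    rewards = []
    for reward_file in sorted(Path(job_dir).glob("*/verifier/reward.txt")):
        text = reward_file.read_text().strip()
        if text:  # ignore empty files from trials that never reached the verifier
            rewards.append(float(text))
    total = len(rewards)
    return {
        "total": total,
        "passed": sum(r >= pass_threshold for r in rewards),
        "mean_reward": sum(rewards) / total if total else 0.0,
    }
```

For example, `summarize_rewards("jobs/tb2-haiku")` would scan every trial directory under that job.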

Tips

Use claude-haiku-4-5-20251001 for testing. Use Sonnet for real benchmarks.
Jobs resume — re-running the same jobs_dir skips completed tasks.
A None entry (null in YAML) in the prompts list is replaced with the contents of instruction.md.
Partial rewards work (verifier can write 0.5 to reward.txt).
