docs-review
// Review benchflow documentation for drift, staleness, duplication, and alignment
| name | docs-review |
| description | Review benchflow documentation for drift, staleness, duplication, and alignment |
| user-invocable | true |
Review the repo's documentation against the current codebase and surface a punch list. Do not auto-fix. Report findings; let the user approve edits.
The user may say /docs-review with an optional argument:
- /docs-review — full review across all in-scope docs
- /docs-review <path> — single doc (e.g. /docs-review README.md)
- /docs-review --drift — fast subset: drift-vs-code + stale refs only

- docs/architecture.md
- docs/cli-reference.md
- docs/task-authoring.md
- docs/getting-started.md
- docs/labs.md
- README.md
- AGENTS.md

- .dev-docs/sdk-reference.md — internal SDK surface; verify class/function names + signatures still resolve in src/benchflow/.
- .dev-docs/harden-sandbox.md — sandbox hardening notes; verify referenced files / knobs / env vars still exist.
- .dev-docs/tested-agents.md — matrix of agent × model × provider; verify names still appear in agents/registry.py and agents/providers.py.

- .dev-docs/sdk-refactor-notes.md — dated refactor record (April 2026); historical, status language is expected. Do not flag or edit.
- *-notes.md, *-archive.md.
- .smoke-jobs/, trajectories/, examples/, fixtures/ — generated or sample output, not documentation.

Project-structure trees, module one-liners, env var names, registry entries. Cross-check:
- Project-structure trees: ls src/benchflow/, ls src/benchflow/agents/, ls src/benchflow/acp/, ls src/benchflow/cli/ against trees in architecture.md and Key modules blocks in README/AGENTS.
- Module one-liners: does sdk.py still own what the doc claims? Does job.py still drive the run loop? Spot-check first ~40 lines of each named module.
- Registry entries: src/benchflow/agents/registry.py and src/benchflow/agents/providers.py. A name in docs but not in the registry dict → stale; a name in the registry but not documented where expected (docs/architecture.md matrix, .dev-docs/tested-agents.md) → gap.
- Env var names (ANTHROPIC_API_KEY, OPENAI_API_KEY, GROQ_API_KEY, BENCHFLOW_*, etc.) — still referenced in src/benchflow/?
- pyproject.toml — python version pin, dep names, extras. Verify Setup / Install blocks in README and docs/getting-started.md.

Grep each doc for file paths, function names, class names, CLI commands, task/agent IDs. For each hit, verify it resolves in the current tree:
- File paths → ls (watch for renames — e.g. a file split into a package, a private module prefix added like _acp_run.py).
- Function / class names (register_agent, SDK, RunResult, detect_services_from_dockerfile, …) → Grep in src/benchflow/ and the __init__.py re-exports.
- CLI commands (benchflow run, benchflow ls, benchflow view, …) → check the Typer app in src/benchflow/cli/.
- Task / agent IDs → examples/ and fixtures/.

Grep for implementation-tracking words:
CURRENT, NEXT, shipped, Phase \d, proposed, planned, not started, TODO, FIXME, WIP.
For each hit, ask: is this describing the design (stays true) or
in-flight work (rots)? In-flight language belongs in commit messages,
PR descriptions, or .dev-docs/*-notes.md, not user-facing reference
docs.
Suppress for .dev-docs/*-notes.md — dated refactor notes legitimately
carry status language.
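
The staleness grep above can be sketched in Python. This is a minimal illustration, not part of the skill: the function name `find_status_language` is hypothetical, matching is assumed to be case-sensitive and whole-word, and `Phase \d` is widened to `Phase \d+` so multi-digit phases also match.

```python
import re
from fnmatch import fnmatch

# Status words copied from the list above (assumption: case-sensitive, whole-word).
STATUS_PATTERN = re.compile(
    r"\b(CURRENT|NEXT|shipped|Phase \d+|proposed|planned|not started|TODO|FIXME|WIP)\b"
)

# Dated refactor notes legitimately carry status language, so they are suppressed.
SUPPRESSED_GLOBS = (".dev-docs/*-notes.md",)


def find_status_language(path: str, text: str) -> list[tuple[str, int, str]]:
    """Return (path, line_number, matched_word) per hit; [] for suppressed paths."""
    if any(fnmatch(path, glob) for glob in SUPPRESSED_GLOBS):
        return []
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in STATUS_PATTERN.finditer(line):
            hits.append((path, lineno, match.group(0)))
    return hits
```

Each hit still needs the human judgment described above: a match only says the word is present, not whether it describes the design or in-flight work.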
Any fact stated in ≥2 docs that could be a link instead? Big offenders for benchflow:
- … architecture.md)
- … architecture.md; others should link.
- … architecture.md + one in task-authoring.md or .dev-docs/sdk-reference.md is OK if they illustrate distinct use cases; two near-identical register_agent(...) blocks is not.
- … .dev-docs/tested-agents.md; architecture.md should link, not duplicate.
- … docs/getting-started.md or docs/cli-reference.md; not re-listed in README.

Target state: architecture.md is the sole deep reference for internals,
cli-reference.md for commands, task-authoring.md for task YAML /
verifier shape. README and AGENTS.md link to them instead of duplicating.
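
One way to surface candidate duplicates is to hash the fenced code blocks in each doc and report any block that appears in two or more files. The sketch below is an assumption-laden illustration: `duplicated_blocks` is a hypothetical helper, and "near-identical" is approximated by whitespace normalization only, so it will miss copies that differ in anything else.

```python
import hashlib
import re
from collections import defaultdict

TICKS = "`" * 3  # fence marker, built indirectly so this example nests cleanly
FENCE = re.compile(rf"^{TICKS}[^\n]*\n(.*?)^{TICKS}", re.MULTILINE | re.DOTALL)


def normalize(block: str) -> str:
    """Collapse whitespace so trivially reformatted copies hash the same."""
    return " ".join(block.split())


def duplicated_blocks(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map a short digest to the sorted doc paths that share that code block."""
    seen: dict[str, set[str]] = defaultdict(set)
    for path, text in docs.items():
        for match in FENCE.finditer(text):
            body = normalize(match.group(1))
            if body:  # ignore empty fences
                digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:12]
                seen[digest].add(path)
    # Only blocks present in two or more docs are duplication candidates.
    return {d: sorted(paths) for d, paths in seen.items() if len(paths) > 1}
```

A hit is a prompt for review rather than a finding by itself, since two examples that illustrate distinct use cases are acceptable.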
- docs/cli-reference.md flag list ↔ actual Typer definitions in src/benchflow/cli/. Every documented flag resolves; every command in the CLI has a documented entry (or an intentional hide).
- docs/task-authoring.md YAML schema ↔ TaskConfig / loader in src/benchflow/tasks.py. Every field has a loader path.
- docs/architecture.md "Error Taxonomy" / "Trajectory event format" sections ↔ the actual dataclass fields in src/benchflow/models.py and emit sites in job.py / _trajectory.py.
- docs/architecture.md ACP Protocol section ↔ src/benchflow/acp/ and _acp_run.py.
- … `[text](other-doc.md)` link still point at a section that exists?

All markdown links resolve:

- `[text](path)` → file exists
- `[text](path#Lnum)` → line exists (file has ≥ N lines)
- `[text](#heading)` → heading exists in the same doc
- `[text](../foo.md)` → relative path resolves
- … `src/benchflow/sdk.py`) still exist — not strictly links, but the same drift vector.

… Key modules + link to docs/architecture.md). No deep rationale.

For a full review:
1. Dispatch in parallel. Spawn one Explore agent per full-review doc. Each agent gets: the doc path, the seven checks, "report ≤ 250 words, concrete file:line references only, no prose rewrites." Light-touch docs get a trimmed prompt (checks 1, 2, 6 only). Skip entirely the docs under "Skipped entirely."
2. Synthesize. Merge agent findings into a single punch list. Group by severity:
   Each item: `<severity> · <doc>:<line> — <one-line description>`.
3. Ask for approval. Present the punch list. Do NOT start editing. Wait for "fix 1-4", "ignore 5, it's intentional", "all", or similar.
4. Apply fixes. Edit only what was approved. After edits, re-verify the specific items you touched (don't re-run the full review).
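
The synthesize step can be sketched as a small merge-and-render routine. Everything here is illustrative: `Finding` and `render_punch_list` are hypothetical names, and the severity order (blocker, stale, polish) is assumed from the example punch list header rather than stated as a rule.

```python
from dataclasses import dataclass

# Assumed ordering, inferred from the example header "(3 blocker, 4 stale, 2 polish)".
SEVERITY_ORDER = ("blocker", "stale", "polish")


@dataclass(frozen=True)
class Finding:
    severity: str
    doc: str
    line: str  # a single line such as "114" or a range such as "98-134"
    description: str


def render_punch_list(findings: list[Finding]) -> str:
    """Merge per-agent findings, drop exact duplicates, and group by severity."""
    unique = list(dict.fromkeys(findings))  # preserves first-seen order
    counts = {s: sum(f.severity == s for f in unique) for s in SEVERITY_ORDER}
    summary = ", ".join(f"{counts[s]} {s}" for s in SEVERITY_ORDER)
    lines = [f"Docs review punch list ({summary})"]
    for severity in SEVERITY_ORDER:
        for f in unique:
            if f.severity == severity:
                # Item format from the step above: severity, doc:line, description.
                lines.append(f"{f.severity} · {f.doc}:{f.line} — {f.description}")
    return "\n".join(lines)
```

Rendering is kept separate from editing on purpose: the output is only presented for approval, and nothing in this sketch touches the docs.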
- .dev-docs/*-notes.md legitimately carry status language and reflect state at the time they were written; don't normalize them.
- registry.py and providers.py are the source of truth — a name missing from docs is a doc bug; a name missing from the registry is a code bug (surface separately, don't silently "fix" docs).

Docs review punch list (3 blocker, 4 stale, 2 polish)
Blockers:
- docs/architecture.md:114 — references AgentConfig at src/benchflow/agents/registry.py, but file defines AgentSpec (renamed)
- docs/cli-reference.md:142 — documents `benchflow verify --strict` flag; flag doesn't exist in cli/verify.py
- README.md:62 — example uses register_agent(..., model=...) kwarg; signature takes models=[...] (list)
Stale:
- docs/architecture.md:39 — "Phase 1: SETUP (host)" numbering implies sequential work-in-progress; phases are always-on
- docs/task-authoring.md:88 — "TODO: document verifier timeout knob"
- docs/getting-started.md:121 — example task ID "demo-fizzbuzz" renamed to "examples-fizzbuzz"
- .dev-docs/tested-agents.md:14 — lists claude-code-acp; registry only has claude-agent-acp
Polish:
- README.md:98-134 — full src/ tree duplicates docs/architecture.md:12-38
- AGENTS.md:14-22 — Setup block duplicates docs/getting-started.md; consider linking
Reply with which to fix, or "all".