code-cleanup
// Two-pass subagent sweep for trivial/small refactoring wins — find candidates, then verify each before recommending
| name | code-cleanup |
| description | Two-pass subagent sweep for trivial/small refactoring wins — find candidates, then verify each before recommending |
| user-invocable | true |
Find quick refactoring wins (dead code, duplication, stale comments, latent bugs hiding as style) via parallel subagents, then verify each finding in a second pass before presenting. Most value comes from the verification pass: single-pass sweeps routinely produce plausible-but-wrong suggestions, and a polluted punch list is worse than none.
Do not auto-fix. Report verified findings; let the user approve edits.
No fix may grow the file. A cleanup that adds net lines has failed the intent: dead code deletion, dedup via a shared helper, and trimming verbose comments should all come out flat or negative in LOC. Header additions (Cat 7) are the only cleanup that adds lines; they are capped at one docstring line plus optional section markers. If any other proposed fix would net-increase a file, drop it or reframe it as a refactor request for the user, not a cleanup.
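The line-budget rule can be checked mechanically before a fix is proposed. The following is a minimal sketch, not part of the skill itself: the function names and the `allowed_growth` parameter are hypothetical, and the allowance for header additions is an assumption based on the "one docstring line" cap stated above.

```python
def net_line_delta(before: str, after: str) -> int:
    """Return the net change in line count (after minus before)."""
    return len(after.splitlines()) - len(before.splitlines())


def within_cleanup_budget(before: str, after: str, allowed_growth: int = 0) -> bool:
    """True if the edit stays within the allowed net growth.

    allowed_growth is a hypothetical knob: 0 for ordinary cleanups,
    a small positive number for header additions (Cat 7).
    """
    return net_line_delta(before, after) <= allowed_growth


original = "import os\nimport sys\n\nprint(sys.argv)\n"
cleaned = "import sys\n\nprint(sys.argv)\n"            # unused import removed
with_header = '"""Print the CLI arguments."""\n' + original  # one docstring line added
```

Here `cleaned` passes with the default budget, while `with_header` passes only when `allowed_growth=1` is given explicitly, which keeps header additions a deliberate exception.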
- /code-cleanup: whole src/benchflow/, split across 3–4 parallel agents
- /code-cleanup <path>: single file or subtree (e.g. /code-cleanup src/benchflow/agents/)
- /code-cleanup --recent: limit to files changed in the last ~20 commits

Only these. If a candidate doesn't fit one of these, drop it.
- ty check src/ + git grep before flagging; ty catches some but not all.
- TODO/FIXME for work that shipped.
- if x is None after x: str with no reassignment; try/except around code that can't raise; isinstance(x, T) in a body where x: T.
- pyproject.toml entry (transitive-by-accident), imports pulled in only for types that can move under if TYPE_CHECKING:, conditional imports for platforms we don't support.
- except Exception: pass), mutable default arguments, shadowed builtins (list, id, type), is / is not on strings or numbers, off-by-one in slicing, fire-and-forget asyncio.create_task without storing a reference, blocking calls inside async def, opening files without with.
- src/benchflow/**/*.py module gets a one-line top-of-file docstring stating its responsibility (≤ ~110 chars, concrete domain language, no restating the filename). Modules >400 LOC with ≥3 loosely-related symbol groups also get section markers (# ── <section> ──). Serves both agents cold-reading a file and humans scanning the flat src/benchflow/ layout: the filename tells you the domain, the docstring tells you the boundary (e.g. _sandbox.py is 800 LOC mixing user-setup / path-lockdown / verifier-hardening). Co-located beats a central architecture doc because it drifts less. Skip modules that already have a meaningful one-line docstring.
- Args: / Returns: sections whose lines add nothing beyond the typed signature. Trim to one line or delete; never expand. Strong bias to keep: any comment encoding rationale, invariants, workarounds, cross-module intent, or non-obvious "why" is load-bearing even if long. When in doubt, leave it. Orientation headers (Cat 7) are exempt.

- /arch-audit territory)
- ruff format owns formatting; don't duplicate)
- src/benchflow/__init__.py re-exports is a contract; changes are out of scope for this skill
- /test-review instead, different rules apply

Discovery is unstable run-to-run: two passes over the same code with the same prompts routinely surface disjoint findings. A single sweep misses roughly half the real wins, so run discovery twice and union the results before verification; this is cheaper and more complete than over-slicing one run.
Each discovery run spawns 2–4 Explore subagents in parallel, each
covering a disjoint slice of the scope. Suggested split for a whole-repo
/code-cleanup:
- src/benchflow/sdk.py + src/benchflow/job.py + src/benchflow/_acp_run.py + src/benchflow/_scoring.py
- src/benchflow/_sandbox.py + src/benchflow/_env_setup.py + src/benchflow/_agent_setup.py + src/benchflow/_agent_env.py + src/benchflow/_credentials.py
- src/benchflow/agents/ + src/benchflow/acp/ + src/benchflow/cli/
- environments.py, metrics.py, models.py, process.py, tasks.py, task_download.py, _trajectory.py, skills.py, viewer.py)

Each agent prompt includes:
- ruff format / ruff check concerns; CI already owns those."

Each returns a ranked list with file:line | excerpt | one-line change | effort (trivial/small).
Union the two runs' findings (dedup by file:line + category), then hand the combined list to Pass 2.
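The union step can be sketched as a dictionary keyed on (file:line, category). This is an illustrative sketch only: the `Finding` dataclass, its field names, and the "first run wins" tie-break are assumptions, not something the skill prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical record for one discovery finding."""
    location: str  # "path/to/file.py:123"
    category: int  # category number, e.g. 1 for dead code
    change: str    # one-line description of the proposed change
    effort: str    # "trivial" or "small"


def union_findings(*runs: list[Finding]) -> list[Finding]:
    """Merge discovery runs, deduplicating by (location, category).

    When two runs report the same key, the first one seen is kept
    (an arbitrary tie-break chosen for this sketch).
    """
    merged: dict[tuple[str, int], Finding] = {}
    for run in runs:
        for finding in run:
            merged.setdefault((finding.location, finding.category), finding)
    return list(merged.values())


run_a = [
    Finding("src/benchflow/metrics.py:220", 1, "inline helper at sole caller", "trivial"),
    Finding("src/benchflow/viewer.py:1", 7, "add one-line header", "trivial"),
]
run_b = [
    Finding("src/benchflow/metrics.py:220", 1, "remove unused helper", "trivial"),
    Finding("src/benchflow/_acp_run.py:87", 6, "store task reference", "small"),
]
combined = union_findings(run_a, run_b)
```

Because dictionaries preserve insertion order, `combined` keeps the findings in the order they were first reported, which makes the list handed to Pass 2 stable for a given pair of runs.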
Spawn one Explore subagent per discovery agent's output. Prompt:
"Verify each claim against the actual code at . For each, quote the relevant code (file:line), run Grep/ty check src/ where needed, and return a verdict: real / false positive / nuanced. Include a one-sentence justification and an updated effort estimate. Explicitly check: (i) for Cat 1 (dead code), does the symbol have zero in-package importers AND zero external consumers via src/benchflow/__init__.py re-exports? (ii) for Cat 4 (over-defensive), is the 'unreachable' branch truly unreachable after considering optional kwargs, None defaults, and Unpack-style typed dicts? (iii) for Cat 6 (latent bug), can you name the concrete failure mode?"
This pass typically kills 30–50% of Pass 1 findings. That is the point — do not skip it.
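A rough sketch of how the verdicts might be filtered and the kill rate measured follows. The `Verified` record and both function names are hypothetical, and counting "nuanced" as killed is an assumption drawn from the rule that only verified-real findings are presented.

```python
from dataclasses import dataclass

# The three verdicts named in the verification prompt.
VERDICTS = ("real", "false positive", "nuanced")


@dataclass(frozen=True)
class Verified:
    """Hypothetical record for one Pass 2 result."""
    location: str
    verdict: str
    justification: str


def keep_real(results: list[Verified]) -> list[Verified]:
    """Return only findings verified as real; reject unknown verdicts."""
    for result in results:
        if result.verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {result.verdict!r}")
    return [r for r in results if r.verdict == "real"]


def kill_rate(results: list[Verified]) -> float:
    """Fraction of Pass 1 findings that did not survive verification."""
    if not results:
        return 0.0
    killed = sum(1 for r in results if r.verdict != "real")
    return killed / len(results)


results = [
    Verified("a.py:10", "real", "symbol has no importers"),
    Verified("b.py:22", "false positive", "branch reachable via None default"),
    Verified("c.py:5", "nuanced", "used only by an external consumer"),
    Verified("d.py:87", "real", "task reference is discarded"),
]
```

A kill rate far below the typical range may suggest the verification agents are rubber-stamping Pass 1 output rather than re-reading the code.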
Present only verified-real findings, grouped by bucket:
Present the list. Do NOT edit. Wait for "fix 1-4", "all", "skip 3",
"just the latent bugs", or similar. Apply only what's approved. After
edits, run ruff format, ruff check, ty check src/, and
.venv/bin/python -m pytest tests/ (fast unit subset) to re-verify only
what you touched.
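The post-edit checks could be scripted along these lines. This is a sketch under assumptions: the `run_checks` helper, its injectable `runner` parameter, and the stop-at-first-failure behavior are hypothetical; only the four commands themselves come from the text above, and no extra flags are added because the text does not specify any.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Commands exactly as listed in the text; no additional flags assumed.
CHECKS: tuple[tuple[str, ...], ...] = (
    ("ruff", "format"),
    ("ruff", "check"),
    ("ty", "check", "src/"),
    (".venv/bin/python", "-m", "pytest", "tests/"),
)


def _run(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code without raising."""
    return subprocess.run(list(cmd), check=False).returncode


def run_checks(
    checks: Sequence[Sequence[str]] = CHECKS,
    runner: Optional[Callable[[Sequence[str]], int]] = None,
) -> list[tuple[str, int]]:
    """Run checks in order, stopping at the first non-zero exit code."""
    run = runner if runner is not None else _run
    results: list[tuple[str, int]] = []
    for cmd in checks:
        code = run(cmd)
        results.append((" ".join(cmd), code))
        if code != 0:
            break  # design choice for this sketch: fail fast
    return results


def fake_runner(cmd: Sequence[str]) -> int:
    """Stand-in runner for demonstration: pretend the ty check fails."""
    return 1 if cmd[0] == "ty" else 0


demo = run_checks(runner=fake_runner)
```

Passing a runner makes the helper testable without invoking the real tools; calling `run_checks()` with no arguments would execute the actual commands in the current working directory.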
- /arch-audit.
- git bisect stays useful.
- src/benchflow/__init__.py re-exports. Public API surface; changes go through a normal PR review, not a cleanup sweep.

Code-cleanup sweep: 6 verified real (2 latent bugs, 4 trivial), 5 rejected
Latent bugs (separate commit):
- src/benchflow/_acp_run.py:87 — asyncio.create_task(...) result discarded; the task may be garbage-collected mid-run
- src/benchflow/agents/pi_acp_launcher.py:142 — except Exception: pass swallows auth failures silently
Trivial cleanups (janitor commit):
- src/benchflow/metrics.py:220-232 — compute_pass_rate has zero importers outside the module; inline at sole caller
- src/benchflow/_env_setup.py:41 + src/benchflow/_agent_setup.py:58 — _resolve_home_path byte-identical; extract shared util
- src/benchflow/viewer.py:1 — no module docstring; add one-line header (Cat 7)
- src/benchflow/_sandbox.py:1 — 800 LOC, 3 symbol groups; add `# ── User setup ──` / `# ── Path lockdown ──` / `# ── Verifier hardening ──` markers (Cat 7)
Rejected (checked, not real):
- src/benchflow/sdk.py:512 — isinstance check is for dict-vs-TypedDict at the YAML boundary; necessary
- src/benchflow/job.py:74 — TODO is a planning marker matched by an active ticket; leave
- src/benchflow/process.py:198 — try/except around subprocess is required; CalledProcessError is real
- src/benchflow/models.py:22 — RunResult fields look redundant but one feeds the SDK, other feeds viewer
- src/benchflow/tasks.py:31 — Optional[...] | None is defensive but loader feeds untyped YAML; keep
Reply with which to fix, or "all".