deep-review-task
// Run the sandboxed deep-review tool against a benchmark task PR — launches Claude Code in Docker with pre-fetched PR artifacts and writes review-summary.md / issues-found.md
| name | deep-review-task |
| description | Run the sandboxed deep-review tool against a benchmark task PR — launches Claude Code in Docker with pre-fetched PR artifacts and writes review-summary.md / issues-found.md |
| allowed-tools | Bash |
| argument-hint | <pr-url> [--review] |
Run the tools/deep-review sandboxed reviewer against a benchmark task PR. The tool spins up Claude Code inside a locked-down Docker container, pre-fetches all PR artifacts, and produces a full GUIDE.md-style review.
Prerequisites: Docker Desktop running, gh CLI authenticated, and either claude CLI logged in or ANTHROPIC_API_KEY set in tools/deep-review/.env. macOS or Linux/WSL.
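The prerequisites above could be checked up front with a small preflight function. This is a minimal sketch and not part of run.sh: the function name `preflight` and its messages are hypothetical, and it assumes the .env file holds a plain `ANTHROPIC_API_KEY=value` line. It only checks that the claude CLI is on PATH, not that it is logged in.

```shell
#!/usr/bin/env bash
# Hypothetical preflight check; not part of tools/deep-review itself.
preflight() {
  local env_file="$1"   # e.g. "$REPO_ROOT/tools/deep-review/.env"
  local missing=0

  # The Docker daemon must be reachable, not just installed.
  if ! docker info >/dev/null 2>&1; then
    echo "missing: Docker is not running"
    missing=1
  fi

  # gh must be installed and authenticated.
  if ! gh auth status >/dev/null 2>&1; then
    echo "missing: gh CLI is not authenticated"
    missing=1
  fi

  # Either the claude CLI is on PATH, or the .env file sets ANTHROPIC_API_KEY.
  if ! command -v claude >/dev/null 2>&1 &&
     ! grep -q '^ANTHROPIC_API_KEY=.' "$env_file" 2>/dev/null; then
    echo "missing: claude CLI or ANTHROPIC_API_KEY in $env_file"
    missing=1
  fi

  return "$missing"
}
```

Calling `preflight "$REPO_ROOT/tools/deep-review/.env" || exit 1` before the run would stop early with a list of what is missing.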
$ARGUMENTS is the PR URL, optionally followed by --review to auto-post a REQUEST_CHANGES review when the run finishes.
If no PR URL is given, stop and ask the user for one. Do not invent a URL.
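How $ARGUMENTS splits into a URL and an optional flag can be sketched in shell. The function name `parse_args` and the variables `PR_URL` and `POST_REVIEW` are hypothetical; the sketch assumes the arguments are whitespace-separated and treats anything other than --review as the PR URL.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split "<pr-url> [--review]" into two variables.
parse_args() {
  PR_URL=""
  POST_REVIEW=0
  local arg
  for arg in "$@"; do
    case "$arg" in
      --review) POST_REVIEW=1 ;;
      *)        PR_URL="$arg" ;;
    esac
  done
  # No URL given: return non-zero so the caller can stop and ask the user.
  [ -n "$PR_URL" ]
}
```

A caller might run `parse_args $ARGUMENTS || echo "Please provide a PR URL."`, leaving `$ARGUMENTS` unquoted on purpose so it splits into words.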
Find the repo root and verify tools/deep-review/run.sh exists:
REPO_ROOT=$(git rev-parse --show-toplevel)
RUN_SH="$REPO_ROOT/tools/deep-review/run.sh"
test -x "$RUN_SH" || { echo "deep-review not found at $RUN_SH — pull latest main"; exit 1; }
If missing, tell the user to git pull origin main (the tool landed in PR #605, commit 29b1f353) and stop.
Invoke run.sh directly with the PR URL. The default built-in prompt drives the full GUIDE.md review end-to-end — do not pass a custom prompt unless the user asked for one. Stream output live so the user sees thinking, tool calls, and the final cost line.
cd "$REPO_ROOT/tools/deep-review"
./run.sh --pr "<pr-url>" # or add --review to auto-post
Pass --review only if the user included it in $ARGUMENTS or explicitly asked. Posting to GitHub is visible to others — don't decide on your own.
When the run completes, the task directory is tools/deep-review/tasks/<owner>/<repo>/pr-<N>/. Tell the user the paths to:
- review-summary.md: the full review
- issues-found.md: the auto-extracted ## Issues Found section
- natural-difficulty-extensions.md: present only if the agent populated that section

Offer to open review-summary.md with open -g <path> so the user can read it in their default viewer.
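The task directory can be derived from the PR URL. A minimal sketch, assuming the URL has the form https://github.com/<owner>/<repo>/pull/<N>; the function name `task_dir` is hypothetical, and run.sh may compute the directory differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a PR URL to the task directory described above.
# Assumes URLs shaped like https://github.com/<owner>/<repo>/pull/<N>.
task_dir() {
  local repo_root="$1" url="${2%/}"   # drop a trailing slash, if any
  local rest="${url#*://*/}"          # strip scheme and host -> <owner>/<repo>/pull/<N>
  local owner repo kind number _
  IFS=/ read -r owner repo kind number _ <<<"$rest"
  # Reject anything that is not a pull-request URL.
  [ "$kind" = "pull" ] && [ -n "$number" ] || return 1
  printf '%s/tools/deep-review/tasks/%s/%s/pr-%s\n' \
    "$repo_root" "$owner" "$repo" "$number"
}
```

For example, `dir=$(task_dir "$REPO_ROOT" "<pr-url>")` gives the directory, and `"$dir/review-summary.md"` is then the path to report to the user.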
If --review was used, the final Review posted: <url> line from run.sh is the posted review. Surface that URL to the user.
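If the run output was also captured to a log file, the posted review URL can be pulled out of it. A sketch, assuming the line reads exactly `Review posted: <url>` with no colour codes around it; the function name `posted_review_url` and the log file are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the URL from the last "Review posted: <url>" line.
posted_review_url() {
  local log_file="$1"
  # -n suppresses default output; the p flag prints only lines where the
  # substitution matched, leaving just the URL.
  sed -n 's/^Review posted: *//p' "$log_file" | tail -n 1
}
```

One way to produce such a log while still streaming output is `./run.sh --pr "<pr-url>" --review 2>&1 | tee run.log`, followed by `posted_review_url run.log`. The function prints nothing if no review was posted.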
- review-summary.md is overwritten by the agent.
- ./run.sh "your prompt" --pr <url>: only use this if the user asked for a narrower focus.
- See tools/deep-review/README.md for details.