Convert a TB3 task from Harbor's shared verifier mode (default) to separate verifier mode. Use when the user asks to "convert this task to separate verifier", "make the verifier run in its own container", or asks about Harbor's separate-verifier environment for a specific task.
Sum agent-trial spend ($) from TB3 PR comments and Modal compute spend. Reports totals by kind (/run vs /cheat) and provider (anthropic, openai, ...), top-spender PRs, per-task latest trial cost, and daily Modal billing. Use when the user asks how much was spent on agent trials, cost breakdown by provider, PR-level spend, or Modal usage.
Run the sandboxed deep-review tool against a benchmark task PR — launches Claude Code in Docker with pre-fetched PR artifacts and writes review-summary.md / issues-found.md
Propose new rubric criteria based on review findings — creates a branch and PR
| Field | Value |
| --- | --- |
| name | tb3-status |
| description | Generate TB3 review status report and open in browser |
| allowed-tools | Bash |
Generate the TB3 review status report, open it in the browser, and summarize the key numbers.
IMPORTANT: Run exactly ONE Bash tool call with this chained command:
uv run tools/review-status/generate_report.py --open 2>&1 && uv run tools/review-status/generate_report.py --json 2>/dev/null
The first command generates the HTML report and opens it in the browser. The second runs only if the first succeeds and prints the report data as JSON for your summary. Both must be in a single Bash call.
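To see why the JSON ends up as the last thing in the combined output, here is a minimal sketch of the same redirection pattern. The functions `fake_open` and `fake_json` are hypothetical stand-ins, not part of the tool; they only assume that each step writes some text to stdout and some to stderr.

```bash
#!/usr/bin/env bash
# Hypothetical stand-ins for the two generate_report.py invocations.
# Each writes one line to stdout and one line to stderr.
fake_open() { echo "report written"; echo "opening browser" >&2; }
fake_json() { echo "log noise" >&2; echo '{"ok": true}'; }

# Same shape as the chained command:
#   step 1: 2>&1       merges stderr into stdout, so all of its output is kept
#   &&                 runs step 2 only if step 1 exits with status 0
#   step 2: 2>/dev/null discards stderr, so only the JSON payload is printed
run_chain() { fake_open 2>&1 && fake_json 2>/dev/null; }

run_chain
```

In this sketch the first step contributes two lines and the second contributes only its JSON line, so a reader of the combined output can treat the trailing portion as the JSON payload.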
From the JSON, briefly summarize: