update-rubric
// Propose new rubric criteria based on review findings — creates a branch and PR
| name | update-rubric |
| description | Propose new rubric criteria based on review findings — creates a branch and PR |
| allowed-tools | Bash, Read, Write, Edit, Agent |
| argument-hint | <criteria-description> |
Propose new criteria for the task implementation rubric. This is typically invoked after /review-task identifies general patterns that should be caught for all future tasks.
Run this from within your local clone of the benchmark repo.
Detect the repo root and read the current rubric to understand the format:
REPO_ROOT=$(git rev-parse --show-toplevel)
cat "$REPO_ROOT/rubrics/task-implementation.toml"
Each criterion follows this TOML format:
[[criteria]]
name = "criterion_name"
description = "One-line description"
guidance = """
Multi-line guidance explaining what to check.
PASS if ... FAIL if ..."""
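As a rough illustration of the format above, the following sketch checks that every `[[criteria]]` entry in a file carries the three keys `name`, `description`, and `guidance`. The function name `check_criteria` is hypothetical, and this is only a line-based sanity check under the assumption that keys start at the beginning of a line; it is not a TOML parser and does not replace a real syntax check.

```bash
# Sketch only: structural sanity check, NOT a TOML parser.
# check_criteria FILE -> prints "<n> criteria ok", or reports entries missing a key
# and returns a non-zero status. (check_criteria is a hypothetical helper name.)
check_criteria() {
  awk '
    # Report the entry collected so far if any required key was not seen.
    function check_entry() {
      if (seen && !(has_name && has_desc && has_guid)) {
        printf "entry %d is missing a required key\n", count
        bad = 1
      }
    }
    /^\[\[criteria\]\]/ {
      check_entry()
      seen = 1; count++
      has_name = 0; has_desc = 0; has_guid = 0
      next
    }
    /^name[ \t]*=/        { has_name = 1 }
    /^description[ \t]*=/ { has_desc = 1 }
    /^guidance[ \t]*=/    { has_guid = 1 }
    END {
      check_entry()
      if (!bad) printf "%d criteria ok\n", count
      exit (bad ? 1 : 0)
    }
  ' "$1"
}
```

It could be run as `check_criteria "$REPO_ROOT/rubrics/task-implementation.toml"` before and after editing the rubric.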
Parse $ARGUMENTS for the criteria to add. If invoked from /review-task, the rubric improvement candidates will be described in the review summary. If invoked standalone, the user provides a description of what to add.
For each candidate, draft a new [[criteria]] entry matching the existing format:
- name: snake_case, concise
- description: one-line summary
- guidance: detailed explanation with PASS/FAIL conditions

Present the drafted criteria to the user for review before proceeding.
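Before presenting a draft, it may help to check the proposed name mechanically. The sketch below assumes "snake_case" means lowercase letters and digits separated by single underscores, and that existing names appear in the rubric as `name = "..."` at the start of a line. Both helper names (`is_snake_case`, `name_exists`) are hypothetical.

```bash
# Sketch only: helper names and the exact snake_case rule are assumptions.

# is_snake_case NAME -> exit 0 if NAME looks like lowercase snake_case
is_snake_case() {
  printf '%s\n' "$1" | grep -Eqx '[a-z][a-z0-9]*(_[a-z0-9]+)*'
}

# name_exists NAME FILE -> exit 0 if FILE already has a criterion with that name.
# NAME is interpolated into a regex, so validate it with is_snake_case first.
name_exists() {
  grep -Eq "^name[[:space:]]*=[[:space:]]*\"$1\"[[:space:]]*\$" "$2"
}
```

A draft would then be rejected early if `is_snake_case "$candidate"` fails or `name_exists "$candidate" "$REPO_ROOT/rubrics/task-implementation.toml"` succeeds.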
Detect the GitHub remote and create a branch:
REPO_ROOT=$(git rev-parse --show-toplevel)
GITHUB_REPO=$(git remote get-url origin | sed 's|.*github.com[:/]\(.*\)\.git$|\1|; s|.*github.com[:/]\(.*\)$|\1|')
BRANCH="rubric/add-criteria-$(date +%Y%m%d)"
cd "$REPO_ROOT"
git checkout -b "$BRANCH"
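The `sed` expression above turns a remote URL into an `owner/repo` slug. A minimal sketch of the same idea as a standalone function, which makes it easier to try against sample URLs, is shown here. The function name `github_slug` and the `example-org/example-repo` URLs are illustrative, and only the common SSH and HTTPS URL shapes are assumed.

```bash
# Sketch only: github_slug is a hypothetical helper covering common URL shapes.
# github_slug URL -> prints "owner/repo"
github_slug() {
  # 1) drop everything up to and including "github.com:" or "github.com/"
  # 2) drop a trailing ".git"
  # 3) drop a trailing "/"
  printf '%s\n' "$1" | sed -E 's|^.*github\.com[:/]||; s|\.git$||; s|/$||'
}

github_slug "git@github.com:example-org/example-repo.git"
github_slug "https://github.com/example-org/example-repo"
```

Both calls print `example-org/example-repo`. In the workflow it would be used as `GITHUB_REPO=$(github_slug "$(git remote get-url origin)")`.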
Append the new [[criteria]] entries to rubrics/task-implementation.toml.
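One way the append step might look in shell is sketched below. The helper name `append_criterion` and the sample values are hypothetical, and the sketch assumes the description contains no double quotes or backslashes and the guidance contains no `"""` sequence, since it does no escaping.

```bash
# Sketch only: append_criterion is a hypothetical helper and performs no escaping.
# append_criterion FILE NAME DESCRIPTION GUIDANCE
append_criterion() {
  {
    # Leading blank line separates the new entry from the previous one.
    printf '\n[[criteria]]\n'
    printf 'name = "%s"\n' "$2"
    printf 'description = "%s"\n' "$3"
    # Guidance is written as a TOML multi-line basic string.
    printf 'guidance = """\n%s"""\n' "$4"
  } >> "$1"
}

# Example call with placeholder values:
# append_criterion "$REPO_ROOT/rubrics/task-implementation.toml" \
#   "example_criterion" "One-line description" \
#   "Multi-line guidance explaining what to check.
# PASS if ... FAIL if ..."
```

Using `>>` keeps the existing entries untouched, so the resulting diff contains only the new `[[criteria]]` blocks.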
cd "$REPO_ROOT"
git add rubrics/task-implementation.toml
git commit -m "Add rubric criteria: <brief description>"
git push -u origin "$BRANCH"
gh pr create --repo "$GITHUB_REPO" \
--title "Add rubric criteria: <brief description>" \
--body "$(cat <<'EOF'
## Summary
- Adds new criteria to `rubrics/task-implementation.toml` based on patterns observed during task review
## New Criteria
<list each new criterion name and one-line description>
## Motivation
<which review or PR prompted this, what pattern was observed>
## Test plan
- [ ] Run `harbor check` against existing tasks to verify new criteria don't cause false positives
- [ ] Verify TOML syntax is valid
Generated with [Claude Code](https://claude.ai/claude-code)
EOF
)"
Open the PR URL in the browser and report it to the user.
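After the PR is merged, remind the user to propagate the change to downstream repos. In each downstream clone (assuming this repo is configured there as a remote named template):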
git fetch template
git merge template/main
git push origin main
Finally, in the local clone of this repo, switch back to the main branch and delete the local feature branch:
git checkout main
git branch -d "$BRANCH"