Use when a change is written and "looks done" but has not had a hostile second pass before merge — especially diffs touching auth, money, migrations, concurrency, or anything the author is quietly unsure about. Spawns a fresh-eyes reviewer subagent that sees ONLY the diff and the spec, collects findings, drives fixes, and re-dispatches until findings degrade to nitpicks. Reach for this instead of self-reviewing; the author is the worst reviewer of their own diff.
Use when a PR is open and green-but-blocked, or red on CI for reasons that smell like flake — a timed-out test runner, a transient network 500 in a setup step, a check that passed locally but failed in CI. Reach for this whenever someone says "this PR keeps failing CI but the test is flaky", "can you babysit this PR to merge", "it's just a flaky check, retry it", or wants a PR shepherded through retries, conflict resolution, and auto-merge without sitting on it manually. Prefer this over hand-clicking "Re-run failed jobs" in the GitHub UI, which gives up no signal on flaky-vs-real and forgets to enable auto-merge.
Use when writing or reviewing code that meters API token usage, bills accounts, issues invoices, applies credit grants, or computes balances with the internal `billing` library — especially around retries, mid-cycle plan changes, cache-read vs cache-write token pricing, or any place where double-billing or rounding drift would be a problem.
Use when an API-credits checkout or paid-plan upgrade needs to be proven end-to-end against Stripe test mode — confirming a card charge actually creates the invoice and subscription in the right state, reproducing a "I paid but my credits didn't show up" report, checking that a declined or 3DS card fails the way the UI claims, or wiring a billing smoke test into CI so a checkout regression is caught before a customer's money is.
Use when a specific fix that's already on main needs to land on a production/release branch without dragging along everything else — a hotfix to backport, a "cherry-pick this commit onto release-2.4", a "we need just that one PR on prod" request. Reach for this whenever someone wants to port one or a few commits to a release branch and open a PR for it, especially before doing it by hand in their main checkout, which pollutes their working tree and routinely leaves conflict markers committed or loses the original commit's provenance.
Use when writing or editing code in this org's Python or JS/TS, especially before committing or opening a PR — and proactively the moment a diff adds an import, an except/catch, or any logging. Enforces the style rules Claude gets wrong by default: import grouping, error-wrapping (no bare except / empty catch), no leftover debug prints, explicit over clever. Runs scripts/check_style.sh (ruff, mypy --strict, eslint + grep guards) which exits nonzero so it drops into a pre-commit hook or CI.
Use when someone wants to compare two cohorts' retention or conversion, asks whether a difference between segments is "real" or "significant", wants retention curves for an A/B group or a launch vs control, or says one cohort "looks better" and needs the delta flagged with a p-value. Reach for this whenever the question is two-group comparison plus significance — and especially before eyeballing two percentages and declaring a winner, which ignores sample size and observation-window mismatch.
Use when a cloud bill already spiked and nobody knows why — storage or egress jumped, a specific bucket's cost ballooned, a cost-anomaly alert fired, or finance is asking "what changed this week?" Queries the billing export by service/bucket/region, diffs the suspect window against a prior baseline, and ranks the top movers. Insists on accounting for the 24-48h billing lag before declaring a trend, and points the investigation at the usual culprit: cross-region egress. Not for cost forecasting/estimation, general spend optimization, pricing questions, budget-alert setup, or latency "expense" — only for diagnosing an unexpected spike that already happened.