| name | babysit-pr |
| description | Use when a PR is open and green-but-blocked, or red on CI for reasons that smell like flake: a timed-out test runner, a transient network 500 in a setup step, a check that passed locally but failed in CI. Reach for this whenever someone says "this PR keeps failing CI but the test is flaky", "can you babysit this PR to merge", "it's just a flaky check, retry it", or wants a PR shepherded through retries, conflict resolution, and auto-merge without sitting on it manually. Prefer this over hand-clicking "Re-run failed jobs" in the GitHub UI, which gives no signal on flaky-vs-real and forgets to enable auto-merge. |
Babysit PR
Overview
A PR that's correct can still sit unmerged for an hour because one CI job
flaked, a base-branch advance left it behind, or auto-merge was never turned on.
Watching that by hand — refreshing the checks tab, clicking "Re-run failed jobs",
remembering to come back — is exactly the kind of low-judgment loop that should
be automated, but automated carefully: blindly re-running a failing check until
it goes green is how a real bug gets merged.
This skill watches a PR's checks, and when something fails it makes one decision
that actually matters: is this flake, or is this real? Flake gets a bounded
retry (capped, never an infinite loop). A real failure stops and reports back so
a human fixes the code. A PR that conflicts with or has fallen behind its base is
rebased onto (or merged with) the base branch. When everything is green and the
required gates are satisfied, it enables auto-merge (gh pr merge --auto) so the
PR lands the moment GitHub allows it, and gets out of the way.
The whole point is to be conservative: it would rather hand a PR back to you than
merge something it shouldn't.
When to Use
Reach for this when:
- A PR is open and you believe it's mergeable but CI is intermittently red — a
test runner that OOM'd, a flaky integration test, a transient registry pull
failure in a setup step.
- A PR fell behind its base branch and needs a rebase/merge to satisfy
"branch must be up to date" before it can merge.
- You want a PR to land as soon as it's eligible without watching it — enable
auto-merge once gates pass and walk away.
- You're triaging a queue of your own PRs and want each one nudged through the
last mile.
Do NOT use this when:
- The PR has failing checks that are clearly real (the diff broke a test, a
type error, a lint failure on the changed lines). Babysitting won't fix code —
fix the code.
- Reviews are still pending or changes are requested. This never overrides a
human review gate; it waits for approval, it doesn't bypass it.
- The conflict is semantic and needs judgment (two features touching the same
logic). Auto-rebase resolves textual conflicts in trivial cases only; anything
ambiguous is handed back.
- You don't have merge rights on the repo. The script will just spin on a
permission error, so check gh auth status first.
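A fail-fast preflight for that last case could look like the sketch below. It is an illustration, not code taken from babysit.sh: preflight is a hypothetical helper name, and treating WRITE, MAINTAIN, or ADMIN as sufficient is an assumption, since branch protection can still restrict who may merge.

```shell
# Hypothetical preflight: fail fast instead of spinning on a permission error.
preflight() {
    # `gh auth status` exits non-zero when no valid login is configured.
    if ! gh auth status >/dev/null 2>&1; then
        echo "preflight: gh is not authenticated; run 'gh auth login'" >&2
        return 1
    fi

    # viewerPermission is one of ADMIN, MAINTAIN, WRITE, TRIAGE, READ.
    perm=$(gh repo view --json viewerPermission -q .viewerPermission) || return 1
    case "$perm" in
        ADMIN|MAINTAIN|WRITE)
            return 0
            ;;
        *)
            # Assumption: anything below WRITE cannot merge, so hand back early.
            echo "preflight: permission '$perm' cannot merge; handing back" >&2
            return 1
            ;;
    esac
}
```

Calling preflight before the polling loop turns a silent spin into an immediate, explained exit.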
Running it
cd .claude/skills/babysit-pr
# Babysit a specific PR by number
./scripts/babysit.sh 4821
# Or babysit the PR for the current branch
./scripts/babysit.sh
Useful environment knobs (all optional):
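export POLL_INTERVAL_S=30          # seconds between polls
export MAX_RERUNS=2                # rerun cap per check
export FLAKE_ERROR_RATE_MAX=0.05   # error rate above which a "flake" is treated as real
export CONFLICT_STRATEGY=rebase    # rebase or merge
export REQUIRE_APPROVAL=1          # wait for an approving review before enabling auto-merge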
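On the script side, optional knobs like these are commonly read with shell default expansion and validated once at startup. The sketch below assumes the example values above double as defaults (the source only confirms that for REQUIRE_APPROVAL and CONFLICT_STRATEGY), and validate_knobs is a hypothetical helper name.

```shell
# Assumed defaults, mirroring the example values; the real script may differ.
: "${POLL_INTERVAL_S:=30}"
: "${MAX_RERUNS:=2}"
: "${FLAKE_ERROR_RATE_MAX:=0.05}"
: "${CONFLICT_STRATEGY:=rebase}"
: "${REQUIRE_APPROVAL:=1}"

# Hypothetical helper: reject bad values before the loop starts.
validate_knobs() {
    case "$CONFLICT_STRATEGY" in
        rebase|merge) ;;
        *)
            echo "CONFLICT_STRATEGY must be 'rebase' or 'merge'" >&2
            return 2
            ;;
    esac
    case "$MAX_RERUNS" in
        ''|*[!0-9]*)
            echo "MAX_RERUNS must be a non-negative integer" >&2
            return 2
            ;;
    esac
}
```

Validating up front keeps a typo such as CONFLICT_STRATEGY=squash from surfacing halfway through a rebase.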
The script prints a running log of each decision (rerunning check X as flake
attempt 1/2; check Y failed for real, stopping; rebased onto base; auto-merge
enabled) so the reasoning is auditable, and exits non-zero if it gives up so it
composes into a larger automation.
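One way the loop might pace itself and still give up with a non-zero status is sketched below. The name poll_with_ceiling, its argument order, and exit code 3 for "stalled" are assumptions for illustration; only POLL_INTERVAL_S comes from the knobs above.

```shell
# poll_with_ceiling MAX_WAIT_S INTERVAL_S COMMAND...
# Runs COMMAND until it succeeds. Returns 0 on success, or 3 ("stalled")
# once MAX_WAIT_S seconds have elapsed without a success.
poll_with_ceiling() {
    max_wait=$1
    interval=$2
    shift 2
    start=$(date +%s)
    while :; do
        if "$@"; then
            return 0
        fi
        now=$(date +%s)
        if [ $((now - start)) -ge "$max_wait" ]; then
            echo "stalled: gave up after ${max_wait}s" >&2
            return 3
        fi
        sleep "$interval"
    done
}

# Example (checks_settled is a hypothetical predicate):
#   poll_with_ceiling 3600 "${POLL_INTERVAL_S:-30}" checks_settled || exit $?
```

Because the give-up path returns a distinct non-zero code, a wrapping automation can tell "stalled" apart from "merged".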
How it decides flaky vs real
This is the load-bearing judgment, so it's deliberate rather than clever:
- Read the failed check's conclusion and logs, not just red/green. A job that
failed with a known-transient signature (connection reset, i/o timeout,
429 Too Many Requests, an OOM-killed runner, a registry/network pull failure
in a setup step before the test body ran) is a flake candidate.
- Cross-check the error rate via the grafana skill. Before calling something
"just flake", confirm the service isn't actually degraded: if grafana reports
that the relevant error rate is elevated (above FLAKE_ERROR_RATE_MAX), the
"transient" failure may be a real outage the PR is surfacing. In that case
treat it as REAL and stop. Flake is a quiet background failure, not a spike
everyone is seeing.
- A failure in the code under test (an expect/assert mismatch, a type error,
a lint error on changed lines) is real by default: never rerun it. Reruns are
only for infrastructure-shaped failures.
- Cap reruns at MAX_RERUNS per check. A check that still fails after
MAX_RERUNS reruns is no longer "flaky"; it's reliably broken. Stop, report,
hand back. The cap is what prevents the rerun-until-green antipattern that
merges real bugs.
When a flake candidate is identified, the script triggers gh run rerun --failed for just that run and resumes polling.
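A capped rerun step might look like the following sketch. rerun_flake and its attempt-counter argument are hypothetical; gh run rerun with a run ID and --failed is the real GitHub CLI command for rerunning only the failed jobs of a run. How the caller tracks attempts per check is left out.

```shell
# rerun_flake RUN_ID ATTEMPTS_SO_FAR
# Reruns only the failed jobs of one run, and refuses once the cap is reached.
rerun_flake() {
    run_id=$1
    attempts=$2
    max=${MAX_RERUNS:-2}

    if [ "$attempts" -ge "$max" ]; then
        echo "run $run_id: still failing after $attempts reruns; treating as real" >&2
        return 1
    fi

    echo "run $run_id: rerunning as flake, attempt $((attempts + 1))/$max"
    gh run rerun "$run_id" --failed
}
```

The cap check comes before the gh call, so raising MAX_RERUNS is the only way to get another attempt, and that change stays visible in the environment.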
Gotchas
ALWAYS treat these as real failure modes — each has cost someone a bad merge or a
wasted hour.
- Rerun-until-green merges bugs. The single most dangerous thing this
automation can do is keep re-running a genuinely failing check until variance
makes it pass once. That's why assertion/type/lint failures are never rerun,
and why infra-flake reruns are hard-capped at MAX_RERUNS. If you find yourself
wanting to raise the cap to 5, stop: the check is telling you the truth.
- "All checks green" is not "mergeable". A PR can show green checks and still
be blocked by a required status check that hasn't reported yet, or by
"branch must be up to date with base". Gate auto-merge on the repo's
required_status_checks and the mergeable state from
gh pr view --json mergeable,mergeStateStatus, not on the visible check list.
Enabling --auto is the right move here: it asks GitHub to merge when its gates
pass, which is the authoritative answer.
- Never auto-merge while reviews are pending or changes requested. Green CI
is not approval. With REQUIRE_APPROVAL=1 (default) the script refuses to enable
auto-merge until the review state is approved; it'll wait, not bypass.
Turning this off is only appropriate on repos with no review requirement.
- Rebase vs merge on conflict is not interchangeable. Rebasing rewrites the
PR's commits onto the new base (clean history, but it needs a force-push: fine
for a feature branch you own, hostile on a shared branch). Merging the base in
keeps history but adds a merge commit. Default to rebase for your own PRs;
switch to CONFLICT_STRATEGY=merge when others may have the branch checked out.
Either way, only trivial textual conflicts are auto-resolved: if git rebase
leaves conflict markers, the script aborts the rebase and hands the PR back
rather than committing a half-resolved tree.
- A flaky service is not flaky CI. If grafana shows the error rate spiking,
the "transient" CI failure is probably the real signal: the PR is catching a
live regression. Treat the spike as REAL and stop; don't paper over an incident
by retrying past it.
- Polling forever costs API quota and hides stalls. The loop has a wall-clock
ceiling; a PR that's been pending for an unreasonable time (a stuck queue, a
never-scheduled required check) is reported as stalled rather than polled
indefinitely. Silence is a failure mode too.
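The gating rules in these gotchas reduce to a small decision function. The sketch below assumes the three inputs were already fetched with gh pr view --json mergeable,mergeStateStatus,reviewDecision; next_action and its action words are hypothetical names, and the mapping from states to actions is one conservative reading of the rules, not the script's confirmed behavior.

```shell
# next_action MERGEABLE MERGE_STATE REVIEW_DECISION  ->  prints one action word
#   MERGEABLE:        MERGEABLE | CONFLICTING | UNKNOWN
#   MERGE_STATE:      e.g. CLEAN | BEHIND | BLOCKED | DIRTY | UNKNOWN
#   REVIEW_DECISION:  APPROVED | CHANGES_REQUESTED | REVIEW_REQUIRED | (empty)
next_action() {
    mergeable=$1
    merge_state=$2
    review=$3

    # Conflicting or behind base: bring the branch up to date first.
    if [ "$mergeable" = "CONFLICTING" ] || [ "$merge_state" = "DIRTY" ] \
        || [ "$merge_state" = "BEHIND" ]; then
        echo update-branch
        return 0
    fi
    # GitHub has not finished computing mergeability yet: poll again.
    if [ "$mergeable" = "UNKNOWN" ]; then
        echo wait
        return 0
    fi
    # Green CI is not approval: never bypass a pending or negative review.
    if [ "${REQUIRE_APPROVAL:-1}" = "1" ] && [ "$review" != "APPROVED" ]; then
        echo wait-for-review
        return 0
    fi
    # Let GitHub merge when its own gates pass (gh pr merge --auto).
    echo enable-auto-merge
}
```

Note that a BLOCKED but approved PR still gets auto-merge enabled: GitHub holds the merge until the required checks report, which is the point of --auto.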
Files
scripts/babysit.sh — the babysitter loop. Resolves the PR (arg or current
branch), polls checks via gh pr checks / gh run view, classifies each
failure as flake-vs-real (cross-checking the grafana skill for error rates),
reruns flakes up to MAX_RERUNS with gh run rerun --failed, rebases/merges
the PR up to date on conflict, and enables gh pr merge --auto once required
gates and review state are satisfied. Referenced by Running it above.
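The bring-up-to-date step described here could be sketched as follows. update_branch is a hypothetical name, the remote is assumed to be origin, and the PR branch is assumed to be checked out already; the git commands themselves are standard. Using --force-with-lease rather than a plain force-push is a choice made for this sketch, not something the source specifies.

```shell
# update_branch BASE_BRANCH
# Brings the checked-out PR branch up to date with origin/BASE_BRANCH.
# On any conflict it aborts and returns non-zero instead of resolving.
update_branch() {
    base=$1
    git fetch origin "$base" || return 1

    case "${CONFLICT_STRATEGY:-rebase}" in
        rebase)
            if ! git rebase "origin/$base"; then
                git rebase --abort
                echo "rebase onto $base hit conflicts; handing back" >&2
                return 1
            fi
            # Rebasing rewrites history, so the push must be forced.
            # --force-with-lease refuses if someone else pushed meanwhile.
            git push --force-with-lease
            ;;
        merge)
            if ! git merge --no-edit "origin/$base"; then
                git merge --abort
                echo "merge of $base hit conflicts; handing back" >&2
                return 1
            fi
            git push
            ;;
        *)
            echo "unknown CONFLICT_STRATEGY" >&2
            return 2
            ;;
    esac
}
```

Both branches abort before pushing, so a half-resolved tree never leaves the local checkout.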