| name | triage_ci_failure |
| description | Triage CI failures, flaky tests, and broken builds in the sequencer mono-repo. Auto-invoke when a user mentions a failing CI job, flaky test, red check, or pastes a GitHub Actions URL — context (PR link, CI job link, base branch) must be gathered BEFORE any code investigation begins. |
Triage CI Failure
When invoked (typically because someone tagged Claude in the mono-repo Slack channel about a CI failure or flaky test), follow this workflow to gather context before investigating.
Step 1: Gather Required Context
Before starting any investigation, you MUST have the following information. Check if any of these are missing from the message or thread:
Required Information
| Item | Why Needed | Example |
|---|---|---|
| PR link or branch name | To understand what code is being tested | https://github.com/starkware-libs/sequencer/pull/12345 or feature/my-branch |
| Failed CI job link | To obtain the check's details_url, which the user can open to copy and paste the relevant log lines | https://github.com/starkware-libs/sequencer/actions/runs/123456/job/789 |
| Base branch | The branch this PR targets. Check scripts/parent_branch.txt for the default; don't assume main | main, release/v1.2, feature/epic-branch |
| Is this a new failure or flaky? | Determines investigation approach | "Started failing today" vs "Fails ~10% of runs" |
Nice to Have
- Error message snippet (the available GitHub MCP tools only expose check-run metadata, not raw Actions log output, so a pasted snippet is often the fastest way to unblock the investigation)
- Whether this was working before a recent rebase
- Related PRs or recent merges that might have caused a regression
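The required-information check in Step 1 can be sketched as a small helper. This is a minimal illustration, not part of the repo: the function name `missing_context` and both regexes are assumptions, with URL shapes taken from the examples in the table above. A branch name, base branch, and failure kind cannot be detected reliably from free text, so they are passed in explicitly once known.

```python
import re
from typing import List, Optional

# URL shapes follow the examples in the Required Information table.
# The regexes themselves are illustrative assumptions.
PR_URL = re.compile(r"https://github\.com/[\w.-]+/[\w.-]+/pull/\d+")
JOB_URL = re.compile(
    r"https://github\.com/[\w.-]+/[\w.-]+/actions/runs/\d+(?:/job/\d+)?"
)


def missing_context(
    message: str,
    branch: Optional[str] = None,
    base_branch: Optional[str] = None,
    failure_kind: Optional[str] = None,
) -> List[str]:
    """Return the required Step 1 items that are still missing."""
    missing = []
    # A PR link in the message or an explicitly supplied branch both count.
    if not PR_URL.search(message) and branch is None:
        missing.append("PR link or branch name")
    if not JOB_URL.search(message):
        missing.append("Failed CI job link")
    if base_branch is None:
        missing.append("Base branch")
    if failure_kind is None:
        missing.append("Is this a new failure or flaky?")
    return missing
```

An empty list means Step 1 is satisfied; anything else is what to ask for in Step 2.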
Step 2: If Missing Information, Ask First
If ANY required information is missing, reply in the thread (Slack or PR comment, wherever you were invoked) asking for it. Do NOT start investigating with incomplete context.
Template response:
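To investigate this properly, I need a bit more context:
- PR link or branch name
- Failed CI job link
- Base branch
- Whether this is a new failure or a flaky one
Once I have these, I'll dig in!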
Adapt this based on what's already provided — only ask for what's missing.
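Adapting the template to what is already known can be sketched as follows. The helper name `build_context_request` is hypothetical; the opening and closing sentences are the ones from the template response, and the item labels are whatever the caller passes in.

```python
from typing import List


def build_context_request(missing: List[str]) -> str:
    """Compose the Step 2 reply, listing only the items still missing."""
    if not missing:
        # Nothing to ask for: skip the reply and move on to Step 3.
        return ""
    lines = ["To investigate this properly, I need a bit more context:"]
    lines += [f"- {item}" for item in missing]
    lines.append("Once I have these, I'll dig in!")
    return "\n".join(lines)
```

For example, `build_context_request(["Base branch"])` produces a three-line reply that asks for the base branch only.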
Step 3: Verify the Context
Once you have the required information:
- Open the PR: use mcp__github__pull_request_read with method=get to confirm the base branch, changed files, and any existing review comments
- Inspect the failed check: use method=get_check_runs for status/conclusion and the details_url; for raw Actions logs you'll need the user to paste them (no MCP tool returns them directly)
- Check if known flaky — search CLAUDE.md "Common Gotchas" and recent Slack history for known flaky tests
- Determine scope — is this related to the PR's changes, or a pre-existing/infrastructure issue?
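When confirming the base branch, the PR's target can be compared with the repo default recorded in scripts/parent_branch.txt. The sketch below assumes that file holds a single branch name, optionally followed by a newline; the actual file format is not confirmed here, and both function names are hypothetical.

```python
from pathlib import Path


def default_base_branch(repo_root: str) -> str:
    """Read the default base branch from scripts/parent_branch.txt.

    Assumption: the file contains one branch name and nothing else.
    """
    path = Path(repo_root, "scripts", "parent_branch.txt")
    return path.read_text(encoding="utf-8").strip()


def targets_default_branch(pr_base: str, repo_root: str) -> bool:
    """True when the PR's base branch equals the repo's default."""
    return pr_base == default_base_branch(repo_root)
```

A mismatch is not an error in itself (PRs may target a release or epic branch), but it is worth noting before comparing the failure against recent merges.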
Step 4: Investigate and Report
Only after completing steps 1-3, begin your investigation:
- If it's a code issue in the PR: identify the root cause, propose a fix
- If it's a known flaky test: link to prior discussions, explain the flakiness pattern
- If it's infrastructure/transient: suggest a re-run and explain why
- If unclear: share what you found and what you'd need to dig deeper
Always report back in the thread with:
- What you found
- Whether action is needed
- Proposed next steps (if any)
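The three-part report can be kept consistent with a small structure like the one below. This is only a sketch; the class name `TriageReport`, its fields, and the rendered wording are assumptions, not an established format in this repo.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TriageReport:
    finding: str                      # what you found
    action_needed: bool               # whether action is needed
    next_steps: List[str] = field(default_factory=list)  # optional

    def render(self) -> str:
        """Format the report as a plain-text thread reply."""
        lines = [
            f"What I found: {self.finding}",
            f"Action needed: {'yes' if self.action_needed else 'no'}",
        ]
        # Next steps are optional, so omit the section when empty.
        if self.next_steps:
            lines.append("Proposed next steps:")
            lines += [f"- {step}" for step in self.next_steps]
        return "\n".join(lines)
```

Keeping next steps optional mirrors the "(if any)" in the list above.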
Step 5: Commit and Push
When fixing the issue, create one commit per PR.
Common Patterns in This Repo
From CLAUDE.md — these failures are often NOT code bugs:
- blockifier_reexecution: transient GCloud network issues; suggest re-run
- merge-gatekeeper / merge-gatekeeper-new: downstream failures (other checks failed first)
- Formatting failures: run scripts/rust_fmt.sh (uses pinned nightly toolchain), NOT cargo fmt directly
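These patterns lend themselves to a quick lookup before any deeper investigation. The sketch below uses the job names listed above as exact keys. Matching formatting jobs by the substrings "fmt" or "format" is an assumption, since the actual formatting job name is not given here, and `known_pattern_hint` is a hypothetical helper.

```python
from typing import Optional

# Job names and advice come from the patterns listed above.
KNOWN_NON_CODE_FAILURES = {
    "blockifier_reexecution": "Transient GCloud network issue; suggest a re-run.",
    "merge-gatekeeper": "Downstream failure; look for the check that failed first.",
    "merge-gatekeeper-new": "Downstream failure; look for the check that failed first.",
}


def known_pattern_hint(job_name: str) -> Optional[str]:
    """Return advice for a known non-code failure, or None if unknown."""
    hint = KNOWN_NON_CODE_FAILURES.get(job_name)
    if hint is not None:
        return hint
    # Assumption: formatting jobs have "fmt" or "format" in their name.
    lowered = job_name.lower()
    if "fmt" in lowered or "format" in lowered:
        return "Formatting failure; run scripts/rust_fmt.sh, not cargo fmt directly."
    return None
```

A `None` result means the job matches no known pattern and should go through the full Step 4 investigation.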