| name | oncall-runner |
| description | Use the moment a page fires and you are the oncall — an alert ID lands in PagerDuty, a "who's oncall, this just paged" Slack ping, or you wake up to a firing alert and need to triage before you understand it. Fetches the alert by ID, runs the standard first-five-minutes checks (is it flapping, did a deploy just land, is this cause or symptom), and emits a structured finding you can paste into the incident channel. Run this before manual digging. |
Oncall Runner
Overview
When a page fires, the first five minutes are wasted the same way every time:
pulling up the alert, figuring out if it already auto-resolved, checking whether a
deploy just landed, and deciding whether the thing that paged is the cause or
just a downstream symptom. This skill does that mechanically so you can think.
scripts/oncall.py takes an alert ID, fetches the alert, runs the standard
checks, and prints a structured finding. It is deliberately read-only — it
investigates and reports; it never remediates. You decide what to do with the
finding.
When to Use
Use this skill when:
- A page just fired and you are oncall — you have an alert ID (from PagerDuty,
Opsgenie, or the alert webhook) and need a fast triage.
- Someone pings the oncall channel with "this just paged, what is it?" and an
alert ID or link.
- You're handed a stale page (you woke up to a firing alert) and need to know
if it's still real or already auto-resolved.
Do NOT use this skill when:
- You don't have an alert ID — this skill keys off the alert, not free-text
symptoms. For symptom-first investigation of the inference API, use
inference-api-debugging.
- You're already past triage and deep in root-causing a confirmed incident.
- You need to correlate logs for a specific request — use
log-correlator.
How to Use
Run the script with the alert ID:
python scripts/oncall.py --alert-id PD-48213
The script will:
- Fetch the alert by ID and print its current state, severity, and the
metric/threshold that tripped it.
- Run the standard checks:
- Dedup / flapping — how many times has this alert fired and resolved in
the lookback window? A high flap count means the underlying signal is noisy.
- Auto-resolve status — did it already clear on its own? A page that's
already resolved is a different decision than one still firing.
- Recent deploy correlation — was there a deploy to the affected service in
the minutes before the alert? This is the single highest-value signal.
- Cause vs symptom — does the alert sit downstream of another currently-
firing alert? A 5xx page is often a symptom of a saturation or OOM page.
- Emit a structured finding — severity, the checks' results, a cause/symptom
call, and a recommended next step.
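The flap and deploy checks above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/oncall.py: the function names, the `(timestamp, state)` transition shape, and the plain list of deploy timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def flap_count(transitions, now, lookback):
    """Count complete firing -> resolved cycles inside the lookback window.

    `transitions` is a time-sorted list of (timestamp, state) tuples with
    state "firing" or "resolved" (a hypothetical shape). A cycle counts
    only if both its fire and its resolve fall inside the window.
    """
    cutoff = now - lookback
    cycles = 0
    firing = False
    for ts, state in transitions:
        if ts < cutoff:
            continue  # older than the lookback window
        if state == "firing":
            firing = True
        elif state == "resolved" and firing:
            cycles += 1
            firing = False
    return cycles

def deploys_before(alert_time, deploy_times, window):
    """Return the deploys that landed in the window leading up to the alert."""
    return [d for d in deploy_times if alert_time - window <= d <= alert_time]
```

Widening the lookback changes both answers, which is why a single window setting governs the flap and deploy checks together.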
Paste the finding into the incident channel. If the finding says "likely symptom
of" another alert, investigate that alert next (re-run the script with its ID).
Add --lookback 30m to widen the flap/deploy window, or --json to get
machine-readable output for chaining.
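One way to chain the `--json` output is to follow "likely symptom of" links automatically until you reach an alert with no upstream. The sketch below assumes the JSON finding is a single object with a `likely_symptom_of` field holding the upstream alert ID (or null); that field name, the helper names, and the hop limit are hypothetical, so check the script's actual output before relying on them.

```python
import json
import subprocess

def run_triage(alert_id):
    """Run the script with --json and parse stdout as one JSON object."""
    result = subprocess.run(
        ["python", "scripts/oncall.py", "--alert-id", alert_id, "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

def follow_symptom_chain(alert_id, triage=run_triage, max_hops=5):
    """Re-run triage on each upstream alert; return the findings in order.

    Stops when a finding names no upstream alert, when an alert ID repeats
    (a cycle), or after `max_hops` findings.
    """
    chain = []
    seen = set()
    while alert_id and alert_id not in seen and len(chain) < max_hops:
        seen.add(alert_id)
        finding = triage(alert_id)
        chain.append(finding)
        # ASSUMPTION: hypothetical field name for the upstream alert ID.
        alert_id = finding.get("likely_symptom_of")
    return chain
```

The last finding in the returned list is the best candidate for the cause. The `triage` parameter is injectable so the loop can be exercised against canned findings without calling the real script.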
Gotchas
- ALWAYS check the flap count before paging humans. An alert that has fired and
auto-resolved six times in the last hour is a noisy alert, not six incidents.
Escalating a flapping alert as if it were a fresh incident burns the team's
trust and their sleep. The script surfaces flap count first for exactly this
reason — read it before you act.
- An already-resolved page still deserves a look, but not a war room.
Auto-resolved doesn't mean nothing happened: the service may have shed load or
recovered on a retry. Note it, check the deploy correlation, but don't spin up
an incident for a green alert.
- Deploy correlation is a strong lead, not proof. A deploy landing right
before the alert is the prime suspect, but coincidences happen: a traffic spike
can coincide with an unrelated deploy. The script reports the correlation; you
still confirm causation (e.g. does rolling back fix it, does the timing line up
to the second).
- Distinguish cause from symptom before declaring root cause. The alert that
paged is frequently the loudest symptom, not the cause. A latency or 5xx page
often sits downstream of a saturation, OOM, or dependency alert that fired
earlier and more quietly. The script checks for upstream firing alerts; if it
finds one, investigate that one, not the one that woke you.
- The script is read-only on purpose. It will never ack, resolve, silence, or
remediate. That's your call — automating remediation off a possibly-flapping
signal is how outages get worse.
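Read together, the gotchas amount to an order of questions to ask of a finding. A minimal sketch of that order is below. The function name, its parameters, and the flap threshold of 3 are assumptions for illustration, and this is not how scripts/oncall.py derives its recommended next step.

```python
def recommend_next_step(flap_count, still_firing, upstream_alert_id,
                        recent_deploy, flap_threshold=3):
    """Map check results to a suggested next step (illustrative ordering).

    The flap count is consulted first, mirroring the advice to read it
    before acting. `flap_threshold` is an arbitrary example value.
    """
    if flap_count >= flap_threshold:
        return "Flapping: treat as a noisy alert, not a fresh incident."
    if not still_firing:
        return "Resolved: note it and check deploy correlation; no war room."
    if upstream_alert_id:
        return f"Symptom: investigate upstream alert {upstream_alert_id} first."
    if recent_deploy:
        return "Deploy: prime suspect; confirm causation before acting."
    return "No lead: continue manual investigation."
```

Like the script, the function only returns advice; acting on it stays with the oncall.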
Files
SKILL.md — this file: when to trigger, how to run the script, gotchas.
scripts/oncall.py — fetches an alert by ID, runs the standard first-response
checks (flap/dedup, auto-resolve, deploy correlation, cause-vs-symptom), and
emits a structured finding. Run this first when a page fires.