| name | investigate-pd-incident |
| description | Investigate a PagerDuty incident by fetching it via the PagerDuty MCP server, classifying the alert type, and dispatching to the appropriate specialist skill. Use when the user provides a PagerDuty incident URL or ID (e.g. "investigate this PD alert", "what's going on with incident Q09D28GV4W96FQ", "diagnose https://recidiviz.pagerduty.com/incidents/..."). |
Skill: Investigate PagerDuty Incident
Overview
This is the generic entry point for triaging any PagerDuty incident. It pulls
the incident details via the PagerDuty MCP server, classifies what kind of alert
fired, dispatches to a specialist skill that knows how to diagnose that class of
failure, and then (after user confirmation) posts the full diagnosis back to the
PagerDuty incident as a note so the same context is visible to anyone else
triaging ā including future automated invocations of this skill.
The only write action this skill takes is adding a note to the incident, and
only after confirmation. Never acknowledge, resolve, or snooze the incident
unless the user explicitly asks.
Step 1: Verify the PagerDuty MCP server is connected
The PagerDuty MCP tools are named mcp__pagerduty__*. If those tools are not
available in this session, halt and tell the user:
"This skill needs the PagerDuty MCP server. Install it with:
claude mcp remove pagerduty 2>/dev/null; \
claude mcp add --transport http pagerduty https://mcp.pagerduty.com/mcp \
--header "Authorization: Token YOUR_PD_API_TOKEN"
Get a User API token at PagerDuty ā Profile ā My Profile ā User Settings ā
Create API User Token. Then restart Claude Code and re-run this skill.
Do not use the /mcp dialog's Authenticate button ā the hosted PD MCP
does not support OAuth dynamic client registration, which is what Claude
Code's built-in auth flow requires."
Step 2: Extract the incident ID
Accept either:
- A full URL like
https://recidiviz.pagerduty.com/incidents/Q09D28GV4W96FQ ā
the ID is the last path segment
- A bare incident ID (e.g.
Q09D28GV4W96FQ)
- An incident number (e.g.
26962) ā the MCP accepts this too
Step 3: Fetch the incident
Call mcp__pagerduty__get_incident with the ID. Also call
mcp__pagerduty__list_alerts_from_incident with query_model={"limit": 10} ā
the alerts carry richer detail than the incident itself. If the alert body's
details field is non-empty, inspect it. For email-integration alerts it will
usually be empty; that's expected.
Step 4: Establish the deployed version in the alerting project
Before dispatching to a specialist, find the deployed version of the
project that fired the alert. All downstream "what changed" queries must be
scoped to this tag, not to origin/main. Misattributing a prod alert to a
commit that isn't in the deployed version is a common and expensive mistake.
Derive the project id from the incident ā for Recidiviz Airflow alerts, the
dag_id prefix identifies it (recidiviz-123_* = prod, recidiviz-staging_*
= staging). Then run:
./recidiviz/tools/deploy/print_deployed_version.sh <project_id>
The TF-state version is authoritative for what code is running ā
always use that as deployed_tag. The script also prints a Cloud Run
previous row derived from case-triage-web's revision history; use
that as prev_deployed_tag for any "what changed between the last two
deploys" work in the specialist. The script also prints the latest
successful view-update version; if it lags the TF-state version by more
than ~3 hours, the update_managed_views_all task may be failing, which
matters for data-content alerts. For staging the script additionally
prints the latest tag on origin/main (prod has no reliable branch-tip
signal today). See recidiviz/tools/deploy/CLAUDE.md for details.
Pass the resolved deployed_tag (and the prior deployed tag, for diffs) to
the specialist in Step 5.
Step 5: Classify the alert
Use the incident title and service summary to decide which specialist to
invoke. Current dispatch table:
| Match | Specialist |
|---|
Title matches the regex ^Failure: Task Run: (?P<dag_id>[\w-]+)\.(?P<task_id>\w+), started: (?P<start>.+)$ and service summary contains Airflow Tasks | Read and follow .claude/skills/investigate-airflow-failure/SKILL.md ā pass through the parsed dag_id, task_id, incident start datetime, and the deployed_tag from Step 4 |
| (anything else) | Skip Step 6 and go directly to Step 7 |
The Airflow alert title format is produced by
AirflowAlertingIncident.unique_incident_id in
recidiviz/airflow/dags/monitoring/airflow_alerting_incident.py. If that format
changes and the regex stops matching, update both files.
Step 6: Post the diagnosis as a PagerDuty note (specialist path only)
Runs only after a specialist was dispatched in Step 5 and returned a diagnosis.
Skip this step when no specialist matched ā there is nothing to post.
6a: Draft the note
Take the same structured diagnosis you presented to the user and reformat it
into a PagerDuty note. Include all of the sections the specialist produced
ā not just the TLDR ā because this note may be the only context available to
a future reader (another on-caller, or another Claude invocation running
headlessly).
Target sections (in order), using Markdown bold headings rather than # ā PD
renders notes with limited Markdown and nested fenced blocks render poorly:
- A one-line banner:
š Claude Code diagnosis ā <specialist-name>
**TLDR** ā 2ā3 sentences in plain English
**Failure** ā DAG, task, run id, project, Airflow UI URL (one line each)
**Error** ā wrap traceback header + innermost Recidiviz frame + exception
message in a single fenced code block
**Source** ā file_path:line reference followed by a ā¤10-line code
block of the implicated function
**Recent changes to this file** ā bulleted list: ` ā
` with a short "(relevant: ā¦)" or "(no obvious connection)" note
**Suggested next steps** ā bullet list of concrete actions
- A
--- separator, then an attribution line: š¤ Posted by Claude Code via the 'investigate-pd-incident' skill. This is a diagnosis, not a fix.
Keep it faithful to what you already told the user ā do not soften, expand
speculation, or omit the error excerpt. Trim purely repetitive log noise
(e.g. the duplicated reason: 'stopped' lines) but keep the first real
error intact. If the session's diagnosis is already in that shape, you can
mostly copy it verbatim.
6b: Preview and confirm
Show the drafted note to the user in a fenced block and ask whether to post,
edit, or skip. Do not post without explicit confirmation ā add_note_to_incident
is a write action and the note is visible to everyone triaging the incident.
6c: Post the note
On user confirmation, call mcp__pagerduty__add_note_to_incident with the
incident id and the drafted content. Return the URL of the incident (not the
note ā the PD API does not expose a direct note URL) so the user can click
through to verify.
If the user declines, leave the incident alone and tell the user the note
content is preserved in the session transcript if they want to copy it
manually.
Step 7: No specialist matched
If no specialist matches, present a structured summary and stop:
- Incident:
[#<number>] <title> ā link to the PD URL
- Service:
<service.summary>
- Status:
<status> (triggered / acknowledged / resolved)
- Created:
<created_at> UTC
- First alert body (if non-empty): dump the
details JSON
- Recent notes: from
mcp__pagerduty__list_incident_notes if any
Then tell the user:
"No specialist skill exists for this alert type yet. If this class of alert
comes up regularly, we should add a specialist under
.claude/skills/investigate-<type>-failure/SKILL.md and register it in the
dispatch table of investigate-pd-incident/SKILL.md."
Notes
- Write action: the only write tool this skill calls is
mcp__pagerduty__add_note_to_incident, and only with user confirmation in
Step 6. Never call manage_incidents (ack/resolve/snooze) or any other
write tool unless the user explicitly asks.
- Pre-approval of add_note_to_incident is intentionally off: it is not
in
.claude/settings.json's allowlist, so every post surfaces a permission
prompt ā a second confirmation layer on top of the skill's own Step 6b
confirmation. Do not add it to the allowlist without team discussion.
- Never include PD tokens in command output you surface to the user ā they
live in
~/.claude.json and should stay there.
- Future headless use: if this skill is ever invoked non-interactively
(cloud job, cron, etc.), the interactive confirmation in Step 6b will block.
That mode will need an explicit mechanism (caller-provided flag or
pre-approved settings) before it can post autonomously ā do not work
around the confirmation in local sessions.
Related Documentation