| name | incident-investigation |
| description | Correlate PagerDuty incidents with Jira tickets and recent ArgoCD deployments to accelerate root cause analysis. Orchestrates multiple agents to build a timeline of events. Use when investigating active incidents, performing post-mortems, or correlating alerts with changes. |
Incident Investigation
Perform multi-agent investigation by correlating PagerDuty incidents, Jira tickets, and ArgoCD deployment history to identify root cause and impacted systems.
Instructions
Phase 1: Gather Incident Data (PagerDuty Agent)
- Fetch active incidents - list all triggered and acknowledged incidents
- For each incident, collect:
  - Incident ID, title, urgency, and status
  - Service affected and escalation policy
  - Triggered timestamp and duration
  - Assigned responders and acknowledgment status
  - Alert details and monitoring source
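
The collection step above can be sketched as a small normalizer. This is a minimal illustration, not the agent's actual implementation: the input field names (`created_at`, `service.summary`, `escalation_policy.summary`, `assignments[].assignee.summary`) are modeled on the PagerDuty REST API incident object and should be verified against the payload your agent receives. `summarize_incident` and `active_incidents` are hypothetical helper names, and alert details are left out because their shape depends on the monitoring integration.

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2024-05-01T14:15:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_incident(incident: dict, now: datetime) -> dict:
    """Reduce one incident payload to the fields listed in Phase 1."""
    created = parse_ts(incident["created_at"])
    return {
        "id": incident["id"],
        "title": incident["title"],
        "urgency": incident["urgency"],
        "status": incident["status"],
        "service": incident["service"]["summary"],
        "escalation_policy": incident["escalation_policy"]["summary"],
        "triggered_at": created,
        # Duration is measured from the trigger time to "now", in whole minutes.
        "duration_minutes": int((now - created).total_seconds() // 60),
        "responders": [
            a["assignee"]["summary"] for a in incident.get("assignments", [])
        ],
        "acknowledged": incident["status"] == "acknowledged",
    }


def active_incidents(incidents: list[dict], now: datetime) -> list[dict]:
    """Keep only triggered and acknowledged incidents, then summarize them."""
    active = [i for i in incidents if i["status"] in {"triggered", "acknowledged"}]
    return [summarize_incident(i, now) for i in active]
```

Passing `now` explicitly (for example `datetime.now(timezone.utc)`) keeps the duration calculation deterministic and easy to test.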
Phase 2: Correlate with Tickets (Jira Agent)
- Search for related Jira tickets using:
  - Incident ID or service name in ticket descriptions
  - Recent tickets with labels like `incident`, `outage`, `p0`, `p1`
  - Tickets linked to the affected service or component
- For each related ticket, collect:
  - Ticket key, summary, status, assignee
  - Priority and labels
  - Comments with recent updates
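
One way to express the search criteria above is a single JQL query. The sketch below is an assumption-laden example: `build_incident_jql` is a hypothetical helper, the 7-day `lookback_days` default is an arbitrary stand-in for "recent", and the component filter is omitted because component names vary by Jira project.

```python
def quote(value: str) -> str:
    """Quote a value for JQL, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_incident_jql(
    incident_id: str,
    service: str,
    labels=("incident", "outage", "p0", "p1"),
    lookback_days: int = 7,  # assumption: "recent" means the last 7 days
) -> str:
    """Build a JQL query matching tickets that mention the incident or
    service, or that carry one of the incident-related labels."""
    text_match = f"text ~ {quote(incident_id)} OR text ~ {quote(service)}"
    label_match = "labels in (" + ", ".join(quote(label) for label in labels) + ")"
    return (
        f"({text_match} OR {label_match}) "
        f"AND updated >= -{lookback_days}d ORDER BY updated DESC"
    )


print(build_incident_jql("INC-1234", "payment-api"))
```

The three criteria are combined with `OR` so that a ticket matching any one of them is returned; tighten this to `AND` if the label search produces too many unrelated results.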
Phase 3: Check Recent Deployments (ArgoCD Agent)
- Search for deployments in the 24 hours before the incident was triggered:
  - Applications related to the affected service
  - Any applications with recent sync operations
  - Failed syncs or rollbacks
- For each deployment, collect:
  - Application name and sync status
  - Deployment timestamp
  - Revision/commit that was deployed
  - Sync result (success, failed, pruned resources)
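
The deployment check above can be sketched as a filter over an ArgoCD Application object, for example the JSON from `argocd app get <name> -o json`. This is an illustration under assumptions: it reads `status.history[].deployedAt`, `status.history[].revision`, `status.sync.status`, and `status.operationState.phase`, which should be confirmed against your ArgoCD version, and `recent_deployments` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2024-05-01T14:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def recent_deployments(
    app: dict,
    incident_at: datetime,
    window: timedelta = timedelta(hours=24),
) -> list[dict]:
    """Return deployments of one application that fall inside the
    window ending at the incident's trigger time, oldest first."""
    name = app["metadata"]["name"]
    status = app.get("status", {})
    sync_status = status.get("sync", {}).get("status", "Unknown")
    # Phase of the most recent sync operation, reported per application.
    last_phase = status.get("operationState", {}).get("phase", "Unknown")

    found = []
    for entry in status.get("history", []):
        deployed_at = parse_ts(entry["deployedAt"])
        if incident_at - window <= deployed_at <= incident_at:
            found.append(
                {
                    "application": name,
                    "sync_status": sync_status,
                    "last_operation_phase": last_phase,
                    "deployed_at": deployed_at,
                    "revision": entry.get("revision", ""),
                }
            )
    return sorted(found, key=lambda d: d["deployed_at"])
```

Anchoring the window to the incident's trigger time rather than to the current time keeps the result stable when the same incident is revisited during a post-mortem.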
Phase 4: Build Incident Timeline
- Merge all events into a chronological timeline:
  - Deployments -> Alerts triggered -> Incident created -> Responses
- Identify correlations:
  - Did a deployment happen shortly before the incident?
  - Are multiple services affected (blast radius)?
  - Is there a pattern (recurring incident)?
- Assess impact:
  - Which services/teams are impacted?
  - Is the impact customer-facing or internal only?
  - What is the estimated time to resolution?
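
The merge and the "deployment shortly before the incident" check can be sketched as follows. The `Event` type, the `kind` values, and the 30-minute `threshold` are illustrative assumptions; the text above does not define how short "shortly" is, so tune the threshold to your environment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    at: datetime       # when the event happened (timezone-aware, UTC)
    description: str   # human-readable summary for the timeline table
    source: str        # "ArgoCD", "PagerDuty", or "Jira"
    kind: str          # "deployment", "alert", "incident", or "response"


def build_timeline(*streams: list[Event]) -> list[Event]:
    """Merge events from every source into one chronological list."""
    return sorted((e for stream in streams for e in stream), key=lambda e: e.at)


def suspect_deployments(
    timeline: list[Event],
    incident_at: datetime,
    threshold: timedelta = timedelta(minutes=30),  # assumption, not from the source
) -> list[Event]:
    """Return deployments that finished at most `threshold` before the incident."""
    return [
        e
        for e in timeline
        if e.kind == "deployment"
        and timedelta(0) <= incident_at - e.at <= threshold
    ]
```

A deployment flagged this way is only a candidate cause: temporal proximity suggests where to look first, and the commit diff still needs review before the report names a root cause.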
Output Format
```markdown
Incident Investigation Report
Active Incidents
| Incident | Service | Urgency | Duration | Status |
|---|---|---|---|---|
| INC-1234 | payment-api | High | 45m | Acknowledged |
Timeline
| Time (UTC) | Event | Source |
|---|---|---|
| 14:00 | ArgoCD sync: payment-api v2.3.1 deployed | ArgoCD |
| 14:12 | Alert: payment-api error rate >5% | PagerDuty |
| 14:15 | Incident INC-1234 created (High urgency) | PagerDuty |
| 14:18 | Acknowledged by @oncall-engineer | PagerDuty |
Probable Root Cause
Deployment of payment-api v2.3.1 at 14:00 UTC likely introduced a regression: the error-rate alert fired 12 minutes after the sync.
Recommended Actions
- Immediate: Rollback payment-api to v2.3.0 via ArgoCD
- Short-term: Review commit diff between v2.3.0 and v2.3.1
- Follow-up: Create post-mortem Jira ticket
```
Examples
- "Show me active PagerDuty incidents and find related Jira tickets"
- "Investigate the current outage - what changed recently?"
- "Correlate the payment-api incident with recent deployments"
- "Build a timeline for incident INC-1234"
Guidelines
- Always start with PagerDuty, which is the source of truth for active incidents
- Look at the 24-hour window before the incident for deployment correlation
- If no incidents are active, report that clearly and suggest checking resolved incidents
- When multiple incidents exist, group by service to identify blast radius
- Include direct links to PagerDuty incidents and Jira tickets in the output
- For recurring incidents, note the frequency and link to previous occurrences
- Never suggest changes that could make the situation worse (e.g., deploying new changes during an active incident)
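
The guideline on grouping by service to identify blast radius can be sketched as below. It assumes each incident has already been reduced to a dict with `id` and `service` keys; `group_by_service` is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_service(incidents: list[dict]) -> dict[str, list[str]]:
    """Map each service to its incident IDs, most-affected service first."""
    groups: dict[str, list[str]] = defaultdict(list)
    for incident in incidents:
        groups[incident["service"]].append(incident["id"])
    # Sort by incident count (descending), then by service name for stable output.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return dict(ordered)
```

More than one key in the result indicates that several services are affected, which is worth calling out at the top of the report.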