
incident-investigation

Name: Incident Investigation
Author: cnoe-io

Correlate PagerDuty incidents with Jira tickets and recent ArgoCD deployments to accelerate root cause analysis. Orchestrates multiple agents to build a timeline of events. Use when investigating active incidents, performing post-mortems, or correlating alerts with changes.


stars: 366

forks: 64

updated: 2026-04-15 11:21


SKILL.md


name	incident-investigation
description	Correlate PagerDuty incidents with Jira tickets and recent ArgoCD deployments to accelerate root cause analysis. Orchestrates multiple agents to build a timeline of events. Use when investigating active incidents, performing post-mortems, or correlating alerts with changes.

Incident Investigation

Perform multi-agent investigation by correlating PagerDuty incidents, Jira tickets, and ArgoCD deployment history to identify root cause and impacted systems.

Instructions

Phase 1: Gather Incident Data (PagerDuty Agent)

Fetch active incidents - list all triggered and acknowledged incidents
For each incident, collect:
- Incident ID, title, urgency, and status
- Service affected and escalation policy
- Triggered timestamp and duration
- Assigned responders and acknowledgment status
- Alert details and monitoring source
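The fields collected in this phase can be held in a small record type. The following is a minimal sketch, not the skill's actual implementation: the `Incident` class, its field names, and `active_incidents` are hypothetical. It assumes the incident data has already been fetched, and it treats `triggered` and `acknowledged` as the active statuses named above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Statuses that count as "active" in Phase 1 (triggered and acknowledged).
ACTIVE_STATUSES = {"triggered", "acknowledged"}


@dataclass
class Incident:
    """Hypothetical record for the Phase 1 fields; names are illustrative."""
    incident_id: str
    title: str
    urgency: str
    status: str
    service: str
    escalation_policy: str
    triggered_at: datetime
    responders: list = field(default_factory=list)
    alert_source: str = ""

    def duration(self, now: datetime) -> timedelta:
        """Time elapsed since the incident was triggered."""
        return now - self.triggered_at


def active_incidents(incidents):
    """Keep triggered and acknowledged incidents, oldest first."""
    active = [i for i in incidents if i.status in ACTIVE_STATUSES]
    return sorted(active, key=lambda i: i.triggered_at)
```

Sorting oldest first puts the longest-running incident at the top of the list, which is one reasonable default for triage.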

Phase 2: Correlate with Tickets (Jira Agent)

Search for related Jira tickets using:
- Incident ID or service name in ticket descriptions
- Recent tickets with labels like `incident`, `outage`, `p0`, `p1`
- Tickets linked to the affected service or component
For each related ticket, collect:
- Ticket key, summary, status, assignee
- Priority and labels
- Comments with recent updates
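The first two search criteria above can be expressed as a single JQL query. This is a sketch under stated assumptions: `build_jql` is a hypothetical helper, the criteria are combined with `OR` (the skill does not say how they combine), and the 7-day default for "recent" is an assumed value. The third criterion (tickets linked to a component) is left out because it depends on how components are configured in a given Jira project.

```python
def build_jql(incident_id, service, labels, days=7):
    """Build a JQL string for the Phase 2 ticket search (illustrative only)."""

    def quote(value):
        # Escape backslashes and double quotes for a JQL string literal.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    # Incident ID or service name mentioned in the ticket description.
    text_match = (
        f"(description ~ {quote(incident_id)} OR description ~ {quote(service)})"
    )
    # Tickets carrying any of the incident-related labels.
    label_match = "labels in (" + ", ".join(quote(name) for name in labels) + ")"
    # Restrict to recently created tickets; the window is an assumption.
    recent = f"created >= -{days}d"
    return f"({text_match} OR {label_match}) AND {recent} ORDER BY updated DESC"


query = build_jql("INC-1234", "payment-api", ["incident", "outage", "p0", "p1"])
```

The resulting string would then be passed to whatever Jira search interface the agent uses.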

Phase 3: Check Recent Deployments (ArgoCD Agent)

Search for recent deployments in the last 24 hours:
- Applications related to the affected service
- Any applications with recent sync operations
- Failed syncs or rollbacks
For each deployment, collect:
- Application name and sync status
- Deployment timestamp
- Revision/commit that was deployed
- Sync result (success, failed, pruned resources)
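The 24-hour filter can be sketched as a plain function over deployment records. The `Deployment` class and `deployments_in_window` are hypothetical names, and the sketch assumes the deployment history has already been retrieved from ArgoCD into these records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    """Hypothetical record for the Phase 3 fields; names are illustrative."""
    app: str
    sync_status: str
    deployed_at: datetime
    revision: str
    result: str  # e.g. "success" or "failed"


def deployments_in_window(deployments, reference, window=timedelta(hours=24)):
    """Return deployments from the `window` before `reference`, newest first."""
    start = reference - window
    hits = [d for d in deployments if start <= d.deployed_at <= reference]
    return sorted(hits, key=lambda d: d.deployed_at, reverse=True)
```

Passing the incident's triggered timestamp as `reference` limits the results to changes that could plausibly have caused it.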

Phase 4: Build Incident Timeline

Merge all events into a chronological timeline:
- Deployments -> Alerts triggered -> Incident created -> Responses
Identify correlations:
- Did a deployment happen shortly before the incident?
- Are multiple services affected (blast radius)?
- Is there a pattern (recurring incident)?
Assess impact:
- Which services/teams are impacted?
- Customer-facing or internal only?
- Estimated time to resolution
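Merging the sources and checking for a deployment shortly before the incident could look like the sketch below. `Event`, `build_timeline`, and `deployments_before` are hypothetical names, and the 30-minute lookback is an assumed threshold, since the skill does not define "shortly".

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Event(NamedTuple):
    """One timeline entry; `source` is e.g. "ArgoCD" or "PagerDuty"."""
    time: datetime
    description: str
    source: str


def build_timeline(*event_lists):
    """Merge events from all sources into one chronological list."""
    merged = [event for events in event_lists for event in events]
    return sorted(merged, key=lambda event: event.time)


def deployments_before(timeline, incident_time, lookback=timedelta(minutes=30)):
    """Return ArgoCD events that fall within `lookback` before the incident."""
    earliest = incident_time - lookback
    return [
        event
        for event in timeline
        if event.source == "ArgoCD" and earliest <= event.time < incident_time
    ]
```

A non-empty result is only a correlation signal. It flags a candidate cause for review rather than proving that the deployment caused the incident.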

Output Format

```markdown
## Incident Investigation Report

### Active Incidents

| Incident | Service | Urgency | Duration | Status |
|----------|---------|---------|----------|--------|
| INC-1234 | payment-api | High | 45m | Acknowledged |

### Timeline

| Time (UTC) | Event | Source |
|------------|-------|--------|
| 14:00 | ArgoCD sync: payment-api v2.3.1 deployed | ArgoCD |
| 14:12 | Alert: payment-api error rate >5% | PagerDuty |
| 14:15 | Incident INC-1234 created (High urgency) | PagerDuty |
| 14:18 | Acknowledged by @oncall-engineer | PagerDuty |

### Probable Root Cause

Deployment of payment-api v2.3.1 at 14:00 UTC introduced a regression.

### Recommended Actions

- Immediate: Rollback payment-api to v2.3.0 via ArgoCD
- Short-term: Review commit diff between v2.3.0 and v2.3.1
- Follow-up: Create post-mortem Jira ticket
```
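The Timeline section of the report can be generated from collected events rather than written by hand. This is a minimal sketch in which `render_timeline_table` is a hypothetical helper that takes `(time, event, source)` tuples and emits a Markdown table with the same three columns as the report format.

```python
def render_timeline_table(rows):
    """Render (time, event, source) tuples as a Markdown table."""
    lines = ["| Time (UTC) | Event | Source |", "|---|---|---|"]
    for time, event, source in rows:
        # Escape pipes so event text cannot break the table layout.
        cells = [str(cell).replace("|", "\\|") for cell in (time, event, source)]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


table = render_timeline_table([
    ("14:00", "ArgoCD sync: payment-api v2.3.1 deployed", "ArgoCD"),
    ("14:12", "Alert: payment-api error rate >5%", "PagerDuty"),
])
```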

Examples

"Show me active PagerDuty incidents and find related Jira tickets"
"Investigate the current outage - what changed recently?"
"Correlate the payment-api incident with recent deployments"
"Build a timeline for incident INC-1234"

Guidelines

Always start with PagerDuty for the source of truth on active incidents
Look at the 24-hour window before the incident for deployment correlation
If no incidents are active, report that clearly and suggest checking resolved incidents
When multiple incidents exist, group by service to identify blast radius
Include direct links to PagerDuty incidents and Jira tickets in the output
For recurring incidents, note the frequency and link to previous occurrences
Never suggest changes that could make the situation worse (e.g., deploying during an active incident)
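Two of these guidelines, grouping by service and reporting clearly when nothing is active, can be sketched together. `group_by_service` and `blast_radius_summary` are hypothetical helpers that take `(incident_id, service)` pairs, and the wording of the messages is illustrative.

```python
from collections import defaultdict


def group_by_service(incidents):
    """Map each service name to the IDs of its incidents."""
    groups = defaultdict(list)
    for incident_id, service in incidents:
        groups[service].append(incident_id)
    return dict(groups)


def blast_radius_summary(incidents):
    """Summarize affected services, or state clearly that none are active."""
    if not incidents:
        return "No active incidents. Consider checking recently resolved incidents."
    groups = group_by_service(incidents)
    parts = [f"{service}: {', '.join(ids)}" for service, ids in sorted(groups.items())]
    return f"{len(groups)} service(s) affected. " + "; ".join(parts)
```

The number of distinct services is used here as a rough proxy for blast radius. A fuller assessment would also consider downstream dependencies, which this sketch does not model.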

related-skills.json

Same repository

release-docs.md

from "cnoe-io/ai-platform-engineering"

Generate a combined release blog post for ai-platform-engineering. Produces a single docs/releases/YYYY-MM-DD-release-X-Y-Z.md file containing release notes and the upgrade guide (migration guide) inline. Use when cutting a release, when a user asks "what changed in 0.4.x", or when upgrading their values.yaml to a new chart version.

2026-05-12 · 366 stars

update-docs.md

from "cnoe-io/ai-platform-engineering"

Audit and update all documentation moving parts for ai-platform-engineering. Checks release blog posts, features page, agent docs, homepage version strings, Docusaurus version config, and sidebar completeness. Fixes what is stale and reports what needs manual attention. Use after cutting a release, adding a new agent, or updating platform features.

2026-05-07 · 366 stars

incident-postmortem-report.md

from "cnoe-io/ai-platform-engineering"

Produce a thorough incident post-mortem report after an outage or customer-impacting event. Covers executive summary, impact, detailed timeline, root cause, contributing factors, corrective and preventive actions, and lessons learned. Use when the user asks to write, draft, or complete a post-mortem, blameless review, or incident review document.

2026-05-05 · 366 stars

aws-cost-analysis.md

from "cnoe-io/ai-platform-engineering"

Analyze AWS costs by service, account, and time period. Identifies top spenders, cost anomalies, and optimization opportunities. Use when reviewing cloud spend, preparing cost reports, or investigating unexpected charges.

2026-04-15 · 366 stars

check-deployment-status.md

from "cnoe-io/ai-platform-engineering"

Check the health and sync status of all ArgoCD applications across clusters. Identifies out-of-sync, degraded, or unhealthy deployments and provides actionable remediation steps. Use when monitoring deployments, troubleshooting sync failures, or verifying environment health after a release.

2026-04-15 · 366 stars

cluster-resource-health.md

from "cnoe-io/ai-platform-engineering"

Check Kubernetes cluster health including pod status, node conditions, resource utilization, and pending alerts across EKS clusters. Use when monitoring infrastructure health, investigating capacity issues, or performing cluster audits.

2026-04-15 · 366 stars

package.json

"author": "cnoe-io"

"repository": "cnoe-io/ai-platform-engineering"


$ useful --forSOC

Network and Computer Systems Administrators · Computer and Mathematical Occupations · 15-1244 · L4

