with one click
incident-postmortem
Write blameless incident postmortems with structured timeline reconstruction, impact quantification, contributing factor analysis, and actionable follow-up items with owners and deadlines.
Menu
Write blameless incident postmortems with structured timeline reconstruction, impact quantification, contributing factor analysis, and actionable follow-up items with owners and deadlines.
| name | incident-postmortem |
| description | Write blameless incident postmortems with structured timeline reconstruction, impact quantification, contributing factor analysis, and actionable follow-up items with owners and deadlines. |
| metadata | {"displayName":"Incident Postmortem","categories":["operations","engineering"],"tags":["incidents","postmortem","blameless","reliability","incident-response"],"worksWellWithAgents":["devops-engineer","engineering-manager","incident-commander","site-reliability-architect","sre-engineer"],"worksWellWithSkills":["debugging-guide","disaster-recovery-plan","runbook-writing","ticket-writing"]} |
Gather the following from the user:
If the user gives you a vague summary ("the site went down for a bit"), push back: "What specific errors did users see? Which services were affected? When exactly did alerts fire vs. when was the issue resolved?"
Use the following structure for every postmortem:
Write 3-5 sentences covering: what broke, who was affected, how long it lasted, and how it was resolved. This should be understandable by someone outside the team.
On 2024-03-12 at 14:32 UTC, the checkout service began returning 500 errors
for all payment processing requests. Approximately 12,000 users were unable to
complete purchases during the 47-minute outage. The issue was caused by an
expired TLS certificate on the payment gateway. Service was restored at 15:19
UTC by rotating the certificate.
Use UTC timestamps. Include detection lag (time between incident start and first alert). Mark each entry with a category tag.
14:32 UTC [ONSET] First 500 errors appear in payment service logs
14:38 UTC [DETECTION] PagerDuty alert fires for checkout error rate > 5%
14:40 UTC [RESPONSE] On-call engineer acknowledges alert
14:45 UTC [DIAGNOSIS] Engineer identifies TLS handshake failures in logs
14:52 UTC [ESCALATION] Platform team paged for certificate access
15:10 UTC [MITIGATION] New certificate issued and deployed to staging
15:15 UTC [MITIGATION] Certificate deployed to production
15:19 UTC [RESOLUTION] Error rates return to baseline, incident closed
Quantify impact with actual numbers, not vague language:
List every factor that contributed to the incident occurring or lasting longer than it should have. Frame these as system failures, not personal failures.
- Certificate expiry was tracked in a spreadsheet with no automated alerting
- The payment service had no fallback path when TLS negotiation fails
- Runbook for certificate rotation was last updated 18 months ago and
referenced a deprecated tool
- On-call engineer did not have permissions to rotate certificates,
requiring escalation
Identify the deepest systemic cause. The root cause is never "someone made a mistake" — it is the system condition that allowed the mistake to have impact.
Root cause: Certificate lifecycle management relied on manual tracking without
automated expiry alerts or rotation. The system had no defense against expiry
because it was treated as a one-time setup rather than an ongoing concern.
Every action item must have an owner, a deadline, and a priority. Use this format:
| Priority | Action Item | Owner | Deadline | Ticket |
|---|---|---|---|---|
| P0 | Add automated certificate expiry alerting (30/14/7 day warnings) | @platform-team | 2024-03-19 | OPS-891 |
| P1 | Implement certificate auto-rotation for payment service | @platform-team | 2024-04-01 | OPS-892 |
| P1 | Grant on-call engineers certificate rotation permissions | @security-team | 2024-03-15 | SEC-234 |
| P2 | Add TLS handshake failure to checkout service health check | @checkout-team | 2024-04-15 | CHK-567 |
Priority definitions: P0 — before next on-call rotation, prevents recurrence. P1 — within 2 weeks, reduces severity or detection time. P2 — within 30 days, improves resilience or observability.
Include three categories:
Before delivering the postmortem, verify:
Write structured accessibility audit reports with findings mapped to WCAG criteria, severity levels, affected components, remediation steps, and a prioritized fix timeline.
Write effective prompts for AI coding tools — structure task context, specify constraints, provide examples, and iterate based on output quality. Targets developers using Claude Code, Copilot, Cursor, and similar tools.
Design RESTful and GraphQL APIs with consistent naming conventions, versioning strategy, pagination patterns, error response formats, authentication schemes, and rate limiting — producing a complete API specification.
Create REST and GraphQL integration guides covering authentication flows, pagination strategies, rate limiting, error handling, retry logic, and SDK wrapper patterns. Produces implementation-ready reference documents with code examples.
Write Architecture Decision Records (ADRs) that capture context, options considered, decision rationale, and consequences — creating a searchable decision log for future engineers.
Design and document automation workflows for n8n, Make, Zapier, and Power Automate. Covers trigger-action chain mapping, conditional branching, error handling, retry logic, and workflow diagrams with platform-specific node references.