---
name: incident-runbook-templates
description: Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.
---
# Incident Runbook Templates
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
## When to Use This Skill
- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers
## Core Concepts
### 1. Incident Severity Levels
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
### 2. Runbook Structure
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
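As a rough sketch, this structure can be scaffolded into a new runbook file up front; the file name and placeholder headings below are illustrative, not a fixed standard.

```bash
# Sketch: scaffold an empty runbook following the nine-section structure above.
cat > service-outage-runbook.md <<'EOF'
# [Service Name] Outage Runbook

## Overview & Impact
## Detection & Alerts
## Initial Triage
## Mitigation Steps
## Root Cause Investigation
## Resolution Procedures
## Verification & Rollback
## Communication Templates
## Escalation Matrix
EOF
```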
## Runbook Templates
### Template 1: Service Outage Runbook
# [Service Name] Outage Runbook
## Overview
**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall
## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?
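Quick PromQL queries can put rough numbers on these questions. The sketch below reuses the `http_requests_total` metric from the alerts in this runbook; the `job` label and Prometheus endpoint are assumptions.

```bash
# Percentage of payment requests currently failing (metric/label names assumed)
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=100 * sum(rate(http_requests_total{job="payment-service",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="payment-service"}[5m]))' \
  | jq -r '.data.result[0].value[1]'

# Approximate number of failed requests over the last 5 minutes
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(increase(http_requests_total{job="payment-service",status=~"5.."}[5m]))' \
  | jq -r '.data.result[0].value[1]'
```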
## Detection
### Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
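The first threshold, expressed as a Prometheus alerting rule, might look like the sketch below; the file name, `job` label, and severity routing are assumptions rather than part of this runbook.

```bash
# Sketch: the error-rate alert as a Prometheus alerting rule (names illustrative).
cat > payment-alerts.yml <<'EOF'
groups:
  - name: payment-service
    rules:
      - alert: PaymentErrorRateHigh
        expr: |
          100 * sum(rate(http_requests_total{job="payment-service",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="payment-service"}[5m])) > 5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Payment error rate above 5% for 5 minutes"
EOF
promtool check rules payment-alerts.yml   # validate the rule file before deploying
```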
### Dashboards
- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)
## Initial Triage (First 5 Minutes)
### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service
# Check recent deployments
kubectl rollout history deployment/payment-service -n payments
# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
### 2. Quick Health Checks
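A few cheap probes to run alongside the scope check. The health endpoint below is the one used in Verification Steps; the deployment name and namespace follow the rest of this runbook.

```bash
# Service health endpoint
curl -sf https://api.company.com/payments/health | jq .

# Ready vs. desired replicas for the deployment
kubectl get deployment payment-service -n payments \
  -o jsonpath='{.status.readyReplicas}/{.spec.replicas}{"\n"}'

# Recent cluster events in the namespace (crash loops, OOM kills, scheduling issues)
kubectl get events -n payments --sort-by=.lastTimestamp | tail -n 20
```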
### 3. Initial Classification
| Symptom | Likely Cause | Go To Section |
|---|---|---|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
## Mitigation Procedures
### 4.1 Service Completely Down
```bash
# Check pod status and restart counts
kubectl get pods -n payments

# Inspect recent logs for crash loops or startup errors
kubectl logs -n payments -l app=payment-service --tail=100

# Review recent deployments; if a rollout correlates with the outage, roll back
kubectl rollout history deployment/payment-service -n payments
kubectl rollout undo deployment/payment-service -n payments

# If pods are healthy but overwhelmed, scale out
kubectl scale deployment/payment-service -n payments --replicas=10

# Wait for the rollout or scale-out to complete
kubectl rollout status deployment/payment-service -n payments
```
### 4.2 High Latency
```bash
# Check database connection pool saturation from the service's metrics endpoint
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool

# Find queries that have been running for more than 5 seconds
psql -h $DB_HOST -U $DB_USER -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Terminate a specific runaway query by its pid (from the query above)
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(<pid>);"

# Check latency of the upstream payment provider
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# If the provider is slow, enable the circuit breaker to fail fast
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```
### 4.3 Partial Failures (Specific Errors)
```bash
# Aggregate recent error messages to find the dominant failure mode
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# If errors trace to a recently launched feature, disable it via feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Check for recent data changes that could explain the failures
psql -h $DB_HOST -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
```
### 4.4 Traffic Surge
```bash
# Check current resource usage
kubectl top pods -n payments

# Scale out to absorb the surge
kubectl scale deployment/payment-service -n payments --replicas=20

# Enable rate limiting to protect downstream dependencies
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# If the surge comes from an identifiable range, block it with a NetworkPolicy
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
    - from:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 192.168.1.0/24  # Suspicious range
EOF
```
## Verification Steps
```bash
# Confirm the health endpoint reports healthy
curl -s https://api.company.com/payments/health | jq

# Confirm the 5xx error rate has returned to baseline
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Confirm p99 latency is back within SLO
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Run the end-to-end smoke test
./scripts/smoke-test-payments.sh
```
## Rollback Procedures
```bash
# Roll back the application deployment to the previous revision
kubectl rollout undo deployment/payment-service -n payments

# Roll back the most recent database migration, if one was involved
./scripts/db-rollback.sh $MIGRATION_VERSION

# Disable the new payment flow behind its feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```
## Escalation Matrix
| Condition | Escalate To | Contact |
|---|---|---|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
## Communication Templates
### Initial Notification (Internal)
```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```
### Status Update
```
🔄 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```
### Resolution Notification
```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```
### Template 2: Database Incident Runbook
# Database Incident Runbook
## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |
## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;
-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND query_start < now() - interval '10 minutes';
```
## Replication Lag
```sql
-- Replication lag in seconds on a replica (0 when fully caught up)
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;
```
## Disk Space Critical
```bash
# Check free space on the data volume
df -h /var/lib/postgresql/data

# Find the largest tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
  FROM pg_catalog.pg_statio_user_tables
  ORDER BY pg_total_relation_size(relid) DESC
  LIMIT 10;"

# Reclaim space from a bloated table (takes an exclusive lock; prefer a maintenance window)
psql -c "VACUUM FULL large_table;"
```
## Best Practices
### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering (see the sketch after this list)
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress
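For example, a lightweight game-day exercise against the payment service might look like the sketch below; the namespace, labels, and smoke-test script are reused from Template 1 and are otherwise illustrative.

```bash
# Game-day sketch: kill one pod, confirm the deployment self-heals, then run
# the same smoke test used in the Verification Steps section.
POD=$(kubectl get pods -n payments -l app=payment-service -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod "$POD" -n payments
kubectl rollout status deployment/payment-service -n payments --timeout=120s
./scripts/smoke-test-payments.sh
```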
### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident
## Resources
- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)