| Field | Value |
|---|---|
| name | qa-resilience |
| description | Production resilience patterns with circuit breakers, retry strategies, bulkheads, timeouts, graceful degradation, health checks, and chaos engineering for fault-tolerant distributed systems. |
This skill provides execution-ready patterns for building resilient, fault-tolerant systems that handle failures gracefully. Claude should apply these patterns when users need error handling strategies, circuit breakers, retry logic, or production hardening.
Modern Best Practices (2025): Circuit breaker pattern, exponential backoff, bulkhead isolation, timeout policies, graceful degradation, health check design, chaos engineering, and observability-driven resilience.
Claude should invoke this skill when a user requests error handling strategies, circuit breaker or retry design, timeout and fallback policies, health checks, or chaos/resilience testing. The quick-reference table below maps each pattern to typical tooling and baseline configuration:
| Pattern | Library/Tool | When to Use | Configuration |
|---|---|---|---|
| Circuit Breaker | Opossum (Node.js), pybreaker (Python) | External API calls, database connections | Error threshold: 50%, reset timeout: 30s, minimum request volume: 10 |
| Retry with Backoff | p-retry (Node.js), tenacity (Python) | Transient failures, rate limits | Max retries: 5, exponential backoff with jitter |
| Bulkhead Isolation | Semaphore pattern, thread pools | Prevent resource exhaustion | Pool size ≈ cores × (1 + wait time / service time) |
| Timeout Policies | AbortSignal, statement timeout | Slow dependencies, database queries | Connection: 5s, API: 10-30s, DB query: 5-10s |
| Graceful Degradation | Feature flags, cached fallback | Non-critical features, ML recommendations | Cache recent data, default values, reduced functionality |
| Health Checks | Kubernetes probes, /health endpoints | Service orchestration, load balancing | Liveness: simple, readiness: dependency checks, startup: slow apps |
| Chaos Engineering | Chaos Toolkit, Netflix Chaos Monkey | Proactive resilience testing | Start non-prod, define hypothesis, automate failure injection |
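As a concrete illustration of the circuit-breaker row above, here is a minimal Python sketch using pybreaker. Note that pybreaker trips after a run of consecutive failures (`fail_max`) rather than the 50% error-rate threshold opossum uses; the endpoint URL, `fetch_profile`, and the default fallback are placeholders.

```python
import pybreaker
import requests

# pybreaker opens after `fail_max` consecutive failures and stays open for
# `reset_timeout` seconds before letting a trial call through (half-open).
api_breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=30)

PROFILE_API_URL = "https://api.example.com/profiles"  # placeholder endpoint

@api_breaker
def fetch_profile(user_id: str) -> dict:
    response = requests.get(f"{PROFILE_API_URL}/{user_id}", timeout=10)
    response.raise_for_status()
    return response.json()

def get_profile_or_default(user_id: str) -> dict:
    try:
        return fetch_profile(user_id)
    except pybreaker.CircuitBreakerError:
        # Breaker is open: fail fast and serve a degraded default instead of
        # queueing more calls against a dependency that is already struggling.
        return {"id": user_id, "name": "unknown"}
```

While the breaker is open, calls raise `CircuitBreakerError` immediately, which is what lets callers fail fast instead of stacking up behind a dead dependency.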
Failure scenario: [System Dependency Type]
├─ External API/Service?
│ ├─ Transient errors? → Retry with exponential backoff + jitter
│ ├─ Cascading failures? → Circuit breaker + fallback
│ ├─ Rate limiting? → Retry with Retry-After header respect
│ └─ Slow response? → Timeout + circuit breaker
│
├─ Database Dependency?
│ ├─ Connection pool exhaustion? → Bulkhead isolation + timeout
│ ├─ Query timeout? → Statement timeout (5-10s)
│ ├─ Replica lag? → Read from primary fallback
│ └─ Connection failures? → Retry + circuit breaker
│
├─ Non-Critical Feature?
│ ├─ ML recommendations? → Feature flag + default values fallback
│ ├─ Search service? → Cached results or basic SQL fallback
│ ├─ Email/notifications? → Log error, don't block main flow
│ └─ Analytics? → Fire-and-forget, circuit breaker for protection
│
├─ Kubernetes/Orchestration?
│ ├─ Service discovery? → Liveness + readiness + startup probes
│ ├─ Slow startup? → Startup probe (failureThreshold: 30)
│ ├─ Load balancing? → Readiness probe (check dependencies)
│ └─ Auto-restart? → Liveness probe (simple check)
│
└─ Testing Resilience?
├─ Pre-production? → Chaos Toolkit experiments
├─ Production (low risk)? → Feature flags + canary deployments
├─ Scheduled testing? → Game days (quarterly)
└─ Continuous chaos? → Netflix Chaos Monkey (1% failure injection)
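For the transient-error branch above, a minimal retry sketch with tenacity using exponential backoff plus full jitter. `fetch_rates` and the URL are placeholders; honoring a `Retry-After` header for rate limits, as the tree suggests, would require a custom wait strategy and is not shown here.

```python
import requests
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)

@retry(
    # Only retry errors that are plausibly transient; bad requests should fail.
    retry=retry_if_exception_type((requests.ConnectionError, requests.Timeout)),
    wait=wait_random_exponential(multiplier=1, max=30),  # backoff with jitter, capped at 30s
    stop=stop_after_attempt(5),                          # give up after 5 attempts
    reraise=True,
)
def fetch_rates() -> dict:
    response = requests.get("https://api.example.com/rates", timeout=10)
    response.raise_for_status()
    return response.json()
```

Jitter matters because many clients retrying on the same schedule re-synchronize their traffic and can knock a recovering dependency back over.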
Reference guides and templates:
- Circuit Breaker Patterns - Prevent cascading failures
- Retry Patterns - Handle transient failures
- Bulkhead Isolation - Resource compartmentalization (see the semaphore sketch after this list)
- Timeout Policies - Prevent resource exhaustion
- Graceful Degradation - Maintain partial functionality
- Health Check Patterns - Service availability monitoring
- Resilience Checklists - Production hardening checklists
- Chaos Engineering Guide - Safe reliability experiments
- Resilience Runbook Template - Service hardening profile
- Fault Injection Playbook - Chaos testing script
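To accompany the Bulkhead Isolation reference, here is a sketch of the semaphore pattern from the quick-reference table: an asyncio semaphore caps how many concurrent calls a single dependency may consume, so one slow dependency cannot absorb every worker. The pool size of 10 and `execute_query` are illustrative placeholders.

```python
import asyncio

# Dedicated "compartment" for the reporting database; other dependencies get
# their own semaphores so a slow dependency cannot starve the rest.
REPORT_DB_BULKHEAD = asyncio.Semaphore(10)

async def run_report_query(sql: str) -> list:
    async with REPORT_DB_BULKHEAD:           # excess callers wait here
        return await asyncio.wait_for(execute_query(sql), timeout=10)

async def execute_query(sql: str) -> list:
    await asyncio.sleep(0.1)                 # placeholder for a real async DB call
    return []
```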
| Scenario | Recommendation |
|---|---|
| External API calls | Circuit breaker + retry with exponential backoff |
| Database queries | Timeout + connection pooling + circuit breaker |
| Slow dependency | Bulkhead isolation + timeout |
| Non-critical feature | Feature flag + graceful degradation |
| Kubernetes deployment | Liveness + readiness + startup probes |
| Testing resilience | Chaos engineering experiments |
| Transient failures | Retry with exponential backoff + jitter |
| Cascading failures | Circuit breaker + bulkhead |
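For the Kubernetes row above, the application side of the probe contract can be two small endpoints. This standard-library sketch assumes the conventional `/healthz` and `/readyz` paths and a placeholder `check_database` dependency check; in a real service the readiness check would verify each critical dependency with a short timeout.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database() -> bool:
    return True  # placeholder: e.g. a SELECT 1 with a 1-2s timeout

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: keep this trivial; a failing dependency should not
            # cause the orchestrator to restart an otherwise healthy process.
            self.send_response(200)
        elif self.path == "/readyz":
            # Readiness: only accept traffic when dependencies are reachable.
            self.send_response(200 if check_database() else 503)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```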
Pattern Selection: Combine patterns rather than relying on a single one; for external dependencies, pair timeouts with retries and a circuit breaker, and add a fallback path for non-critical features (see the composition sketch below).
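As a sketch of that composition, assuming a hypothetical `get_recommendations` call and a simple in-process cache, a tight timeout wrapped around a non-critical feature with a cached fallback:

```python
import asyncio

CACHE: dict[str, list] = {}  # last known-good recommendations per user

async def recommendations_with_fallback(user_id: str) -> list:
    try:
        result = await asyncio.wait_for(get_recommendations(user_id), timeout=2)
        CACHE[user_id] = result
        return result
    except (asyncio.TimeoutError, ConnectionError):
        # Degrade gracefully: serve stale results or an empty default rather
        # than blocking the main flow on a non-critical dependency.
        return CACHE.get(user_id, [])

async def get_recommendations(user_id: str) -> list:
    await asyncio.sleep(0.05)  # placeholder for the real ML service call
    return ["item-1", "item-2"]
```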
Observability: Instrument the resilience mechanisms themselves; emit circuit breaker state changes, retry counts, timeout rates, and fallback activations so degradation is visible before it cascades (see the listener sketch below).
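One way to surface breaker behavior, sketched with pybreaker's listener hook; the breaker name and plain logging are assumptions, and in a real service the log call would typically be replaced or accompanied by a metrics counter:

```python
import logging
import pybreaker

log = logging.getLogger("resilience")

class LogStateChanges(pybreaker.CircuitBreakerListener):
    def state_change(self, cb, old_state, new_state):
        # e.g. "breaker payments-api: closed -> open"
        log.warning("breaker %s: %s -> %s", cb.name, old_state.name, new_state.name)

payments_breaker = pybreaker.CircuitBreaker(
    fail_max=5,
    reset_timeout=30,
    name="payments-api",
    listeners=[LogStateChanges()],
)
```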
Testing: Verify failure handling proactively; start chaos experiments in non-production with an explicit hypothesis, then move to low-risk production testing behind feature flags, canary deployments, and scheduled game days (see the fault-injection sketch below).
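The snippet below is only a toy fault injector for pre-production experiments, not a substitute for Chaos Toolkit or Chaos Monkey; it mirrors the 1% failure-injection idea with a hypothetical decorator that should stay disabled outside test environments.

```python
import functools
import random

def inject_faults(failure_rate: float = 0.01, exc: type = ConnectionError):
    """Randomly fail a fraction of calls so fallbacks and alerts can be verified."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise exc("chaos: injected failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(failure_rate=0.01)
def fetch_upstream() -> dict:
    return {"status": "ok"}  # placeholder for a real dependency call
```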
Success Criteria: Systems gracefully handle failures, recover automatically, maintain partial functionality during outages, and fail fast to prevent cascading failures. Resilience is tested proactively through chaos engineering.