| Field | Value |
|---|---|
| name | qa-resilience |
| description | Production resilience patterns with circuit breakers, retry strategies, bulkheads, timeouts, graceful degradation, health checks, and chaos engineering for fault-tolerant distributed systems. |
This skill provides execution-ready patterns for building resilient, fault-tolerant systems that handle failures gracefully. Claude should apply these patterns when users need error handling strategies, circuit breakers, retry logic, or production hardening.
Modern Best Practices (2025): Circuit breaker pattern, exponential backoff, bulkhead isolation, timeout policies, graceful degradation, health check design, chaos engineering, and observability-driven resilience.
Claude should invoke this skill when a user requests any of the resilience patterns summarized in the table below:
| Pattern | Library/Tool | When to Use | Configuration |
|---|---|---|---|
| Circuit Breaker | Opossum (Node.js), pybreaker (Python) | External API calls, database connections | Threshold: 50%, timeout: 30s, volume: 10 |
| Retry with Backoff | p-retry (Node.js), tenacity (Python) | Transient failures, rate limits | Max retries: 5, exponential backoff with jitter |
| Bulkhead Isolation | Semaphore pattern, thread pools | Prevent resource exhaustion | Pool size based on workload (CPU cores + wait/service time) |
| Timeout Policies | AbortSignal, statement timeout | Slow dependencies, database queries | Connection: 5s, API: 10-30s, DB query: 5-10s |
| Graceful Degradation | Feature flags, cached fallback | Non-critical features, ML recommendations | Cache recent data, default values, reduced functionality |
| Health Checks | Kubernetes probes, /health endpoints | Service orchestration, load balancing | Liveness: simple, readiness: dependency checks, startup: slow apps |
| Chaos Engineering | Chaos Toolkit, Netflix Chaos Monkey | Proactive resilience testing | Start non-prod, define hypothesis, automate failure injection |
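The circuit-breaker and retry rows above compose naturally around a single outbound call. Below is a minimal Python sketch using pybreaker and tenacity with values close to the table's configuration; `fetch_profile` and the URL are placeholders, and note that pybreaker opens on consecutive failures rather than a 50% error rate.

```python
# Minimal sketch: circuit breaker (pybreaker) wrapping a retried call (tenacity).
import requests
import pybreaker
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)

# Open the circuit after 5 consecutive failures, probe again after 30 seconds.
api_breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=30)

@api_breaker  # outermost: fail fast while the circuit is open
@retry(       # innermost: retry transient failures with exponential backoff + jitter
    stop=stop_after_attempt(5),
    wait=wait_random_exponential(multiplier=1, max=30),
    retry=retry_if_exception_type(requests.exceptions.RequestException),
    reraise=True,
)
def fetch_profile(user_id: str) -> dict:
    # Placeholder outbound call; the breaker only records a failure after
    # all retries for one logical request are exhausted.
    resp = requests.get(f"https://api.example.com/users/{user_id}", timeout=10)
    resp.raise_for_status()
    return resp.json()
```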
Failure scenario: [System Dependency Type]
├── External API/Service?
│   ├── Transient errors? → Retry with exponential backoff + jitter
│   ├── Cascading failures? → Circuit breaker + fallback
│   ├── Rate limiting? → Retry, respecting the Retry-After header
│   └── Slow response? → Timeout + circuit breaker
│
├── Database Dependency?
│   ├── Connection pool exhaustion? → Bulkhead isolation + timeout
│   ├── Query timeout? → Statement timeout (5-10s)
│   ├── Replica lag? → Fall back to reading from the primary
│   └── Connection failures? → Retry + circuit breaker
│
├── Non-Critical Feature?
│   ├── ML recommendations? → Feature flag + default-values fallback
│   ├── Search service? → Cached results or basic SQL fallback
│   ├── Email/notifications? → Log the error, don't block the main flow
│   └── Analytics? → Fire-and-forget, circuit breaker for protection
│
├── Kubernetes/Orchestration?
│   ├── Service discovery? → Liveness + readiness + startup probes
│   ├── Slow startup? → Startup probe (failureThreshold: 30)
│   ├── Load balancing? → Readiness probe (check dependencies)
│   └── Auto-restart? → Liveness probe (simple check)
│
└── Testing Resilience?
    ├── Pre-production? → Chaos Toolkit experiments
    ├── Production (low risk)? → Feature flags + canary deployments
    ├── Scheduled testing? → Game days (quarterly)
    └── Continuous chaos? → Netflix Chaos Monkey (1% failure injection)
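For the database and slow-dependency branches above, a bulkhead (bounded concurrency) plus a hard per-call timeout keeps one slow dependency from exhausting every worker. A minimal asyncio sketch, where `query_replica` is a placeholder for the real driver call and the concrete limits should come from the sizing guidance in the pattern table:

```python
# Minimal sketch: bulkhead isolation with a semaphore plus a hard timeout.
import asyncio

DB_BULKHEAD = asyncio.Semaphore(20)   # at most 20 concurrent DB calls from this process
DB_TIMEOUT_S = 5                      # aligned with the 5-10s query-timeout guidance

async def query_with_bulkhead(sql: str, *params):
    async with DB_BULKHEAD:                    # excess callers queue here instead of piling onto the DB
        return await asyncio.wait_for(         # fail fast rather than holding the slot indefinitely
            query_replica(sql, *params),
            timeout=DB_TIMEOUT_S,
        )

async def query_replica(sql: str, *params):
    """Placeholder for the real async database call (e.g. an asyncpg/aiomysql query)."""
    raise NotImplementedError
```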
- Circuit Breaker Patterns - Prevent cascading failures
- Retry Patterns - Handle transient failures
- Bulkhead Isolation - Resource compartmentalization
- Timeout Policies - Prevent resource exhaustion
- Graceful Degradation - Maintain partial functionality (see the sketch after this list)
- Health Check Patterns - Service availability monitoring
- Resilience Checklists - Production hardening checklists
- Chaos Engineering Guide - Safe reliability experiments
- Resilience Runbook Template - Service hardening profile
- Fault Injection Playbook - Chaos testing script
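Graceful degradation for a non-critical feature typically combines a feature flag, a cache of the last good response, and a static default. A minimal sketch, where `is_enabled`, `recommendation_client`, and `POPULAR_ITEMS` are placeholder names rather than a specific library's API:

```python
# Minimal sketch: feature flag + cached fallback for a non-critical feature.
import logging

logger = logging.getLogger(__name__)
POPULAR_ITEMS = ["item-1", "item-2", "item-3"]   # static default fallback
_last_good: dict[str, list[str]] = {}            # last successful response per user

def is_enabled(flag: str) -> bool:
    """Placeholder for your feature-flag provider (env var, LaunchDarkly, Unleash, ...)."""
    return True

class _RecommendationClient:
    """Placeholder for the real ML recommendation client."""
    def fetch(self, user_id: str, timeout: int) -> list[str]:
        raise TimeoutError("upstream unavailable")

recommendation_client = _RecommendationClient()

def get_recommendations(user_id: str) -> list[str]:
    if not is_enabled("ml-recommendations"):      # flag off: skip the dependency entirely
        return POPULAR_ITEMS
    try:
        items = recommendation_client.fetch(user_id, timeout=2)
        _last_good[user_id] = items               # remember the last good answer
        return items
    except Exception:                             # degrade; never block the main flow
        logger.warning("recommendations unavailable, serving fallback", exc_info=True)
        return _last_good.get(user_id, POPULAR_ITEMS)
```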
| Scenario | Recommendation |
|---|---|
| External API calls | Circuit breaker + retry with exponential backoff |
| Database queries | Timeout + connection pooling + circuit breaker |
| Slow dependency | Bulkhead isolation + timeout |
| Non-critical feature | Feature flag + graceful degradation |
| Kubernetes deployment | Liveness + readiness + startup probes |
| Testing resilience | Chaos engineering experiments |
| Transient failures | Retry with exponential backoff + jitter |
| Cascading failures | Circuit breaker + bulkhead |
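For the Kubernetes row above, liveness and readiness belong on separate endpoints: liveness stays trivial so the pod is not restarted unnecessarily, while readiness checks dependencies so the pod is pulled from load balancing when they fail. A minimal Flask sketch; `check_database` is a placeholder for a cheap, time-bounded dependency ping:

```python
# Minimal sketch: separate liveness and readiness endpoints for Kubernetes probes.
from flask import Flask, jsonify

app = Flask(__name__)

def check_database() -> bool:
    """Placeholder dependency check; keep it cheap (e.g. SELECT 1) with a short timeout."""
    return True

@app.get("/health/live")
def liveness():
    # Liveness stays simple: if the process can answer, do not restart it.
    return jsonify(status="ok"), 200

@app.get("/health/ready")
def readiness():
    # Readiness gates traffic on dependencies; a 503 removes the pod from
    # load balancing without triggering a restart.
    if check_database():
        return jsonify(status="ready"), 200
    return jsonify(status="degraded"), 503

if __name__ == "__main__":
    app.run(port=8080)
```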
Pattern Selection:
Observability:
Testing:
Success Criteria: Systems gracefully handle failures, recover automatically, maintain partial functionality during outages, and fail fast to prevent cascading failures. Resilience is tested proactively through chaos engineering.
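As a lightweight on-ramp to fault injection before adopting Chaos Toolkit or Chaos Monkey, the sketch below uses a hypothetical decorator that fails a configurable fraction of calls, echoing the 1% injection rate mentioned above. It is an illustration only, not either tool's API, and should first be exercised outside production:

```python
# Minimal sketch: probabilistic fault injection for chaos experiments.
# The decorator and CHAOS_ENABLED variable are hypothetical conventions, not a real tool's API.
import functools
import os
import random

class InjectedFault(RuntimeError):
    """Raised when a failure is deliberately injected."""

def inject_faults(rate: float = 0.01):
    """Fail roughly `rate` of calls when CHAOS_ENABLED=1 is set in the environment."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if os.environ.get("CHAOS_ENABLED") == "1" and random.random() < rate:
                raise InjectedFault(f"chaos: injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(rate=0.01)
def charge_payment(order_id: str) -> str:
    return f"charged {order_id}"   # placeholder business logic
```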