| name | chaos-engineering-resilience |
| description | Chaos engineering principles, controlled failure injection, resilience testing, and system recovery validation. Use when testing distributed systems, building confidence in fault tolerance, or validating disaster recovery. |
| category | specialized-testing |
| priority | high |
| tokenEstimate | 900 |
| agents | ["qe-chaos-engineer","qe-performance-tester","qe-production-intelligence"] |
| implementation_status | optimized |
| optimization_version | 1 |
| last_optimized | "2025-12-02T00:00:00.000Z" |
| dependencies | [] |
| quick_reference_card | true |
| tags | ["chaos","resilience","fault-injection","distributed-systems","recovery","netflix"] |
Chaos Engineering & Resilience Testing
<default_to_action>
When testing system resilience or injecting failures:
- DEFINE steady state (normal metrics: error rate, latency, throughput)
- HYPOTHESIZE that the system stays in steady state during the failure
- INJECT real-world failures (network, instance, disk, CPU)
- OBSERVE and measure deviation from steady state
- FIX weaknesses discovered, document runbooks, repeat
Quick Chaos Steps:
- Start small: Dev → Staging → 1% prod → gradual rollout
- Define clear rollback triggers (e.g., error_rate > 5%)
- Measure blast radius, never exceed planned scope
- Document findings → runbooks → improved resilience
Critical Success Factors:
- Controlled experiments with automatic rollback
- Steady state must be measurable
- Start in non-production, graduate to production
</default_to_action>
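The define, hypothesize, inject, observe loop above can be sketched as a small harness. This is a minimal illustration, not the API of any chaos tool: `Metrics`, `ExperimentHooks`, and `runExperiment` are hypothetical names, error rates are assumed to be fractions (0.05 = 5%), and the hooks stand in for whatever monitoring and injection tooling is actually in use.

```typescript
// Hypothetical types for illustration; not part of any real chaos tool.
interface Metrics {
  errorRate: number;    // fraction of failed requests, e.g. 0.001 = 0.1%
  p99LatencyMs: number; // 99th percentile latency in milliseconds
}

interface ExperimentHooks {
  readMetrics: () => Metrics; // sample current system metrics
  inject: () => void;         // start the failure
  rollback: () => void;       // stop the failure and restore the system
}

interface ExperimentResult {
  hypothesisHeld: boolean;  // true if every sample stayed in steady state
  rolledBackEarly: boolean; // true if the rollback trigger fired
  samples: Metrics[];
}

function runExperiment(
  hooks: ExperimentHooks,
  steadyState: Metrics,      // upper bounds that define "normal"
  rollbackErrorRate: number, // abort threshold, e.g. 0.05
  sampleCount: number
): ExperimentResult {
  const within = (m: Metrics): boolean =>
    m.errorRate <= steadyState.errorRate &&
    m.p99LatencyMs <= steadyState.p99LatencyMs;

  // DEFINE: refuse to start unless the system is already in steady state.
  if (!within(hooks.readMetrics())) {
    throw new Error('System is not in steady state; experiment not started');
  }

  const samples: Metrics[] = [];
  let rolledBackEarly = false;

  // INJECT, then OBSERVE; rollback always runs, even if sampling throws.
  hooks.inject();
  try {
    for (let i = 0; i < sampleCount; i++) {
      const sample = hooks.readMetrics();
      samples.push(sample);
      if (sample.errorRate > rollbackErrorRate) {
        rolledBackEarly = true;
        break;
      }
    }
  } finally {
    hooks.rollback();
  }

  return {
    hypothesisHeld: !rolledBackEarly && samples.every(within),
    rolledBackEarly,
    samples,
  };
}
```

Placing the rollback in a `finally` block is one way to keep the "automatic rollback" guarantee even when metric collection itself fails mid-experiment.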
Quick Reference Card
When to Use
- Distributed systems validation
- Disaster recovery testing
- Building confidence in fault tolerance
- Pre-production resilience verification
Failure Types to Inject
| Category | Failures | Tools |
|---|---|---|
| Network | Latency, packet loss, partition | tc, toxiproxy |
| Infrastructure | Instance kill, disk failure, CPU saturation | Chaos Monkey |
| Application | Exceptions, slow responses, leaks | Gremlin, LitmusChaos |
| Dependencies | Service outage, timeout | WireMock |
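For the application-level rows (exceptions, slow responses), a failure can also be injected in-process by wrapping a call. The sketch below is one illustrative approach and does not use any of the tools in the table: `FaultConfig`, `pickFault`, and `withChaos` are hypothetical names, and the random source is injectable so the behavior can be made deterministic in tests.

```typescript
// Hypothetical in-process fault injector; names are illustrative only.
interface FaultConfig {
  latencyMs: number;   // delay added to affected calls
  latencyRate: number; // fraction of calls that are delayed, 0..1
  errorRate: number;   // fraction of calls that throw, 0..1
}

type Fault =
  | { kind: 'none' }
  | { kind: 'delay'; ms: number }
  | { kind: 'error' };

// Map a roll in [0, 1) to a fault: errors first, then delays, then no fault.
function pickFault(config: FaultConfig, roll: number): Fault {
  if (roll < config.errorRate) return { kind: 'error' };
  if (roll < config.errorRate + config.latencyRate) {
    return { kind: 'delay', ms: config.latencyMs };
  }
  return { kind: 'none' };
}

// Wrap an async call so that a share of invocations fail or slow down.
function withChaos<T>(
  call: () => Promise<T>,
  config: FaultConfig,
  random: () => number = Math.random
): () => Promise<T> {
  return async () => {
    const fault = pickFault(config, random());
    if (fault.kind === 'error') {
      throw new Error('chaos: injected failure');
    }
    if (fault.kind === 'delay') {
      const ms = fault.ms;
      await new Promise<void>((resolve) => setTimeout(resolve, ms));
    }
    return call();
  };
}
```

A caller might wrap a client method, for example `withChaos(() => fetchOrders(), { latencyMs: 500, latencyRate: 0.2, errorRate: 0.1 })`, where `fetchOrders` is a placeholder for a real dependency call.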
Blast Radius Progression
Dev (safe) → Staging → 1% prod → 10% → 50% → 100%
    ↓           ↓         ↓                    ↓
  Learn      Validate   Careful          Full confidence
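The progression above can be encoded as an ordered list with a single promotion rule. This is a minimal sketch: the stage names and `nextStage` are hypothetical, and holding at the current stage after a failed run is an assumption (a team might instead choose to step back a stage).

```typescript
// Stage names mirror the progression above; identifiers are illustrative.
const STAGES = [
  'dev',
  'staging',
  'prod-1%',
  'prod-10%',
  'prod-50%',
  'prod-100%',
] as const;

type Stage = (typeof STAGES)[number];

// Promote one stage at a time, and only after the last run passed.
function nextStage(current: Stage, lastRunPassed: boolean): Stage {
  if (!lastRunPassed) {
    // Assumption: hold here until the weakness is fixed and the run repeats.
    return current;
  }
  const index = STAGES.indexOf(current);
  const nextIndex = Math.min(index + 1, STAGES.length - 1);
  return STAGES[nextIndex];
}
```

Never skipping a stage keeps each increase in blast radius backed by a passing run at the previous scope.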
Steady State Metrics
| Metric | Normal | Alert Threshold |
|---|---|---|
| Error rate | < 0.1% | > 1% |
| p99 latency | < 200ms | > 500ms |
| Throughput | baseline | -20% |
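The alert thresholds in the table can be checked mechanically. The sketch below hard-codes the table's values and assumes error rate is a fraction, latency is in milliseconds, and throughput is in requests per second; `Sample` and `checkAlerts` are hypothetical names.

```typescript
// Hypothetical metric sample; units are assumptions stated in the lead-in.
interface Sample {
  errorRate: number;     // fraction, e.g. 0.01 = 1%
  p99LatencyMs: number;  // milliseconds
  throughputRps: number; // requests per second
}

// Return the names of every metric that breaches its alert threshold.
function checkAlerts(sample: Sample, baselineThroughputRps: number): string[] {
  const alerts: string[] = [];
  if (sample.errorRate > 0.01) alerts.push('error-rate');      // > 1%
  if (sample.p99LatencyMs > 500) alerts.push('p99-latency');   // > 500ms
  if (sample.throughputRps < baselineThroughputRps * 0.8) {
    alerts.push('throughput');                                 // 20% below baseline
  }
  return alerts;
}
```

An empty result means the sample is inside the alert thresholds, which is looser than the "Normal" column; a stricter steady-state check would compare against those values instead.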
Chaos Experiment Structure
const experiment = {
  name: 'Database latency injection',
  hypothesis: 'System handles 500ms DB latency gracefully',
  steadyState: {
    errorRate: '< 0.1%',
    p99Latency: '< 300ms'
  },
  method: {
    type: 'network-latency',
    target: 'database',
    delay: '500ms',
    duration: '5m'
  },
  rollback: {
    automatic: true,
    trigger: 'errorRate > 5%'
  }
};
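The `trigger` and `steadyState` values in the structure above are strings, so something has to evaluate them. The sketch below shows one way to turn an expression such as `'errorRate > 5%'` into a predicate. `parseTrigger` is a hypothetical helper, the supported grammar (metric, comparison, number, optional `%` or `ms`) is an assumption, and percentages are converted to fractions on the assumption that metrics report error rates that way.

```typescript
// Hypothetical parser for simple threshold expressions like 'errorRate > 5%'.
function parseTrigger(
  trigger: string
): (metrics: Record<string, number>) => boolean {
  const pattern = /^\s*(\w+)\s*(<=|>=|<|>)\s*(\d+(?:\.\d+)?)\s*(%|ms)?\s*$/;
  const match = pattern.exec(trigger);
  if (!match) {
    throw new Error(`Unrecognized trigger: ${trigger}`);
  }
  const [, metric, op, rawValue, unit] = match;

  // Assumption: '%' values are compared against fractions (5% -> 0.05).
  const value = unit === '%' ? Number(rawValue) / 100 : Number(rawValue);

  return (metrics) => {
    const actual = metrics[metric];
    if (actual === undefined) return false; // unknown metric never triggers
    switch (op) {
      case '>':
        return actual > value;
      case '>=':
        return actual >= value;
      case '<':
        return actual < value;
      default:
        return actual <= value;
    }
  };
}
```

For example, `parseTrigger(experiment.rollback.trigger)` yields a function that can be called on each metrics sample to decide whether to roll back.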
Agent-Driven Chaos
await Task("Chaos Experiment", {
  target: 'payment-service',
  failure: 'terminate-random-instance',
  blastRadius: '10%',
  duration: '5m',
  steadyStateHypothesis: {
    metric: 'success-rate',
    threshold: 0.99
  },
  autoRollback: true
}, "qe-chaos-engineer");
Agent Coordination Hints
Memory Namespace
aqe/chaos-engineering/
├── experiments/*     - Experiment definitions & results
├── steady-states/*   - Baseline measurements
├── runbooks/*        - Generated recovery procedures
└── blast-radius/*    - Impact analysis
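Keys under this namespace can be built with a small helper so that agents do not hand-assemble paths. This is an illustrative sketch only: `memoryKey` is a hypothetical function, and the rule that ids must not contain `/` is an assumption, not a documented constraint of the memory store.

```typescript
// Namespace root and sections are taken from the layout above.
const NAMESPACE = 'aqe/chaos-engineering';

type Section = 'experiments' | 'steady-states' | 'runbooks' | 'blast-radius';

// Build a key such as 'aqe/chaos-engineering/runbooks/db-latency'.
function memoryKey(section: Section, id: string): string {
  // Assumption: reject empty ids and path separators so an id stays in its section.
  if (id.length === 0 || id.includes('/')) {
    throw new Error(`Invalid id: "${id}"`);
  }
  return `${NAMESPACE}/${section}/${id}`;
}
```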
Fleet Coordination
const chaosFleet = await FleetManager.coordinate({
  strategy: 'chaos-engineering',
  agents: [
    'qe-chaos-engineer',
    'qe-performance-tester',
    'qe-production-intelligence'
  ],
  topology: 'sequential'
});
Remember
Break things on purpose to prevent unplanned outages. Find weaknesses before users do. Define steady state, inject failures, measure impact, fix weaknesses, create runbooks. Start small, increase blast radius gradually.
With Agents: qe-chaos-engineer automates chaos experiments with blast radius control, automatic rollback, and comprehensive resilience validation, and it generates runbooks from experiment results.