reliability-improvement-plan
Identify single points of failure, assess recovery capabilities, and produce a prioritized remediation plan aligned with the Well-Architected Reliability pillar.
| Field | Value |
|-------|-------|
| name | reliability-improvement-plan |
| description | Identify single points of failure, assess recovery capabilities, and produce a prioritized remediation plan aligned with the Well-Architected Reliability pillar. |
| version | 1.1.0 |
Ask the user:
What workload would you like me to assess for reliability? Please share:
- Architecture overview (services, regions, AZs, dependencies)
- Availability target (99.9%, 99.95%, 99.99%, etc.)
- Recovery objectives (RTO and RPO if defined)
- Past incidents (optional: recent outages or near-misses)
If context is already provided, proceed directly.
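When the user states an availability target, it can help to translate the percentage into a concrete downtime budget before assessing the architecture. The following is a minimal sketch, not part of the skill itself; the function name `downtime_budget_minutes` and the 365-day year are assumptions made for illustration.

```python
# Illustrative helper (hypothetical name): convert an availability target
# into the downtime it permits over a period.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime allowed in the period for the given target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # The targets listed in the prompt above.
    for target in (99.9, 99.95, 99.99):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h/year)")
```

For example, a 99.9% target allows roughly 525.6 minutes (about 8.76 hours) of downtime per year, while 99.99% allows under an hour. Comparing that budget with the stated RTO shows how many full-length outages the target can absorb.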
For each component, ask: "What happens if this fails?"
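One rough way to apply the "What happens if this fails?" question to an inventory is sketched below. It is only a heuristic under stated assumptions: the `Component` data model, the `find_spof_candidates` function, and the two rules (fewer than two instances, or all instances in one AZ) are hypothetical and are not defined by the skill. Regional managed services would need different treatment, since their redundancy is not expressed as instance counts.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical inventory record for a zonal resource."""
    name: str
    instances: int
    azs: set[str] = field(default_factory=set)


def find_spof_candidates(components: list[Component]) -> list[tuple[str, str]]:
    """Return (name, reason) pairs for components one failure could take down."""
    findings = []
    for c in components:
        if c.instances < 2:
            # Losing the only instance removes the component entirely.
            findings.append((c.name, "single instance"))
        elif len(c.azs) < 2:
            # Redundant instances do not help if they share one AZ.
            findings.append((c.name, "all instances in one AZ"))
    return findings


if __name__ == "__main__":
    # Example data is invented for illustration.
    workload = [
        Component("web", 3, {"us-east-1a", "us-east-1b"}),
        Component("db", 1, {"us-east-1a"}),
        Component("cache", 2, {"us-east-1a"}),
    ]
    for name, reason in find_spof_candidates(workload):
        print(f"{name}: {reason}")
```

The output is a list of candidates only. Each one still needs the manual judgment described in this step, such as blast radius and existing mitigation, before it is assigned a severity.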
Classify each SPOF by severity:
Check for:
If the workload is a data pipeline (S3/Lambda/Step Functions/Glue/EMR/Kinesis/Kafka/Redshift):
Evaluate:
Assess:
Evaluate:
Output:
# Reliability Improvement Plan: {Workload Name}
## Summary
- **Date**: {date}
- **Availability target**: {target}
- **Estimated current availability**: {estimate}
- **RTO**: {current} → {target}
- **RPO**: {current} → {target}
- **Findings**: {X} High Risk, {Y} Medium Risk, {Z} Low Risk
## Reliability Scorecard
| Domain | Score (1-5) | Key Gap |
|--------|-------------|---------|
| Fault Tolerance | {score} | {gap} |
| Recovery & Backup | {score} | {gap} |
| Scaling & Capacity | {score} | {gap} |
| Change Management | {score} | {gap} |
| Testing & Validation | {score} | {gap} |
## Single Points of Failure
| Component | Severity | Failure Impact | Current Mitigation | Gap | AWS Service to Fix |
|-----------|----------|---------------|-------------------|-----|-------------------|
| {component} | 🔴/🟡/🟢 | {impact} | {mitigation or "None"} | {what's missing} | {service} |
## High Risk Findings
{Each: SPOF description, blast radius, recommendation, AWS services, effort}
## Remediation Plan
### Quick Wins (< 1 week)
{Low-effort high-impact: enable backups, turn on Multi-AZ, add health checks}
### Foundation (1-4 weeks)
{Multi-AZ compute, auto-scaling, circuit breakers, deployment safety}
### Advanced (1-3 months)
{Multi-region, chaos engineering, automated failover drills}
## Architecture Recommendations
{Specific changes: multi-AZ, read replicas, circuit breakers, async patterns, etc.}
## Testing Plan
| Test | What it validates | Frequency | AWS Service |
|------|-------------------|-----------|-------------|
| AZ failover drill | Compute continues in remaining AZs | Monthly | FIS |
| Database failover | RDS/Aurora failover completes in < 60s | Quarterly | FIS |
| Load test | Capacity handles 2x peak load | Before releases | Distributed Load Testing |
| Backup restore | RPO is met, data is recoverable | Monthly | AWS Backup |
| Deployment rollback | Bad deploy is reverted in < 5 min | Every deploy | CodeDeploy |
## Next Steps
{Concrete actions the team should take this week}
After delivering the plan, offer:
Would you like me to:
- Design the multi-AZ architecture in detail?
- Create a chaos engineering experiment plan using AWS FIS?
- Build a failover testing runbook?
- Estimate the cost of the reliability improvements?
- Design circuit breaker patterns for your service dependencies?