// Deployment procedures, health checks, and rollback strategies. Use this skill when deploying applications, performing health checks, managing releases, or handling deployment failures. Provides systematic deployment workflows, verification scripts, and troubleshooting guides. Complements the devops-automation agent.
| name | deployment-runbook |
| description | Deployment procedures, health checks, and rollback strategies. Use this skill when deploying applications, performing health checks, managing releases, or handling deployment failures. Provides systematic deployment workflows, verification scripts, and troubleshooting guides. Complements the devops-automation agent. |
This skill provides deployment procedures, automated health checks, and rollback strategies to ensure safe, reliable deployments. Use it to standardize deployment workflows and reduce deployment-related incidents.
Before any production deployment:
Best for: Zero-downtime deployments, easy rollbacks
Process:
Commands:
# Deploy to green environment
./deploy.sh --env green
# Run health checks
python3 scripts/health_check.py --env green
# Switch traffic (gradual)
./switch_traffic.sh --from blue --to green --percentage 10
./switch_traffic.sh --from blue --to green --percentage 50
./switch_traffic.sh --from blue --to green --percentage 100
# If issues: instant rollback
./switch_traffic.sh --from green --to blue --percentage 100
Best for: Risk-averse deployments, gradual rollouts
Process:
Monitoring During Canary:
Best for: Standard updates, resource-constrained environments
Process:
# 1. Verify staging environment
./verify_staging.sh
# 2. Create deployment tag
git tag -a v1.2.3 -m "Release 1.2.3"
git push origin v1.2.3
# 3. Trigger production build
./build_production.sh --tag v1.2.3
# 4. Backup database
./backup_db.sh --environment production
# 5. Notify team
./notify_slack.sh "๐ Starting deployment v1.2.3 in 30 minutes"
# 1. Enable maintenance mode (if needed)
./maintenance_mode.sh --enable
# 2. Run database migrations
./run_migrations.sh --environment production
# 3. Deploy application
./deploy.sh --environment production --version v1.2.3
# 4. Disable maintenance mode
./maintenance_mode.sh --disable
# Run comprehensive health checks
python3 scripts/health_check.py --environment production
# Expected output:
# โ API health endpoint responding
# โ Database connectivity OK
# โ Cache layer accessible
# โ External services reachable
# โ Error rate within threshold
# โ Response time within SLA
Monitor these metrics:
Application Metrics:
Infrastructure Metrics:
Business Metrics:
Rollback immediately if:
# Blue-green: instant rollback
./switch_traffic.sh --from green --to blue --percentage 100
# Verification
python3 scripts/health_check.py --environment production
# Deploy previous version
./deploy.sh --environment production --version v1.2.2
# Run health checks
python3 scripts/health_check.py --environment production
# If migrations were applied
./rollback_migration.sh --environment production --steps 1
# Restore from backup (last resort)
./restore_db.sh --backup latest --environment production
Verify system health
python3 scripts/health_check.py --environment production
Notify stakeholders
./notify_slack.sh "โ ๏ธ Deployment v1.2.3 rolled back. System stable on v1.2.2"
Create postmortem
Use the included health check script:
# Run all checks
python3 scripts/health_check.py --env production
# Run specific check
python3 scripts/health_check.py --env production --check api
# Verbose output
python3 scripts/health_check.py --env production --verbose
See scripts/health_check.py for implementation.
Symptoms:
Diagnosis:
# Check service logs
kubectl logs -f deployment/app-name
# Check events
kubectl get events --sort-by='.lastTimestamp'
Resolution:
Symptoms:
Diagnosis:
# Check application logs
tail -f /var/log/app/error.log
# Check error distribution
grep "ERROR" /var/log/app/* | awk '{print $NF}' | sort | uniq -c | sort -nr
Resolution:
Symptoms:
Diagnosis:
# Test database connectivity
python3 scripts/test_db_connection.py
# Check connection pool
psql -h db-host -U user -c "SELECT * FROM pg_stat_activity;"
Resolution:
๐ **Production Deployment Scheduled**
**Version**: v1.2.3
**Time**: 2024-01-15 14:00 UTC (30 minutes)
**Duration**: ~15 minutes
**Impact**: No expected downtime
**Changes**:
- Feature: New user dashboard
- Fix: Payment processing bug
- Performance: API response time improvements
**Rollback Plan**: Blue-green switch (instant)
**On-Call**: @engineer-name
โ
**Deployment Complete**
**Version**: v1.2.3
**Status**: Successful
**Duration**: 12 minutes
**Health Checks**: All passing โ
**Metrics**: Within normal range
**Next Check**: T+30 minutes
Monitoring dashboard: [link]
โ ๏ธ **Deployment Rolled Back**
**Version**: v1.2.3 โ v1.2.2 (rollback)
**Reason**: Elevated error rate (2.1%)
**Status**: System stable on v1.2.2
**Action Items**:
- [ ] Root cause analysis
- [ ] Fix identified issue
- [ ] Re-test in staging
- [ ] Schedule re-deployment
Incident report: [link]
Emergency Rollback:
./switch_traffic.sh --from green --to blue --percentage 100
Health Check:
python3 scripts/health_check.py --env production
View Logs:
kubectl logs -f deployment/app-name --tail=100
Check Metrics:
curl https://metrics.example.com/api/health