| name | deployment-workflow |
| description | Guides production deployment workflow with safety checks and rollback procedures. Use when deploying applications to staging or production environments. |
| version | 1.0.0 |
| author | Platform Team |
| category | custom |
| token_estimate | ~3500 |
<when_to_use> Use this skill when:
Do NOT use this skill when:
Verify all prerequisites are met before starting deployment:
Code Readiness:
# Verify CI/CD pipeline passed
gh run list --branch main --limit 1 --json status,conclusion
# Expected: status=completed, conclusion=success
Staging Validation:
# Check staging deployment status
kubectl get deployment -n staging
kubectl get pods -n staging | grep -v Running
# Should see all pods Running, no errors
Infrastructure Health:
# Verify production cluster health
kubectl cluster-info
kubectl get nodes
kubectl top nodes
# All nodes should be Ready with reasonable resource usage
Checklist:
Set up monitoring and prepare rollback resources:
1. Create Deployment Tracking:
# Create deployment tracking issue or ticket
# Document: version being deployed, key changes, rollback steps
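If you track deployments as GitHub issues, a minimal sketch with the GitHub CLI (repo name, label, and body text are placeholders):
# Hypothetical tracking issue; adapt to your ticketing system
gh issue create --repo company/service-name \
  --title "Deploy service-name v[version] to production" \
  --body "Changes: [summary]. Rollback: kubectl rollout undo deployment/service-name -n production" \
  --label deployment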
2. Set Up Monitoring Dashboard:
# Open monitoring dashboards:
# - Application metrics (latency, error rate, throughput)
# - Infrastructure metrics (CPU, memory, disk)
# - Business metrics (user activity, transaction success rate)
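A small convenience sketch, assuming dashboards live at known URLs (both URLs are placeholders; use xdg-open on Linux):
# Open the relevant dashboards before starting
open https://monitoring.example.com/dashboards/service-name
open https://monitoring.example.com/dashboards/infrastructure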
3. Notify Team:
# Post in team channel:
# "๐ Starting production deployment of [service-name] v[version]
# Changes: [brief description]
# ETA: [estimated time]
# Monitoring: [dashboard link]"
4. Verify Rollback Resources:
# Confirm previous version artifacts are available
docker pull your-registry/service-name:previous-version
# Verify database backups are recent
# Check that rollback procedures are accessible
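Because kubectl rollout undo depends on retained ReplicaSet revisions, confirm they exist before you need them:
# Confirm prior revisions are retained (rollout undo relies on these)
kubectl rollout history deployment/service-name -n production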
Execute Deployment
Deploy using your deployment method (examples provided for common scenarios):
Kubernetes Rolling Update:
# Update image tag in deployment
kubectl set image deployment/service-name \
service-name=your-registry/service-name:new-version \
-n production
# Monitor rollout
kubectl rollout status deployment/service-name -n production
# Watch pods coming up
kubectl get pods -n production -l app=service-name -w
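Optionally, record a change cause at this point so later kubectl rollout history output identifies each revision (a sketch; the message text is yours to choose):
# Record why this revision exists; shows up in `kubectl rollout history`
kubectl annotate deployment/service-name \
  kubernetes.io/change-cause="deploy new-version" \
  -n production --overwrite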
Blue-Green Deployment:
# Deploy green version
kubectl apply -f deployment-green.yaml -n production
# Wait for green to be ready
kubectl wait --for=condition=ready pod \
-l app=service-name,version=green \
-n production \
--timeout=300s
# Switch traffic to green
kubectl patch service service-name -n production \
-p '{"spec":{"selector":{"version":"green"}}}'
# Monitor for 5-10 minutes before cleaning up blue
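If metrics degrade during that observation window, rollback is a single selector patch back to blue:
# Instant rollback: point the service selector back at blue
kubectl patch service service-name -n production \
  -p '{"spec":{"selector":{"version":"blue"}}}'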
Canary Deployment:
# Deploy canary with 10% traffic
kubectl apply -f deployment-canary.yaml -n production
# Monitor canary metrics for 10-15 minutes
# Compare error rates, latency between canary and stable
# If healthy, gradually increase canary traffic
kubectl scale deployment service-name-canary \
--replicas=3 -n production # 30% traffic
# Continue monitoring and scaling until full rollout
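One common way to complete the rollout with this replica-based traffic split (a sketch, not the only option) is to promote the stable deployment to the new image, then remove the canary:
# Promote: move stable to the new image, then retire the canary
kubectl set image deployment/service-name \
  service-name=your-registry/service-name:new-version -n production
kubectl rollout status deployment/service-name -n production
kubectl delete deployment service-name-canary -n production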
Important Considerations:
Verify that the deployment succeeded and the system is healthy:
1. Health Checks:
# Verify all pods are running
kubectl get pods -n production -l app=service-name
# Check application health endpoint
curl https://api.example.com/health
# Expected response: {"status": "healthy", "version": "new-version"}
2. Smoke Tests:
# Run critical path tests
curl -X POST https://api.example.com/api/v1/users \
-H "Content-Type: application/json" \
-d '{"name": "test", "email": "test@example.com"}'
# Verify key functionality works
# Test authentication, critical endpoints, integrations
3. Metrics Validation:
Monitor for at least 15 minutes:
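If your metrics are in Prometheus, a sketch of spot-checking the 5xx rate from the CLI (the endpoint, metric, and label names are assumptions about your setup):
# Hypothetical Prometheus query: 5xx request rate over the last 5 minutes
curl -sG 'http://prometheus.example.com/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{app="service-name",status=~"5.."}[5m]))'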
4. Log Analysis:
# Check for errors in application logs
kubectl logs -n production -l app=service-name \
--since=15m | grep -i error
# Review any warning or error patterns
Validation Checklist:
Finalize deployment and communicate results:
1. Update Documentation:
# Update deployment tracking with results
# Document any issues encountered
# Note any configuration changes made
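If a tracking issue was created during preparation, a sketch of closing it out with the GitHub CLI (the issue number is a placeholder):
# Record the outcome on the tracking issue, then close it
gh issue comment 123 --body "Deployed v[version]. Metrics stable. No issues."
gh issue close 123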
2. Notify Team:
# Post completion message:
# "โ
Production deployment of [service-name] v[version] complete
# Status: Success
# Metrics: [brief summary]
# Issues: None / [describe any issues]"
3. Clean Up (if applicable):
# Remove old blue environment (blue-green deployment)
kubectl delete deployment service-name-blue -n production
# Scale down canary (canary deployment)
kubectl delete deployment service-name-canary -n production
4. Schedule Follow-up:
<best_practices>
Deploy During Low-Traffic Periods
Rationale: Reduces impact if issues occur and makes anomaly detection easier.
Implementation:
Use Feature Flags
Rationale: Allows instant rollback of feature behavior without code deployment.
Example:
# In application code
if feature_flags.is_enabled('new_algorithm'):
    result = new_algorithm(data)
else:
    result = legacy_algorithm(data)
Disable the flag instantly if issues arise; no deployment needed.
Gradual Rollout Strategy
Rationale: Limits blast radius if issues occur.
Implementation:
Medium Freedom: Core safety steps must be followed (pre-deployment checks, monitoring, validation), but deployment method can be adapted based on:
This skill uses approximately 3,500 tokens when fully loaded.
Optimization Strategy:
</best_practices>
<common_pitfalls> Skipping Pre-Deployment Checks
What Happens: Deployment proceeds with failing tests or unhealthy staging environment, leading to production incidents.
Why It Happens: Pressure to deploy quickly, confidence in changes, or assumption that issues are minor.
How to Avoid:
Recovery: If deployed without checks and issues arise, immediately roll back and perform full verification before re-deploying.
Insufficient Monitoring During Deployment
What Happens: Issues go undetected until users report problems, making diagnosis harder and recovery slower.
Why It Happens: Assuming deployment will succeed, distractions, or lack of monitoring setup.
How to Avoid:
Warning Signs:
No Rollback Plan
What Happens: When issues occur, the team scrambles to figure out how to recover, prolonging the incident.
Why It Happens: Optimism bias, time pressure, or lack of experience with rollbacks.
How to Avoid:
Recovery: If issues occur without a rollback plan:
</common_pitfalls>
Standard Kubernetes Rolling Update
Context: Deploying a new version of a stateless API service to production with low-risk changes (bug fixes, minor improvements).
Situation:
Steps:
# Verify CI passed
gh run view --repo company/user-api
# Check staging
kubectl get deployment user-api -n staging
# Output: user-api 3/3 3 3 2h
# Verify staging health
curl https://staging.api.example.com/health
# Output: {"status": "healthy", "version": "v2.4.0"}
# Open Datadog/Grafana dashboard
open https://monitoring.example.com/dashboards/user-api
# Post to Slack
slack post #deployments "๐ Deploying user-api v2.4.0 to production. ETA: 10min"
# Update deployment
kubectl set image deployment/user-api \
user-api=registry.example.com/user-api:v2.4.0 \
-n production
# Monitor rollout
kubectl rollout status deployment/user-api -n production
# Output: deployment "user-api" successfully rolled out
# Check pods
kubectl get pods -n production -l app=user-api
# All pods should show Running status
# Health check
curl https://api.example.com/health
# Output: {"status": "healthy", "version": "v2.4.0"}
# Check metrics (wait 15 minutes)
# - Error rate: 0.3% (was 0.4%, improved ✓)
# - Latency p95: 180ms (was 220ms, improved ✓)
# - Throughput: ~1000 req/min (stable ✓)
slack post #deployments "โ
user-api v2.4.0 deployed successfully. Metrics looking good."
Expected Output:
Deployment successful
- Version: v2.4.0
- Pods: 5/5 running
- Health: All checks passed
- Metrics: Stable/Improved
- Duration: 8 minutes
Outcome: Deployment completed smoothly, performance improved as expected, no issues reported.
Blue-Green Deployment with Database Migration
Context: Deploying a major feature that requires database schema changes, using a blue-green strategy to minimize downtime and enable fast rollback.
Situation:
Challenges:
Steps:
# Verify tests passed
gh run view --repo company/payment-service
# All checks passed: unit (850 tests), integration (120 tests), e2e (45 tests)
# Validate staging thoroughly
curl -X POST https://staging.api.example.com/api/v1/payments \
-H "Authorization: Bearer $STAGING_TOKEN" \
-d '{"method": "new_payment_type", "amount": 100}'
# Success: payment processed with new method
# Check database migration in staging
kubectl exec -n staging payment-service-db -it -- \
psql -U app -c "\d payment_methods"
# New tables exist and are populated
# Apply migration (backward compatible, blue can still run)
kubectl apply -f migration-job.yaml -n production
kubectl wait --for=condition=complete job/payment-migration -n production
# Verify migration
kubectl logs job/payment-migration -n production
# Output: Migration completed successfully. 3 tables added, 0 rows migrated.
# Deploy green environment
kubectl apply -f deployment-green-v3.2.0.yaml -n production
# Wait for green to be ready
kubectl wait --for=condition=ready pod \
-l app=payment-service,version=green \
-n production --timeout=600s
# Test green directly (before traffic switch)
kubectl port-forward -n production \
svc/payment-service-green 8080:80 &
curl http://localhost:8080/health
# Output: {"status": "healthy", "version": "v3.2.0"}
curl -X POST http://localhost:8080/api/v1/payments \
-H "Authorization: Bearer $TEST_TOKEN" \
-d '{"method": "new_payment_type", "amount": 100}'
# Success: payment processed
# Kill port-forward
kill %1
# Post warning
slack post #deployments "โ ๏ธ Switching payment-service traffic to v3.2.0. Monitoring closely."
# Switch service selector to green
kubectl patch service payment-service -n production \
-p '{"spec":{"selector":{"version":"green"}}}'
# Traffic now going to green
# Monitor intensively for 15 minutes
# Check metrics every 2-3 minutes for 15 minutes
# - Error rate: 0.1% (was 0.1%, stable ✓)
# - Latency p95: 150ms (was 145ms, acceptable ✓)
# - Latency p99: 300ms (was 280ms, acceptable ✓)
# - Payment success rate: 99.4% (was 99.5%, within tolerance ✓)
# - New payment method usage: 12 transactions (working ✓)
# Check logs for any errors
kubectl logs -n production -l app=payment-service,version=green \
--since=15m | grep -i error
# No critical errors found
# After 30 minutes of stable operation, remove blue
kubectl delete deployment payment-service-blue -n production
slack post #deployments "โ
payment-service v3.2.0 fully deployed. New payment methods active. Blue environment cleaned up."
Expected Output:
Blue-Green Deployment Success
- Green version: v3.2.0
- Migration: Completed successfully
- Traffic switch: Seamless (no downtime)
- Validation period: 30 minutes
- Metrics: Stable
- Blue cleanup: Completed
- Total duration: 45 minutes
Outcome: Complex deployment with database changes completed successfully. New payment methods working. Zero downtime. Blue kept around for 30 minutes as safety net, then cleaned up.
Emergency Rollback During Canary
Context: Canary deployment detects issues; immediate rollback required.
Situation:
Steps:
# Monitoring shows elevated errors in canary
# Error rate: Canary 5.2%, Stable 0.4%
# Decision: Rollback immediately
# Scale down canary to 0
kubectl scale deployment recommendation-engine-canary \
--replicas=0 -n production
# Verify stable handling 100% traffic
kubectl get deployment -n production
# recommendation-engine: 10/10 ready (stable)
# recommendation-engine-canary: 0/0 ready (scaled down)
# Check error rate
# After 2 minutes: Error rate back to 0.4%
# Collect logs from canary
kubectl logs -n production -l app=recommendation-engine,version=canary \
--since=30m > canary-failure-logs.txt
# Post incident
slack post #incidents "โ ๏ธ Rollback: recommendation-engine v4.1.0 canary showed 5% error rate. Rolled back to v4.0.3. Investigating."
# Create incident ticket
# Document error patterns, affected requests, timeline
# Analyze logs
grep "ERROR" canary-failure-logs.txt | head -20
# Pattern: "NullPointerException in UserPreference.getHistory()"
# Finding: New code didn't handle missing user history gracefully
# Fix needed: Add null check before accessing user history
Expected Output:
Rollback Successful
- Detection time: 8 minutes into canary
- Rollback execution: 30 seconds
- Service recovery: 2 minutes
- Affected traffic: ~20% for 8 minutes
- Root cause: Found within 1 hour
- Fix: Deployed v4.1.1 next day after testing
Outcome: Quick detection and rollback prevented widespread issues. Root cause identified. Proper fix deployed after thorough testing. Canary deployment pattern prevented full-scale incident.
Deployment Stuck (Pods Not Coming Up)
Symptoms:
- kubectl rollout status shows "Waiting for deployment rollout to finish"
- Pods stuck in ImagePullBackOff or CrashLoopBackOff
Diagnostic Steps:
# Check pod status
kubectl get pods -n production -l app=service-name
# Describe problematic pod
kubectl describe pod <pod-name> -n production
# Check logs
kubectl logs <pod-name> -n production
Common Causes and Solutions:
1. Image Pull Error:
# Symptom: ImagePullBackOff
# Cause: Wrong image tag or registry auth issue
# Solution: Verify image exists
docker pull your-registry/service-name:version
# Fix: Correct image tag or update registry credentials
kubectl set image deployment/service-name \
service-name=your-registry/service-name:correct-version \
-n production
2. Application Crash:
# Symptom: CrashLoopBackOff
# Cause: Application error on startup
# Solution: Check application logs
kubectl logs <pod-name> -n production --previous
# Common issues:
# - Missing environment variables
# - Database connection failure
# - Configuration error
# Fix: Update configuration and redeploy
kubectl set env deployment/service-name NEW_VAR=value -n production
3. Resource Constraints:
# Symptom: Pods pending, not scheduled
# Cause: Insufficient cluster resources
# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"
# Solution: Scale down other services or add nodes
kubectl scale deployment low-priority-service --replicas=2
Prevention:
Elevated Error Rates After Deployment
Symptoms:
Diagnostic Steps:
Solution:
Immediate:
# If error rate is critical (>5%), rollback immediately
kubectl rollout undo deployment/service-name -n production
# Monitor for recovery
# If errors persist after rollback, issue may be elsewhere
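If the immediately previous revision is also suspect, rollout history shows what is available and undo can target a specific known-good revision:
# List retained revisions, then roll back to a specific one
kubectl rollout history deployment/service-name -n production
kubectl rollout undo deployment/service-name --to-revision=3 -n production
# (revision 3 is illustrative; pick a known-good revision from the history)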
Investigation:
# Analyze error patterns
kubectl logs -n production -l app=service-name \
--since=30m | grep ERROR | sort | uniq -c | sort -rn
# Common patterns:
# - Dependency timeout: Check downstream services
# - Database errors: Check DB health and connections
# - Validation errors: Check request format changes
Alternative Approaches:
Database Migration Failure
Symptoms:
Quick Fix:
# Check migration status
kubectl logs job/migration-name -n production
# Common issues:
# - Lock timeout: Another migration running
# - Syntax error: SQL error in migration
# - Permission denied: Database user lacks permissions
Root Cause Resolution:
1. Lock Timeout:
# Check for long-running queries
# Connect to database and check pg_stat_activity (Postgres)
kubectl exec -it db-pod -n production -- \
psql -U app -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
# Kill blocking query if safe
# Then retry migration
2. Migration Syntax Error:
# Review migration SQL
# Test in staging or local environment
# Fix syntax and redeploy migration
# Rollback if migration partially applied
# Run rollback migration script
3. Permission Issues:
# Grant necessary permissions
kubectl exec -it db-pod -n production -- \
psql -U admin -c "GRANT ALL ON SCHEMA public TO app_user;"
# Retry migration
Prevention:
<related_skills> This skill works well with:
This skill may conflict with:
<integration_notes>
Working with Other Tools
CI/CD Integration: This skill assumes CI/CD has already run tests. For CI/CD setup, reference your platform documentation.
Monitoring Tools: Examples use generic commands. Adapt for your monitoring stack:
Deployment Tools: Examples use kubectl. Adapt for your deployment method (e.g., Helm via helm upgrade --install; see the sketch below).
Typical workflow combining multiple skills:
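For reference, a hypothetical Helm equivalent of the rolling update above (release name, chart path, and values keys are assumptions about your chart):
# Hypothetical Helm equivalent of the kubectl rolling update
helm upgrade --install service-name ./charts/service-name \
  --namespace production \
  --set image.tag=new-version \
  --wait --timeout 10m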
</integration_notes>
Limitations:
- Examples focus on Kubernetes; adapt for other platforms (VMs, serverless, etc.)
- Assumes you have monitoring infrastructure set up
- Database migration details are brief; use the database-migration skill for complex scenarios
- Rollback procedures assume stateless services; stateful services require additional considerations
Assumptions:
- You have access to the production environment
- Monitoring dashboards are configured
- Staging environment mirrors production
- Team has agreed-upon deployment windows
- Rollback artifacts are retained for a reasonable time
<version_history>
<additional_resources>
<success_criteria> Deployment is considered successful when:
Pre-Deployment Validation Complete
Deployment Execution Success
Post-Deployment Validation Pass
Monitoring Confirms Stability
Documentation and Communication Complete