| name | production-deploy |
| description | Pre-deployment validation and release management — structured checklists for database migrations, environment variables, rollback plans, backward compatibility, and deployment strategies. Use this skill when the user mentions deploy, release, ship to prod, merge to main, CI/CD pipeline, or says /production deploy. Triggers on deployment-related discussions, release planning, or pre-release validation. |
Production Deploy
This skill encodes the pre-deployment discipline that separates teams who ship confidently from teams who ship and pray. Every checklist item here exists because someone skipped it and caused an outage. The patterns are opinionated and battle-tested — this is not a deployment tutorial, it is a deployment gate.
If you cannot check every box, you are not ready to deploy. Partial deploys are how you get partial outages.
1. Pre-Deployment Checklist
Run through this before every production deploy. No exceptions, no shortcuts, no "we'll check it in staging." Print this out and check the boxes with a pen if that is what it takes.
Data Layer
Application Layer
Infrastructure
Observability
Rollback Readiness
Detection — find deploys that skip the checklist:
gh pr list --state merged --base main --limit 20 --json title,body \
| jq '.[] | select(.body | test("deploy checklist|pre-deploy|rollback plan"; "i") | not) | .title'
git tag -l "rollback-*" --sort=-creatordate | head -5
2. Migration Safety Review
Every migration must be classified before it ships. This is the single most important step in any deploy that touches the database. Get it wrong and you take the service down for every user, not just the ones using the new feature.
Classification
Additive (Safe) — Green Light
These migrations are safe for zero-downtime rolling deploys. Old code ignores the new structures.
- New tables
- New nullable columns (without defaults on large tables)
- New indexes (with CONCURRENTLY)
- New views
- New functions/procedures
CREATE TABLE notifications (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL REFERENCES users(id),
message TEXT NOT NULL,
read_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
ALTER TABLE orders ADD COLUMN tracking_url TEXT;
CREATE INDEX CONCURRENTLY ix_notifications_user_id ON notifications (user_id);
Transformative (Risky) — Yellow Light
These require extra review, staging testing with production-volume data, and explicit sign-off. They can be done safely but the safe pattern is non-obvious.
- Column type changes (expand-contract)
- Data backfills on large tables (batch, not single UPDATE)
- Adding NOT NULL constraints (add a CHECK constraint with NOT VALID, then VALIDATE CONSTRAINT)
- Adding CHECK constraints on existing data
- Adding foreign keys on existing data
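from alembic import op
from sqlalchemy import text

def upgrade():
    # Backfill in small batches so no single UPDATE holds row locks for long.
    conn = op.get_bind()
    batch_size = 1000
    while True:
        result = conn.execute(text(
            "UPDATE orders SET status = 'active' "
            "WHERE id IN ("
            "  SELECT id FROM orders WHERE status IS NULL LIMIT :batch"
            ")"
        ), {"batch": batch_size})
        if result.rowcount == 0:
            break
        # Commit each batch so its locks are released before the next one starts.
        conn.execute(text("COMMIT"))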
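The "NOT VALID + VALIDATE" item in the list above refers to a multi-step PostgreSQL pattern. A minimal sketch of the statement order, assuming PostgreSQL 12 or later (where SET NOT NULL can reuse a validated CHECK constraint instead of rescanning the table). The helper name `not_null_steps` and the constraint naming scheme are hypothetical, chosen only for illustration:

```python
def not_null_steps(table: str, column: str) -> list[str]:
    """Return the ordered SQL statements for adding NOT NULL with short locks."""
    constraint = f"{table}_{column}_not_null"  # hypothetical naming scheme
    return [
        # 1. Quick: the constraint is recorded but existing rows are not scanned.
        f"ALTER TABLE {table} ADD CONSTRAINT {constraint} "
        f"CHECK ({column} IS NOT NULL) NOT VALID",
        # 2. Scans the table without blocking normal reads and writes.
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {constraint}",
        # 3. PostgreSQL 12+ can use the validated constraint to skip the scan.
        f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL",
        # 4. The CHECK constraint is now redundant.
        f"ALTER TABLE {table} DROP CONSTRAINT {constraint}",
    ]


for statement in not_null_steps("orders", "status"):
    print(statement + ";")
```

Each statement would normally run as its own migration step so that a failure in validation does not roll back the cheap first step.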
Destructive (Dangerous) — Red Light
REFUSE to deploy destructive migrations without explicit confirmation from the user. These are irreversible. Suggest a zero-downtime alternative first.
- Column drops
- Table drops
- Data deletion (DELETE/TRUNCATE)
- Column renames (breaks old code during rolling deploy)
- Type changes that narrow data (e.g., TEXT to VARCHAR(50))
def downgrade():
    # Re-adding the column restores the schema only; the dropped data is gone.
    op.add_column('users', sa.Column('legacy_role', sa.String(50)))
Detection — find risky migrations before they ship:
grep -rn "drop_column\|drop_table\|DROP TABLE\|DROP COLUMN\|TRUNCATE\|DELETE FROM" \
alembic/versions/ migrations/
grep -rL "lock_timeout" alembic/versions/*.py
squawk alembic/versions/latest_migration.sql
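The three-way classification can also be approximated in code. The sketch below mirrors the grep patterns above; the pattern lists are illustrative and far from exhaustive, and `classify_migration` is a hypothetical helper, not part of Alembic or squawk:

```python
import re

# Patterns loosely follow the Red Light list: drops, deletes, renames.
DESTRUCTIVE = re.compile(
    r"\b(DROP\s+TABLE|DROP\s+COLUMN|TRUNCATE|DELETE\s+FROM|RENAME\s+COLUMN)\b"
    r"|\b(drop_column|drop_table)\b",
    re.IGNORECASE,
)
# Patterns loosely follow the Yellow Light list: type changes, constraints, backfills.
TRANSFORMATIVE = re.compile(
    r"\b(ALTER\s+COLUMN|SET\s+NOT\s+NULL|ADD\s+CONSTRAINT|UPDATE)\b"
    r"|\balter_column\b",
    re.IGNORECASE,
)


def classify_migration(source: str) -> str:
    """Return 'destructive', 'transformative', or 'additive' for migration text."""
    if DESTRUCTIVE.search(source):
        return "destructive"
    if TRANSFORMATIVE.search(source):
        return "transformative"
    return "additive"


print(classify_migration("ALTER TABLE orders ADD COLUMN tracking_url TEXT;"))
print(classify_migration("op.drop_column('users', 'legacy_role')"))
```

A text match like this only flags candidates for human review; it cannot tell whether a matched statement is actually safe in context.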
Migration Lint with squawk
squawk catches unsafe migration patterns automatically. Add it to CI:
- name: Lint migrations
  run: |
    pip install squawk-cli
    # Generate SQL from Alembic migrations
    alembic upgrade head --sql > migration.sql
    squawk migration.sql
squawk catches: missing CONCURRENTLY on indexes, NOT NULL additions without defaults, missing lock_timeout, and more.
3. Environment Variable Validation
Missing environment variables are among the most common causes of deploy failures, alongside bad migrations. The fix is simple: validate everything at startup and fail fast with a clear error message.
Python — Pydantic BaseSettings
from pydantic import field_validator
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        case_sensitive=False,
        extra="ignore",
    )

    # Required: startup fails if any of these is missing.
    database_url: str
    secret_key: str
    allowed_hosts: list[str]
    environment: str

    # Optional, with safe defaults.
    log_level: str = "INFO"
    redis_url: str = "redis://localhost:6379/0"
    cors_origins: list[str] = ["http://localhost:3000"]
    db_pool_size: int = 10
    db_max_overflow: int = 5
    sentry_dsn: str | None = None
    app_version: str = "dev"
    deploy_sha: str = "unknown"

    @field_validator("environment")
    @classmethod
    def validate_environment(cls, v: str) -> str:
        allowed = {"development", "staging", "production"}
        if v not in allowed:
            raise ValueError(f"environment must be one of {allowed}, got '{v}'")
        return v

    @field_validator("log_level")
    @classmethod
    def validate_log_level(cls, v: str) -> str:
        allowed = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}
        if v.upper() not in allowed:
            raise ValueError(f"log_level must be one of {allowed}, got '{v}'")
        return v.upper()


# Instantiate at import time so a bad environment stops the process at startup.
settings = Settings()
Node.js — envalid
import { cleanEnv, str, num, url } from "envalid";

const env = cleanEnv(process.env, {
  DATABASE_URL: url(),
  SECRET_KEY: str({ desc: "JWT signing key" }),
  ALLOWED_HOSTS: str({ desc: "Comma-separated allowed hosts" }),
  NODE_ENV: str({ choices: ["development", "staging", "production"] }),
  LOG_LEVEL: str({
    choices: ["debug", "info", "warn", "error"],
    default: "info",
  }),
  REDIS_URL: url({ default: "redis://localhost:6379/0" }),
  PORT: num({ default: 3000 }),
  SENTRY_DSN: str({ default: "" }),
  APP_VERSION: str({ default: "dev" }),
  DEPLOY_SHA: str({ default: "unknown" }),
});

export default env;
Rules
- Every required env var must be validated at startup, not on first use
- Fail with a human-readable error: "Missing required environment variable: DATABASE_URL" — not a cryptic NoneType error 3 stack frames deep
- Document every env var in a .env.example file committed to the repo
- New env vars added in a PR MUST be set in production BEFORE the deploy, or have a safe default
- Never use os.getenv("SECRET") without a fallback or validation; it silently returns None when the variable is unset
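For services that do not use Pydantic or envalid, the same fail-fast rule can be sketched with the standard library alone. `require_env` and `MissingEnvError` are hypothetical names, and treating an empty string as missing is an assumption made here, not something the rules above require:

```python
import os
from collections.abc import Mapping


class MissingEnvError(RuntimeError):
    """Raised at startup when required environment variables are absent."""


def require_env(
    names: list[str], environ: Mapping[str, str] = os.environ
) -> dict[str, str]:
    """Return the requested variables, or fail listing every missing one."""
    # Empty strings count as missing (an assumption of this sketch).
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise MissingEnvError(
            "Missing required environment variable(s): " + ", ".join(missing)
        )
    return {name: environ[name] for name in names}


# Example with an explicit mapping instead of the real process environment.
config = require_env(["DATABASE_URL"], {"DATABASE_URL": "postgresql://localhost/app"})
print(config["DATABASE_URL"])
```

Reporting all missing names in one error saves a redeploy per variable when several are unset.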
Detection — find unvalidated env vars:
grep -rn "os\.getenv\|os\.environ\[" --include="*.py" src/ app/ \
| grep -v "settings\|config\|test"
grep -rn "process\.env\." --include="*.ts" --include="*.js" src/ \
| grep -v "node_modules\|config\|env\.ts\|env\.js\|test"
comm -23 \
<(grep -rhoP '(?:os\.getenv|os\.environ\[|process\.env\.)["'"'"']?\K[A-Z_]+' src/ | sort -u) \
<(grep -oP '^[A-Z_]+' .env.example 2>/dev/null | sort -u)
4. Deployment Strategies
Choose the right strategy for the risk level. There is no universal best — each has tradeoffs.
Rolling Update (Default for Stateless Services)
New instances start, pass health checks, and begin receiving traffic. Old instances drain and shut down. At any point during the deploy, both old and new code are running simultaneously.
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      containers:
        - name: api
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 15
            periodSeconds: 20
      terminationGracePeriodSeconds: 30
When to use: Standard deploys of stateless services with backward-compatible changes. Most deploys.
Risk: Old and new code run simultaneously. Database schema and API contracts must be backward-compatible.
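To make the maxSurge and maxUnavailable settings concrete, the sketch below computes the pod-count bounds they imply during a rollout. It assumes absolute integer values as in the manifest above (percentage values involve rounding and are not handled), and `rollout_bounds` is a hypothetical helper:

```python
def rollout_bounds(
    replicas: int, max_surge: int, max_unavailable: int
) -> tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    if max_surge == 0 and max_unavailable == 0:
        # Kubernetes rejects this combination: the rollout could never progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return replicas - max_unavailable, replicas + max_surge


min_available, max_total = rollout_bounds(replicas=4, max_surge=1, max_unavailable=0)
print(f"available pods never drop below {min_available}; at most {max_total} pods exist")
```

With the values from the manifest, capacity never dips below the full four replicas, at the cost of one extra pod while the rollout is in progress.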
Blue-Green (Zero-Downtime with Instant Rollback)
Run two identical environments. Deploy to the inactive one ("green"), verify, then switch traffic. Rollback is instant — switch traffic back to "blue."
gcloud run deploy my-service \
--image gcr.io/my-project/my-service:${NEW_SHA} \
--no-traffic \
--tag canary
curl -s https://canary---my-service-xxxxx.a.run.app/health/ready
gcloud run services update-traffic my-service --to-latest
gcloud run services update-traffic my-service \
--to-revisions=my-service-00042-abc=100
services:
  blue:
    image: myapp:current
    ports: ["8001:8000"]
  green:
    image: myapp:${NEW_TAG}
    ports: ["8002:8000"]
  nginx:
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    ports: ["80:80"]
When to use: Critical services where rollback speed matters more than resource cost. Services with hard availability SLAs.
Cost: Requires 2x infrastructure during deploy window. Both environments must be fully functional.
Canary (High-Risk Changes)
Route a small percentage of traffic to the new version. Monitor error rates and latency. Gradually increase traffic if metrics are healthy.
apiVersion: networking.istio.io/v1
kind: VirtualService
spec:
  hosts: ["my-service"]
  http:
    - route:
        - destination:
            host: my-service
            subset: stable
          weight: 95
        - destination:
            host: my-service
            subset: canary
          weight: 5
gcloud run services update-traffic my-service \
--to-revisions=my-service-00043-def=5,my-service-00042-abc=95
When to use: Large schema changes, major refactors, new integrations, anything where "it worked in staging" is not sufficient confidence.
Canary promotion criteria:
- Error rate within 1.5x of stable baseline for 15 minutes
- Latency p99 within 1.5x of stable baseline
- No new error types in logs
- No increase in pod restarts
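The promotion criteria above can be expressed as a single check. This is a sketch under stated assumptions: the `Metrics` fields are hypothetical, and each value is assumed to be already aggregated over the 15-minute observation window by whatever monitoring system is in use:

```python
from dataclasses import dataclass, field


@dataclass
class Metrics:
    error_rate: float          # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    error_types: set[str] = field(default_factory=set)
    pod_restarts: int = 0


def promotion_blockers(canary: Metrics, stable: Metrics, ratio: float = 1.5) -> list[str]:
    """Return the reasons the canary may not be promoted; empty means promote."""
    blockers = []
    if canary.error_rate > stable.error_rate * ratio:
        blockers.append("error rate above 1.5x stable baseline")
    if canary.p99_latency_ms > stable.p99_latency_ms * ratio:
        blockers.append("p99 latency above 1.5x stable baseline")
    new_types = canary.error_types - stable.error_types
    if new_types:
        blockers.append(f"new error types: {sorted(new_types)}")
    if canary.pod_restarts > stable.pod_restarts:
        blockers.append("pod restarts increased")
    return blockers


stable = Metrics(0.010, 200.0, {"TimeoutError"}, 0)
canary = Metrics(0.012, 250.0, {"TimeoutError"}, 0)
print(promotion_blockers(canary, stable))
```

Returning the list of reasons rather than a bare boolean gives the on-call engineer something to paste into the deploy thread. Note that a stable error rate of zero makes any canary error a blocker.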
Feature Flags (Gradual Rollout at Application Level)
Decouple deploy from release. Ship the code, then enable the feature gradually. This is the safest approach for user-facing changes.
import hashlib

import structlog

logger = structlog.get_logger()


class FeatureFlags:
    def __init__(self):
        self._flags: dict[str, dict] = {}

    def is_enabled(self, flag: str, user_id: str | None = None) -> bool:
        config = self._flags.get(flag, {})
        if not config.get("enabled", False):
            return False
        if "percentage" in config and user_id:
            # Built-in hash() is salted per process, so the same user would land
            # in different buckets on different instances. Use a stable digest.
            digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
            return int.from_bytes(digest[:4], "big") % 100 < config["percentage"]
        return True


def get_recommendations(user_id: str, feature_flags: FeatureFlags) -> list:
    if feature_flags.is_enabled("new_recommendation_engine", user_id=user_id):
        logger.info("using_new_engine", user_id=user_id)
        return new_recommendation_engine(user_id)
    return legacy_recommendation_engine(user_id)
Rules:
- Feature flags are temporary. Remove them within 30 days of full rollout.
- Every flag has an owner and a removal date in the tracking system.
- Kill switches (disable a feature instantly) for every new user-facing feature.
- Log when a flag is evaluated — you need this data for debugging.
5. Rollback Playbook
"Redeploy the previous version" is not a rollback plan. A real rollback plan accounts for schema changes, data state, traffic routing, and communication.
Before You Deploy: Write the Rollback Plan
Every deploy PR should include a rollback section. Use this template:
## Rollback Plan
**Estimated rollback time:** 3 minutes
**Rollback command:**
gcloud run services update-traffic my-service --to-revisions=my-service-00042-abc=100
**Schema rollback required:** Yes / No
If yes: `alembic downgrade -1` (tested on staging: [link to test run])
**Data rollback required:** Yes / No
If yes: restore from backup taken at [timestamp] using [procedure]
**Traffic rollback:** Switch nginx upstream back to blue / revert Istio weights
**Who to notify:**
- #engineering-incidents Slack channel
- On-call: @oncall-primary (PagerDuty)
- Product: @pm-name (if user-facing)
**Rollback decision criteria:**
- Error rate > 5% for 2 consecutive minutes
- p99 latency > 2x baseline for 5 minutes
- Any 5xx errors on critical path (checkout, auth)
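The decision criteria in the template above can be evaluated mechanically so the rollback call is not made by gut feeling mid-incident. The sketch assumes one metrics sample per minute, oldest first, and reads "for N minutes" as the N most recent consecutive samples; `MinuteSample` and `should_roll_back` are hypothetical names:

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class MinuteSample:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    critical_path_5xx: int   # 5xx count on checkout, auth, and similar paths


def _trailing_run(samples: list[MinuteSample], bad: Callable[[MinuteSample], bool]) -> int:
    """Count how many of the most recent samples in a row satisfy `bad`."""
    run = 0
    for sample in reversed(samples):
        if not bad(sample):
            break
        run += 1
    return run


def should_roll_back(samples: list[MinuteSample], baseline_p99_ms: float) -> bool:
    # Any 5xx on a critical path triggers immediately.
    if any(s.critical_path_5xx > 0 for s in samples):
        return True
    # Error rate above 5% for 2 consecutive minutes.
    if _trailing_run(samples, lambda s: s.error_rate > 0.05) >= 2:
        return True
    # p99 latency above 2x baseline for 5 consecutive minutes.
    return _trailing_run(samples, lambda s: s.p99_latency_ms > 2 * baseline_p99_ms) >= 5


window = [MinuteSample(0.01, 200.0, 0), MinuteSample(0.06, 210.0, 0), MinuteSample(0.07, 220.0, 0)]
print(should_roll_back(window, baseline_p99_ms=200.0))
```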
Schema Rollback
Can you run alembic downgrade -1 (or equivalent) safely?
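from alembic import op
import sqlalchemy as sa

# Example 1: additive upgrade with a matching downgrade.
def upgrade():
    op.execute("SET lock_timeout = '2s'")
    op.add_column('orders', sa.Column('tracking_url', sa.Text(), nullable=True))

def downgrade():
    op.execute("SET lock_timeout = '2s'")
    op.drop_column('orders', 'tracking_url')

# Example 2 (a separate migration file): the downgrade sets no lock_timeout,
# and dropping the column discards any status values written since the upgrade.
def upgrade():
    op.execute("SET lock_timeout = '2s'")
    op.add_column('orders', sa.Column('status', sa.String(50), nullable=True))

def downgrade():
    op.drop_column('orders', 'status')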
Rules:
- Every migration must have a tested downgrade path
- If the downgrade would lose data, document it explicitly
- Test the downgrade on staging BEFORE deploying to production
- If you cannot downgrade safely, the migration must go through extra review
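A cheap first check for the "every migration must have a downgrade path" rule is to flag migration files whose downgrade is missing or empty. The sketch below uses only the standard library `ast` module; `has_real_downgrade` is a hypothetical helper, and it checks only that a body exists, not that it is correct, so it does not replace testing the downgrade on staging:

```python
import ast


def _is_noop(stmt: ast.stmt) -> bool:
    """True for `pass`, docstrings, and bare constants such as `...`."""
    if isinstance(stmt, ast.Pass):
        return True
    return isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant)


def has_real_downgrade(source: str) -> bool:
    """Return True if the module defines downgrade() with a non-empty body."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == "downgrade":
            return any(not _is_noop(stmt) for stmt in node.body)
    return False


migration = """
def upgrade():
    op.add_column('orders', sa.Column('tracking_url', sa.Text(), nullable=True))

def downgrade():
    pass
"""
print(has_real_downgrade(migration))
```

A downgrade that raises an exception counts as non-empty here, which seems reasonable: it is at least an explicit statement that the migration is one-way.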
Data Rollback
When schema rollback is not enough — you need to restore data.
pgbackrest --stanza=myapp --type=time \
--target="2024-01-15 14:30:00+00" restore
pg_restore -Fc -d myapp --table=orders --data-only \
myapp_pre_deploy_20240115.dump
psql -d myapp -c "SELECT COUNT(*) FROM orders;"
Rule: If your deploy modifies existing data (backfills, transforms, deletes), take a backup of the affected tables BEFORE deploying. Not the whole database — just the tables you are changing. This makes restore fast.
Traffic Rollback
Switch traffic back to the previous version without touching code or schema.
kubectl rollout undo deployment/my-service
gcloud run services update-traffic my-service \
--to-revisions=PREVIOUS_REVISION=100
docker compose exec nginx nginx -s reload
curl -s https://my-service.example.com/health/ready | jq .
Communication Checklist
When rolling back, communicate immediately. Silence during an incident is worse than incomplete information.
- Post in the incident channel: "Rolling back deploy of [PR link]. Elevated error rate detected. Investigating."
- Update the status page if user-facing impact is confirmed
- After rollback completes: "Rollback complete. Service restored. Root cause investigation in progress."
- Create an incident ticket for post-mortem within 24 hours
6. Backward Compatibility Checks
During a rolling deploy, old and new code run simultaneously. If new code makes a breaking change, old instances crash. This is not theoretical — it happens on every deploy that violates these rules.
API Backward Compatibility
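from decimal import Decimal
from pydantic import BaseModel, model_validator

# Safe: adds an optional field; existing clients ignore it.
class OrderResponse(BaseModel):
    id: str
    total: Decimal
    status: str
    tracking_url: str | None = None

# Breaking: removes status, which existing clients still read.
class OrderResponse(BaseModel):
    id: str
    total: Decimal

# Breaking: renames status to order_status.
class OrderResponse(BaseModel):
    id: str
    total: Decimal
    order_status: str

# Transitional: serve both names until every client has migrated.
class OrderResponse(BaseModel):
    id: str
    total: Decimal
    status: str
    order_status: str

    @model_validator(mode="after")
    def sync_status_fields(self):
        self.status = self.order_status
        return self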
API rules:
- Adding fields is safe
- Removing or renaming fields is breaking
- Changing field types is breaking (string to int, or required to optional, changes the contract)
- New required request fields are breaking for existing clients
- Use API versioning (/v1/, /v2/) for intentional breaking changes
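The first three API rules can be checked by diffing a description of the old and new response shapes. A minimal sketch, assuming each schema is reduced to a hypothetical mapping of field name to type name; a real check would more likely diff generated OpenAPI documents:

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return breaking differences between two field-name -> type-name mappings."""
    problems = []
    for name, old_type in old.items():
        if name not in new:
            # A rename shows up here too: the old name disappears.
            problems.append(f"removed field: {name}")
        elif new[name] != old_type:
            problems.append(f"type changed: {name} ({old_type} -> {new[name]})")
    # Fields present only in `new` are additions, which are safe.
    return problems


current = {"id": "str", "total": "Decimal", "status": "str"}
renamed = {"id": "str", "total": "Decimal", "order_status": "str"}
print(breaking_changes(current, renamed))
```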
Database Backward Compatibility
During rolling deploy, old code runs against the new schema. The schema must work with both versions.
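-- Safe: additive; old code ignores the new column.
ALTER TABLE orders ADD COLUMN tracking_url TEXT;

-- Breaking: old code still reads and writes "status".
ALTER TABLE orders RENAME COLUMN status TO order_status;

-- Breaking if old code can still write NULL into status.
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;

-- Safe alternative to the rename: add the new column and keep both in sync.
ALTER TABLE orders ADD COLUMN order_status TEXT;
CREATE OR REPLACE FUNCTION sync_order_status() RETURNS TRIGGER AS $$
BEGIN
    IF NEW.order_status IS NULL THEN
        NEW.order_status := NEW.status;
    END IF;
    IF NEW.status IS NULL THEN
        NEW.status := NEW.order_status;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- The function has no effect until a trigger calls it (trigger name is illustrative;
-- EXECUTE FUNCTION requires PostgreSQL 11+).
CREATE TRIGGER orders_sync_status BEFORE INSERT OR UPDATE ON orders
    FOR EACH ROW EXECUTE FUNCTION sync_order_status();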
Configuration Backward Compatibility
New environment variables must either have safe defaults or be set in production BEFORE the deploy.
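# Breaking: required with no default, so startup fails unless it is already set in production.
new_feature_url: str
# Safe: defaults keep existing environments working.
new_feature_url: str = ""
new_feature_enabled: bool = False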
Detection — find backward compatibility violations:
git diff main...HEAD --unified=0 -- "*.py" \
| grep "^-.*:\s*\(str\|int\|float\|bool\|Decimal\|datetime\)" \
| grep -v "test\|#"
grep -rn "alter_column\|RENAME COLUMN\|rename_column" \
alembic/versions/ migrations/ --include="*.py" --include="*.sql"
git diff main...HEAD -- "*.py" \
| grep "^+.*:\s*str$\|^+.*:\s*int$\|^+.*:\s*bool$" \
| grep -i "settings\|config"
7. CI/CD Pipeline Best Practices
A deploy pipeline is a series of gates. Every gate must pass before the next one opens. If any gate fails, the deploy stops. No manual overrides, no "skip CI" on deploy branches.
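The gate model is simple enough to state in a few lines: run checks in order and stop at the first failure, so later stages never execute. This is only an illustration of the control flow that a CI system's job dependencies provide; `run_gates` is a hypothetical function, not a replacement for the pipeline itself:

```python
from collections.abc import Callable

Gate = tuple[str, Callable[[], bool]]


def run_gates(gates: list[Gate]) -> list[str]:
    """Run gates in order and return the names that passed; raise on the first failure."""
    passed: list[str] = []
    for name, check in gates:
        if not check():
            # Later gates are never evaluated once one fails.
            raise RuntimeError(f"gate failed: {name}; deploy stopped after {passed}")
        passed.append(name)
    return passed


print(run_gates([("test", lambda: True), ("security-scan", lambda: True)]))
```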
Pipeline Stages
name: Deploy to Production
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: |
          pip install -r requirements.txt -r requirements-dev.txt
          pytest --tb=short --strict-markers -q
          # Tests MUST pass. No flaky test exceptions.
      - name: Type check
        run: mypy src/ --strict
      - name: Lint
        run: ruff check src/
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Dependency audit (Python)
        run: |
          pip install pip-audit
          pip-audit -r requirements.txt --strict
      - name: Secret scan
        uses: trufflesecurity/trufflehog@main
        with:
          extra_args: --only-verified
  migration-lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint migrations with squawk
        run: |
          pip install squawk-cli
          # Generate SQL from new migrations
          alembic upgrade head --sql > migrations.sql
          squawk migrations.sql
  build:
    needs: [test, security-scan, migration-lint]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image
        run: |
          docker build \
            --tag myapp:${{ github.sha }} \
            --label org.opencontainers.image.revision=${{ github.sha }} \
            .
      - name: Scan image with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: myapp:${{ github.sha }}
          exit-code: 1
          severity: HIGH,CRITICAL
          ignore-unfixed: true
      - name: Push to registry
        run: |
          docker tag myapp:${{ github.sha }} gcr.io/my-project/myapp:${{ github.sha }}
          docker push gcr.io/my-project/myapp:${{ github.sha }}
  deploy-staging:
    needs: [build]
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - name: Deploy to staging
        run: |
          gcloud run deploy myapp-staging \
            --image gcr.io/my-project/myapp:${{ github.sha }} \
            --region us-central1
      - name: Run smoke tests against staging
        run: |
          # Wait for deployment to stabilize
          sleep 10
          curl -sf https://myapp-staging.example.com/health/ready
          # Run integration test suite against staging
          pytest tests/smoke/ --base-url=https://myapp-staging.example.com
  deploy-production:
    needs: [deploy-staging]
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to production
        run: |
          gcloud run deploy myapp \
            --image gcr.io/my-project/myapp:${{ github.sha }} \
            --region us-central1
      - name: Verify production health
        run: |
          sleep 10
          curl -sf https://myapp.example.com/health/ready
          # Check error rate has not spiked
Pipeline Rules
- Tests must pass. No "known flaky" exceptions. Fix flaky tests or delete them. A test suite you do not trust is worse than no tests, because it trains the team to ignore failures.
- Security scan must pass. pip-audit, npm audit, or trivy with --exit-code 1. HIGH and CRITICAL findings block the deploy.
- Migration lint must pass. squawk (or equivalent) catches missing CONCURRENTLY, missing lock_timeout, unsafe NOT NULL additions.
- Docker image must build and scan clean. Trivy image scan with HIGH/CRITICAL severity gate.
- Deploy to staging first. Run smoke tests. If smoke tests fail, do not deploy to production. Ever.
- Production deploy requires approval. Use GitHub environment protection rules, GitLab approval gates, or equivalent. No one person should be able to push to production without review.
- Tag every deploy. After production deploy succeeds, tag the commit:
git tag -a "deploy-$(date +%Y%m%d-%H%M%S)" -m "Deployed to production"
git push origin --tags
Detection — find pipeline gaps:
ls .github/workflows/ || ls .gitlab-ci.yml || ls Jenkinsfile || echo "NO CI CONFIG FOUND"
grep -r "trivy\|pip-audit\|npm audit\|snyk\|grype" .github/workflows/ .gitlab-ci.yml 2>/dev/null \
|| echo "NO SECURITY SCANNING IN CI"
grep -r "staging" .github/workflows/ .gitlab-ci.yml 2>/dev/null \
|| echo "NO STAGING DEPLOY IN CI"
git log --oneline -20 | grep -i "skip ci\|no-ci\|\[ci skip\]"
Anti-Patterns
These are the deployment mistakes that cause real outages. Every one has been seen in production.
| Anti-Pattern | Why It Kills You | Fix |
|---|---|---|
| Deploy Friday at 5 PM | No one around to fix issues, users hit errors all weekend | Deploy early in the week, early in the day. Never before a holiday. |
| "YOLO merge to main" | No review, no checklist, no rollback plan | Enforce branch protection, require PR review, run the checklist |
| Skipping staging | "It works on my machine" is not a deployment strategy | Always deploy to staging first. Always run smoke tests. |
| No rollback plan | "We'll figure it out" becomes "we're figuring it out at 3 AM" | Write the rollback plan before deploying. Test it. |
| Deploying schema + code simultaneously | If code deploy fails, schema is already changed. Rollback breaks. | Deploy schema changes separately from code changes when possible. |
| Manual deploys via SSH | Unreproducible, unauditable, error-prone | All deploys through CI/CD. No SSH to production for deploys. |
| --force pushing to main | Destroys commit history, breaks other developers, loses rollback targets | Never force push to main. Ever. |
| Ignoring failed CI checks | "That test is flaky" — until the day it catches a real bug | Fix or delete flaky tests. CI must be trusted. |
| No deploy tags/artifacts | "Which version is in production?" — if you cannot answer instantly, you have a problem | Tag every deploy. Store build artifacts with SHA. |
| Env vars in code, not config | Secrets in git history forever, env-specific logic scattered everywhere | Pydantic Settings / envalid. Everything from environment. |
Quick Reference: Deploy Day Commands
kubectl get pods -l app=my-service -o wide
curl -s https://my-service.example.com/health/ready | jq .
git log --oneline production..main
git tag -a "pre-deploy-$(date +%Y%m%d-%H%M%S)" -m "Pre-deploy checkpoint"
git push origin --tags
curl -s https://my-service.example.com/health/ready | jq .
kubectl rollout undo deployment/my-service
gcloud run services update-traffic my-service --to-revisions=PREVIOUS=100
Cross-References
- For FastAPI-specific production patterns (lifespan, middleware, async), see production-fastapi
- For database migration safety, indexing, and connection pooling, see production-postgres
- For container hardening (multi-stage, non-root, distroless, secrets), see production-docker
- For OpenTelemetry traces, structured logging, and alerting, see production-monitoring
- For comprehensive production-readiness code review, see production-review
- For architecture planning with failure modes and ADRs, see production-planner
- For automated production-readiness checks, see production-check