| name | database-design-patterns |
| description | Database schema design patterns and optimization strategies for relational and NoSQL databases. Use when designing database schemas, optimizing query performance, or implementing data persistence layers at scale. |
Expert guidance for designing scalable database schemas, optimizing query performance, and implementing robust data persistence layers across relational and NoSQL databases.
Schema design should reflect business domain, access patterns, and consistency requirements.
Key Considerations:
✅ Model entities and relationships clearly
✅ Design for your query patterns, not just storage
✅ Consider read vs. write ratios
✅ Plan for data growth and scalability
✅ Balance normalization with performance needs
❌ Over-normalize for OLTP workloads
❌ Ignore access patterns
❌ Premature optimization
❌ One-size-fits-all approach
ACID (Relational Databases):
- Atomicity: transactions complete fully or not at all
- Consistency: every transaction moves the database between valid states
- Isolation: concurrent transactions do not interfere with each other
- Durability: committed data survives crashes
BASE (NoSQL Databases):
- Basically Available: the system keeps responding, even during partial failures
- Soft state: state may change over time without new input
- Eventual consistency: replicas converge given enough time
CAP Theorem: distributed systems can guarantee only two of three:
- Consistency: every read sees the most recent write
- Availability: every request receives a response
- Partition tolerance: the system keeps operating despite network partitions
Since network partitions are unavoidable in practice, the real choice is between consistency and availability during a partition.
Use the right database for each use case:
- Relational (PostgreSQL, MySQL): transactions, structured data, strong consistency
- Document (MongoDB): flexible schemas, hierarchical data
- Key-value (Redis): caching, sessions, counters
- Wide-column (Cassandra): write-heavy workloads, time-series at scale
- Graph (Neo4j): highly connected data and relationship queries
Purpose: Eliminate data redundancy and maintain consistency.
First Normal Form (1NF):
-- Violation: Multiple values in single column
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_name VARCHAR(100),
products VARCHAR(500) -- "Product1,Product2,Product3"
);
-- 1NF: Atomic values only
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_name VARCHAR(100)
);
CREATE TABLE order_items (
order_id INT,
product_id INT,
quantity INT,
PRIMARY KEY (order_id, product_id),
FOREIGN KEY (order_id) REFERENCES orders(order_id)
);
Third Normal Form (3NF):
-- Violation: Transitive dependency
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
department_id INT,
department_name VARCHAR(100), -- Depends on department_id
department_budget DECIMAL(12,2) -- Depends on department_id
);
-- 3NF: Remove transitive dependencies
-- (create departments first so the foreign key in employees resolves)
CREATE TABLE departments (
    department_id INT PRIMARY KEY,
    department_name VARCHAR(100),
    department_budget DECIMAL(12,2)
);
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    employee_name VARCHAR(100),
    department_id INT,
    FOREIGN KEY (department_id) REFERENCES departments(department_id)
);
When to Use:
- OLTP systems with frequent writes and updates
- When data integrity and consistency are the priority
- When storage efficiency matters
Trade-offs:
- More joins, which can slow complex read queries
- More tables to understand and maintain
Purpose: Optimize read performance by storing redundant data.
-- Normalized (requires joins)
SELECT o.order_id, o.order_date, c.customer_name, c.email
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
-- Denormalized (single table query)
CREATE TABLE orders_denormalized (
order_id INT PRIMARY KEY,
order_date DATE,
customer_id INT,
customer_name VARCHAR(100), -- Redundant
customer_email VARCHAR(100), -- Redundant
order_total DECIMAL(10,2)
);
-- Trade-off: Faster reads, but must update customer info in multiple places
When to Use:
- Read-heavy workloads where join cost dominates
- Reporting and analytics queries
- Data that changes infrequently
Strategies:
- Duplicate frequently joined columns into the read-heavy table
- Precompute aggregates (e.g., order totals, counts)
- Keep redundant copies in sync via triggers, application logic, or batch jobs (see the sketch below)
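One way to manage the update anomaly noted above is to maintain the redundant columns automatically. A minimal PostgreSQL sketch, assuming the customers and orders_denormalized tables from this section (the function and trigger names are hypothetical):

-- Keep the redundant customer columns in sync on every customer update
CREATE OR REPLACE FUNCTION sync_customer_info() RETURNS trigger AS $$
BEGIN
    UPDATE orders_denormalized
    SET customer_name  = NEW.customer_name,
        customer_email = NEW.email
    WHERE customer_id = NEW.customer_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_sync_customer_info
AFTER UPDATE ON customers
FOR EACH ROW EXECUTE FUNCTION sync_customer_info();

This shifts the cost to writes, which is exactly the trade-off denormalization accepts.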
Purpose: Optimize analytical queries with fact and dimension tables.
-- Fact table (quantitative data); in practice, create the dimension
-- tables below first so these foreign keys resolve
CREATE TABLE sales_fact (
sale_id BIGINT PRIMARY KEY,
date_id INT,
product_id INT,
customer_id INT,
store_id INT,
quantity INT,
revenue DECIMAL(12,2),
cost DECIMAL(12,2),
profit DECIMAL(12,2),
FOREIGN KEY (date_id) REFERENCES date_dimension(date_id),
FOREIGN KEY (product_id) REFERENCES product_dimension(product_id),
FOREIGN KEY (customer_id) REFERENCES customer_dimension(customer_id),
FOREIGN KEY (store_id) REFERENCES store_dimension(store_id)
);
-- Dimension tables (descriptive attributes)
CREATE TABLE date_dimension (
date_id INT PRIMARY KEY,
date DATE,
year INT,
quarter INT,
month INT,
day_of_week VARCHAR(10),
is_holiday BOOLEAN
);
CREATE TABLE product_dimension (
product_id INT PRIMARY KEY,
product_name VARCHAR(200),
category VARCHAR(100),
brand VARCHAR(100),
price DECIMAL(10,2)
);
-- Query: Total revenue by product category and quarter
SELECT
p.category,
d.year,
d.quarter,
SUM(s.revenue) as total_revenue
FROM sales_fact s
JOIN product_dimension p ON s.product_id = p.product_id
JOIN date_dimension d ON s.date_id = d.date_id
GROUP BY p.category, d.year, d.quarter;
Benefits:
- Intuitive structure for analysts and BI tools
- Fewer joins than a fully normalized schema
- Fast aggregations over the fact table
- Dimension tables stay small and cacheable
Purpose: Store related data together for efficient retrieval.
Embedding (One-to-Few):
// Good: Embed related data accessed together
{
  "_id": ObjectId("..."),
  "customer_name": "John Doe",
  "email": "john@example.com",
  "addresses": [
    {
      "type": "shipping",
      "street": "123 Main St",
      "city": "Boston",
      "state": "MA",
      "zip": "02101"
    },
    {
      "type": "billing",
      "street": "456 Oak Ave",
      "city": "Boston",
      "state": "MA",
      "zip": "02102"
    }
  ]
}
Referencing (One-to-Many):
// Orders collection (parent)
{
  "_id": ObjectId("..."),
  "order_date": ISODate("2024-01-15"),
  "customer_id": ObjectId("..."),
  "item_ids": [
    ObjectId("item1"),
    ObjectId("item2"),
    ObjectId("item3")
  ]
}

// Order Items collection (children)
{
  "_id": ObjectId("item1"),
  "product_id": ObjectId("..."),
  "quantity": 2,
  "price": 29.99
}
Two-Way Referencing (Many-to-Many):
// Products collection
{
  "_id": ObjectId("prod123"),
  "name": "Laptop",
  "category_ids": [ObjectId("cat1"), ObjectId("cat2")]
}

// Categories collection
{
  "_id": ObjectId("cat1"),
  "name": "Electronics",
  "product_ids": [ObjectId("prod123"), ObjectId("prod456")]
}
Guidelines:
- Embed data that is read together and bounded in size (one-to-few)
- Reference data that grows without bound or is shared across documents
- Avoid unbounded arrays; MongoDB caps documents at 16 MB
- Two-way references trade write complexity for fast reads in both directions
Purpose: Fast lookups, range queries, sorting.
-- Single column index
CREATE INDEX idx_customers_email ON customers(email);
-- Composite index (order matters!)
CREATE INDEX idx_orders_customer_date
ON orders(customer_id, order_date);
-- Query benefits from composite index (uses both columns)
SELECT * FROM orders
WHERE customer_id = 123
AND order_date >= '2024-01-01';
-- Query also benefits (uses only the leftmost column, customer_id)
SELECT * FROM orders
WHERE customer_id = 123;
-- Query does NOT benefit (order_date not leftmost)
SELECT * FROM orders
WHERE order_date >= '2024-01-01';
Composite Index Guidelines:
- Order columns to match your most common queries (leftmost prefix rule)
- Put equality-filtered columns before range-filtered columns
- Higher-selectivity columns earlier generally help equality lookups
Purpose: Exact match lookups (very fast, O(1)).
-- PostgreSQL hash index
CREATE INDEX idx_users_username_hash
ON users USING HASH (username);
-- Only useful for equality checks
SELECT * FROM users WHERE username = 'john_doe'; -- Fast
-- NOT useful for range queries
SELECT * FROM users WHERE username > 'john'; -- Won't use index
When to Use:
- Equality-only lookups on large tables
- Not for range queries, sorting, or pattern matching
Purpose: Query satisfied entirely by index (no table lookup).
-- Covering index includes all query columns
CREATE INDEX idx_orders_covering
ON orders(customer_id, order_date, order_total);
-- Query uses index-only scan (very fast)
SELECT customer_id, order_date, order_total
FROM orders
WHERE customer_id = 123
AND order_date >= '2024-01-01';
Purpose: Index only subset of rows.
-- Index only active orders
CREATE INDEX idx_active_orders
ON orders(order_date)
WHERE status = 'active';
-- Smaller index, faster queries for active orders
SELECT * FROM orders
WHERE status = 'active'
AND order_date >= '2024-01-01';
Purpose: Search text content efficiently.
-- PostgreSQL full-text index
CREATE INDEX idx_products_fulltext
ON products USING GIN (to_tsvector('english', description));
-- Full-text search query
SELECT * FROM products
WHERE to_tsvector('english', description) @@ to_tsquery('laptop & wireless');
-- 1. Analyze query execution plans
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;
-- 2. Monitor index usage
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan;
-- 3. Remove unused indexes
DROP INDEX idx_rarely_used;
-- 4. Consider index maintenance cost
-- Indexes slow down INSERT, UPDATE, DELETE operations
-- Balance read performance vs. write performance
Index Anti-Patterns:
- Indexing every column "just in case"
- Redundant indexes (e.g., an index on (a) alongside one on (a, b))
- Indexes on low-cardinality, evenly distributed columns
- Wrapping indexed columns in functions in WHERE clauses (see the sketch below)
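To make the last anti-pattern concrete: wrapping an indexed column in a function usually prevents the planner from using the index. A PostgreSQL sketch (index names are illustrative):

-- Anti-pattern: the function call hides email from the planner
CREATE INDEX idx_users_email ON users(email);
SELECT * FROM users WHERE LOWER(email) = 'john@example.com'; -- Sequential scan

-- Fix: an expression index matching the query's expression
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
SELECT * FROM users WHERE LOWER(email) = 'john@example.com'; -- Index scan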
Purpose: Split large tables into smaller pieces so queries scan only the relevant partitions. (Unlike sharding, covered later, partitions typically live on a single server.)
-- PostgreSQL range partitioning by date
CREATE TABLE orders (
order_id BIGINT,
order_date DATE,
customer_id INT,
order_total DECIMAL(10,2)
) PARTITION BY RANGE (order_date);
-- Partitions by month
CREATE TABLE orders_2024_01 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE orders_2024_02 PARTITION OF orders
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
-- Query automatically routes to correct partition
SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
Benefits:
- Partition pruning: queries scan only the relevant partitions
- Fast bulk deletes by dropping old partitions
- Smaller per-partition indexes
-- Distribute data evenly across partitions
CREATE TABLE customers (
customer_id INT,
customer_name VARCHAR(100),
email VARCHAR(100)
) PARTITION BY HASH (customer_id);
CREATE TABLE customers_p0 PARTITION OF customers
FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE customers_p1 PARTITION OF customers
FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE customers_p2 PARTITION OF customers
FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE customers_p3 PARTITION OF customers
FOR VALUES WITH (MODULUS 4, REMAINDER 3);
Use Cases:
- Even data distribution when no natural range key exists
- Spreading write load across partitions
-- Partition by discrete values
CREATE TABLE sales (
sale_id BIGINT,
region VARCHAR(50),
sale_amount DECIMAL(10,2)
) PARTITION BY LIST (region);
CREATE TABLE sales_north_america PARTITION OF sales
FOR VALUES IN ('USA', 'Canada', 'Mexico');
CREATE TABLE sales_europe PARTITION OF sales
FOR VALUES IN ('UK', 'France', 'Germany', 'Italy');
CREATE TABLE sales_asia PARTITION OF sales
FOR VALUES IN ('Japan', 'China', 'India', 'Singapore');
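One caveat worth noting: an insert whose region matches none of the lists above fails outright. PostgreSQL 11+ supports a catch-all default partition:

-- Catch-all partition for regions not named in any list
CREATE TABLE sales_other PARTITION OF sales DEFAULT;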
Purpose: Distribute data across multiple database servers.
# Shard routing logic in application
# (NUM_SHARDS and db_connections are assumed to be configured elsewhere)
def get_db_connection(customer_id):
    shard_id = customer_id % NUM_SHARDS
    return db_connections[shard_id]

# Write
db = get_db_connection(customer_id)
db.execute("INSERT INTO orders ...")

# Read
db = get_db_connection(customer_id)
orders = db.query("SELECT * FROM orders WHERE customer_id = ?", customer_id)
Sharding Keys:
- Choose a key with high cardinality and even distribution (e.g., customer_id)
- The key should appear in most queries so they route to a single shard
- Avoid monotonically increasing keys that hotspot a single shard
Challenges:
- Cross-shard queries and joins (scatter-gather, sketched below)
- Distributed transactions
- Rebalancing when shards fill unevenly
- Hotspots from skewed keys
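Cross-shard queries typically fan out to every shard and merge results in the application. A minimal sketch building on the routing code above (db_connections is the same assumed list of shard connections):

# Scatter-gather: no single shard holds all the data, so query every
# shard and merge the results in the application
def get_orders_by_date(order_date):
    results = []
    for shard in db_connections:  # one round-trip per shard
        rows = shard.query(
            "SELECT * FROM orders WHERE order_date = ?", order_date)
        results.extend(rows)
    # Merge step: re-sort the combined rows, since each shard only sorts locally
    return sorted(results, key=lambda row: row["order_id"])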
// MongoDB sharding (automatic)
sh.enableSharding("mydb")
sh.shardCollection("mydb.orders", { customer_id: 1 })
-- Citus (PostgreSQL extension)
SELECT create_distributed_table('orders', 'customer_id');
Benefits:
- Horizontal write scalability
- Smaller working set per server
- Failure isolation between shards
Purpose: Scale reads and provide high availability.
Primary (Write) → [Replication] → Replica 1 (Read)
                                → Replica 2 (Read)
                                → Replica 3 (Read)
Write Path:
Application → Primary → Sync to replicas → Acknowledge
Read Path:
Application → Load Balancer → Replica (round-robin)
Replication Modes:
-- PostgreSQL synchronous replication
synchronous_commit = on
synchronous_standby_names = 'replica1,replica2'
-- Write waits for replica acknowledgment
-- Pros: No data loss, strong consistency
-- Cons: Higher latency, availability depends on replicas
-- PostgreSQL asynchronous replication
synchronous_commit = off
-- Write returns immediately
-- Pros: Low latency, high availability
-- Cons: Potential data loss, eventual consistency
Read Scaling Strategy:
# Route reads to replicas, writes to primary
import random

class DatabaseRouter:
    def db_for_read(self):
        return random.choice(REPLICA_CONNECTIONS)

    def db_for_write(self):
        return PRIMARY_CONNECTION

# Application code (Django-style ORM shown for illustration)
user = User.objects.using('replica').get(id=123)  # Read from replica
user.name = "New Name"
user.save(using='primary')  # Write to primary
Replication Lag Handling:
# Read-your-writes consistency
def update_user(user_id, data):
    # Write to primary
    primary_db.execute("UPDATE users SET ... WHERE id = ?", user_id)
    # Read from primary immediately after write
    return primary_db.query("SELECT * FROM users WHERE id = ?", user_id)

# For non-critical reads, use a replica (may lag slightly behind the primary)
def get_user_profile(user_id):
    return replica_db.query("SELECT * FROM users WHERE id = ?", user_id)
Purpose: Accept writes at multiple locations (geo-distributed).
Leader 1 (US) ←→ [Replication] ←→ Leader 2 (EU)
      ↓                                ↓
  Replica 1                        Replica 2

Applications in US → Leader 1
Applications in EU → Leader 2
Conflict Resolution:
-- Last-write-wins (LWW)
UPDATE users SET
name = 'Alice',
updated_at = CURRENT_TIMESTAMP
WHERE id = 123;
-- Custom merge logic
-- User 1 changes email, User 2 changes name
-- Result: Both changes applied
-- Version vectors (Cassandra, DynamoDB)
-- Track changes per node
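One way to implement custom merge logic is field-level last-write-wins: each field carries its own timestamp, so non-conflicting changes both survive. A minimal Python sketch (the field -> (value, timestamp) layout is hypothetical):

# Field-level last-write-wins merge: for each field, keep the value
# with the newer timestamp
def merge_versions(local, remote):
    merged = {}
    for field in local.keys() | remote.keys():
        l, r = local.get(field), remote.get(field)
        if l is None:
            merged[field] = r
        elif r is None:
            merged[field] = l
        else:
            merged[field] = l if l[1] >= r[1] else r
    return merged

# Example: user 1 changed email, user 2 changed name -> both changes survive
local  = {"name": ("Alice", 100), "email": ("a@old.com", 90)}
remote = {"name": ("Alicia", 95), "email": ("a@new.com", 110)}
assert merge_versions(local, remote) == {
    "name": ("Alice", 100), "email": ("a@new.com", 110)}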
Use Cases:
Purpose: No single leader, all nodes accept reads/writes.
Application → [Write to N nodes] → Node 1
                                 → Node 2
                                 → Node 3
Quorum: W + R > N
N = Total replicas
W = Write quorum
R = Read quorum
Example: N=3, W=2, R=2
Write succeeds when 2/3 nodes acknowledge
Read from 2/3 nodes guarantees latest value
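The quorum rule is easy to capture in code; a small sketch:

# W + R > N guarantees every read quorum overlaps every write quorum
# in at least one replica, so reads see the latest acknowledged write
def is_strongly_consistent(n_replicas, write_quorum, read_quorum):
    return write_quorum + read_quorum > n_replicas

assert is_strongly_consistent(3, 2, 2)       # N=3, W=2, R=2: consistent
assert not is_strongly_consistent(3, 1, 1)   # fast, but only eventually consistent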
Cassandra Example:
-- In cqlsh, set the consistency level before issuing statements
CONSISTENCY QUORUM; -- Applies to the statements that follow

SELECT * FROM users WHERE id = 123; -- Read from a majority of replicas
INSERT INTO users (id, name) VALUES (123, 'Alice'); -- Write to a majority
Tunable Consistency:
- ONE: fastest, weakest guarantees
- QUORUM: balanced latency and consistency
- ALL: strongest consistency, lowest availability
- Choose per query based on the operation's requirements
-- PostgreSQL EXPLAIN
EXPLAIN ANALYZE
SELECT c.customer_name, COUNT(o.order_id)
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.customer_name;
-- Look for:
-- - Sequential scans (add indexes)
-- - High cost operations
-- - Inefficient joins
-- - Missing statistics
-- Before: Sequential scan
SELECT * FROM orders WHERE customer_id = 123;
-- Execution time: 500ms
-- Add index
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
-- After: Index scan
-- Execution time: 5ms
-- Inefficient: Subquery in SELECT
SELECT
p.product_name,
(SELECT COUNT(*) FROM orders o
WHERE o.product_id = p.product_id) as order_count
FROM products p;
-- Efficient: JOIN with aggregation
SELECT
p.product_name,
COUNT(o.order_id) as order_count
FROM products p
LEFT JOIN orders o ON p.product_id = o.product_id
GROUP BY p.product_id, p.product_name;
# Bad: N+1 queries (1 for users + N for orders)
users = db.query("SELECT * FROM users")
for user in users:
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user.id)

# Good: Single join query
users_with_orders = db.query("""
    SELECT u.*, o.order_id, o.order_date, o.order_total
    FROM users u
    LEFT JOIN orders o ON u.user_id = o.user_id
""")
-- Inefficient: OFFSET grows slower
SELECT * FROM orders
ORDER BY order_date DESC
LIMIT 100 OFFSET 10000; -- Scans 10,100 rows
-- Efficient: Keyset pagination
-- (include a unique tie-breaker column, e.g. order_id, if order_date
-- can repeat, or rows sharing the boundary date will be skipped)
SELECT * FROM orders
WHERE order_date < '2024-01-01'
ORDER BY order_date DESC
LIMIT 100; -- Uses index
-- Inefficient: Multiple single inserts
INSERT INTO orders (customer_id, order_date) VALUES (1, '2024-01-01');
INSERT INTO orders (customer_id, order_date) VALUES (2, '2024-01-01');
INSERT INTO orders (customer_id, order_date) VALUES (3, '2024-01-01');
-- Efficient: Batch insert
INSERT INTO orders (customer_id, order_date) VALUES
(1, '2024-01-01'),
(2, '2024-01-01'),
(3, '2024-01-01');
# Redis caching layer (assumes a configured `redis` client and `db` handle)
import json

def get_user(user_id):
    # Check cache first
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # Cache miss: query database
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)

    # Store in cache (TTL: 1 hour)
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))
    return user

# Cache invalidation on update
def update_user(user_id, data):
    db.execute("UPDATE users SET ... WHERE id = ?", user_id)
    redis.delete(f"user:{user_id}")  # Invalidate cache
-- Materialized views (PostgreSQL)
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
DATE(order_date) as sale_date,
SUM(order_total) as total_sales,
COUNT(*) as order_count
FROM orders
GROUP BY DATE(order_date);
-- Refresh periodically (use REFRESH ... CONCURRENTLY with a unique
-- index on the view to avoid blocking readers)
REFRESH MATERIALIZED VIEW daily_sales_summary;
-- Query cached results
SELECT * FROM daily_sales_summary WHERE sale_date = CURRENT_DATE;
Purpose: Reuse database connections to reduce overhead.
# Without pooling (inefficient): a new connection per query
import psycopg2

def query_database():
    conn = psycopg2.connect(...)  # New connection each time
    cursor = conn.cursor()
    cursor.execute("SELECT ...")
    conn.close()

# With pooling (efficient)
from psycopg2.pool import SimpleConnectionPool

pool = SimpleConnectionPool(
    minconn=5,
    maxconn=20,
    host="localhost",
    database="mydb"
)

def query_database():
    conn = pool.getconn()  # Reuse an existing connection
    cursor = conn.cursor()
    cursor.execute("SELECT ...")
    pool.putconn(conn)  # Return connection to the pool
Configuration Guidelines:
Pool Size = (Number of Application Servers × Threads per Server) / Number of DB Servers
Example:
5 app servers × 10 threads = 50 connections
2 database servers = 25 connections per DB
Avoid:
- Too small: Connection exhaustion, queuing
- Too large: Memory overhead, connection limits
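The sizing formula is simple enough to sanity-check in code, using the numbers from the example above:

# Sizing formula from above, as a quick calculation
def pool_size_per_db(app_servers, threads_per_server, db_servers):
    return (app_servers * threads_per_server) // db_servers

assert pool_size_per_db(5, 10, 2) == 25  # matches the example: 25 per DB server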