ワンクリックで
data-eng-warehouse-patterns
// Patterns and best practices for cloud data warehouses (Snowflake, BigQuery, Redshift), lakehouse architectures, Data Vault 2.0, and ELT pipeline design
// Patterns and best practices for cloud data warehouses (Snowflake, BigQuery, Redshift), lakehouse architectures, Data Vault 2.0, and ELT pipeline design
Patterns for building and managing cloud data infrastructure on AWS and GCP using Infrastructure as Code, data lake architectures, cost optimization, and security best practices.
Data quality validation, observability, and monitoring for data pipelines. Use this skill when implementing data quality checks with Great Expectations or Soda Core, designing schema contracts, building anomaly detection, or establishing data observability practices. Covers validation frameworks, quality metrics, SLAs, freshness monitoring, and lineage tracking.
Streaming data patterns for event-driven architectures and real-time processing. Use this skill when building Kafka pipelines, implementing CDC, designing event sourcing systems, or working with stream processing frameworks like Flink and Kafka Streams. Covers delivery guarantees, backpressure, dead letter queues, and production-grade streaming infrastructure.
Testing patterns for data engineering pipelines and transformations. Use this skill when writing tests for SQL transforms, dbt models, data contracts, pipeline integration tests, or managing test data. Covers pytest-sql, dbt testing, contract testing, regression testing, and synthetic data generation for reliable data infrastructure.
Production-ready patterns for continuous integration and continuous deployment pipelines across GitHub Actions, GitLab CI, and general pipeline design principles.
Docker containerization patterns including Dockerfile best practices, Compose orchestration, image optimization, networking, volumes, and security hardening for production workloads.
| name | data-eng-warehouse-patterns |
| description | Patterns and best practices for cloud data warehouses (Snowflake, BigQuery, Redshift), lakehouse architectures, Data Vault 2.0, and ELT pipeline design |
Modern cloud warehouses favor ELT because transformations run inside the warehouse engine, leveraging massive parallel processing. External ETL tools add latency, maintenance burden, and failure points.
-- ELT pattern: land raw JSON, then transform in-warehouse
-- Step 1: Load raw data into staging
COPY INTO raw.stripe_events
FROM @s3_stage/stripe/
FILE_FORMAT = (TYPE = JSON);
-- Step 2: Transform with SQL into clean layer
CREATE OR REPLACE TABLE cleaned.payments AS
SELECT
raw:id::STRING AS event_id,
raw:data:object:amount::NUMBER / 100 AS amount_dollars,
raw:data:object:currency::STRING AS currency,
raw:created::TIMESTAMP_NTZ AS event_timestamp,
CURRENT_TIMESTAMP() AS _loaded_at
FROM raw.stripe_events
WHERE raw:type::STRING = 'charge.succeeded';
See elt-pipeline-patterns for: medallion architecture, incremental loading strategies, staging layer design.
Snowflake separates storage and compute with virtual warehouses. Use multi-cluster warehouses for concurrency, and leverage time travel and zero-copy clones for development.
-- Create a warehouse sized for heavy transformation workloads
CREATE WAREHOUSE transform_wh
WAREHOUSE_SIZE = 'X-LARGE'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 4
SCALING_POLICY = 'STANDARD';
-- Cluster a large fact table on commonly filtered columns
ALTER TABLE analytics.fct_orders
CLUSTER BY (order_date, region_id);
-- Use time travel to recover from bad writes
SELECT * FROM analytics.fct_orders
AT(OFFSET => -3600); -- state from 1 hour ago
See snowflake-patterns for: warehouse sizing guidelines, external stages, COPY INTO options, clustering key selection, time travel and cloning.
BigQuery is serverless — there are no clusters to manage. Optimize cost by partitioning tables, clustering on high-cardinality filter columns, and using materialized views to accelerate repeated queries.
-- Partitioned and clustered table for event analytics
CREATE TABLE analytics.user_events (
event_id STRING NOT NULL,
user_id STRING NOT NULL,
event_name STRING,
event_params ARRAY<STRUCT<key STRING, value STRING>>,
device STRUCT<category STRING, os STRING, browser STRING>,
event_timestamp TIMESTAMP NOT NULL,
_partition_date DATE NOT NULL
)
PARTITION BY _partition_date
CLUSTER BY user_id, event_name
OPTIONS (
partition_expiration_days = 365,
require_partition_filter = TRUE
);
See bigquery-patterns for: partition strategies, nested/repeated fields, materialized views, BI Engine reservations, slot management.
Redshift uses a shared-nothing MPP architecture. Distribution keys control how data is spread across nodes, and sort keys define on-disk ordering for range-restricted scans.
-- Fact table with optimized distribution and sort keys
CREATE TABLE analytics.fct_page_views (
page_view_id BIGINT IDENTITY(1,1),
session_id VARCHAR(64) NOT NULL,
user_id BIGINT DISTKEY,
page_url VARCHAR(2048),
referrer_url VARCHAR(2048),
view_duration INTEGER,
view_timestamp TIMESTAMP SORTKEY,
device_type VARCHAR(32),
country_code VARCHAR(2)
)
DISTSTYLE KEY
COMPOUND SORTKEY (view_timestamp, user_id);
See redshift-patterns for: distribution styles (KEY, ALL, EVEN, AUTO), sort key types, COPY from S3, WLM configuration, Redshift Spectrum.
Lakehouse combines the low-cost storage of data lakes with the ACID transactions and performance of data warehouses. Open table formats like Apache Iceberg and Delta Lake enable this convergence.
-- Apache Iceberg table creation in Snowflake (external volume)
CREATE ICEBERG TABLE bronze.web_clickstream (
click_id STRING,
session_id STRING,
url STRING,
referrer STRING,
user_agent STRING,
click_timestamp TIMESTAMP,
_loaded_at TIMESTAMP
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_s3_vol'
BASE_LOCATION = 'clickstream/'
FILE_FORMAT = (TYPE = PARQUET);
-- Time travel with Iceberg snapshots
SELECT * FROM bronze.web_clickstream
AT(TIMESTAMP => '2026-02-13 10:00:00'::TIMESTAMP);
See lakehouse-patterns for: Iceberg vs Delta Lake comparison, table format internals, schema evolution, partition evolution, time travel mechanics.
Data Vault 2.0 is a modeling methodology for enterprise data warehouses. It separates structural data (hubs, links) from descriptive data (satellites), enabling parallel loading and full historical tracking.
-- Hub: business key registry for customers
CREATE TABLE raw_vault.hub_customer (
hub_customer_hk BINARY(32) NOT NULL, -- SHA-256 hash key
customer_bk VARCHAR(64) NOT NULL, -- business key
load_timestamp TIMESTAMP_NTZ NOT NULL,
record_source VARCHAR(128) NOT NULL,
CONSTRAINT pk_hub_customer PRIMARY KEY (hub_customer_hk)
);
-- Satellite: descriptive attributes with full history
CREATE TABLE raw_vault.sat_customer_details (
hub_customer_hk BINARY(32) NOT NULL,
load_timestamp TIMESTAMP_NTZ NOT NULL,
load_end_timestamp TIMESTAMP_NTZ,
hash_diff BINARY(32) NOT NULL, -- change detection
first_name VARCHAR(128),
last_name VARCHAR(128),
email VARCHAR(256),
phone VARCHAR(32),
record_source VARCHAR(128) NOT NULL,
CONSTRAINT pk_sat_cust PRIMARY KEY (hub_customer_hk, load_timestamp)
);
See data-vault-patterns for: hub/link/satellite design rules, hash key generation, point-in-time tables, bridge tables, business vault.
| Avoid | Use Instead |
|---|---|
| Transforming data outside the warehouse in Python/Spark when SQL suffices | ELT: load raw, transform with SQL inside the warehouse |
| Single monolithic warehouse/cluster for all workloads | Separate warehouses per workload (ETL, BI, ad-hoc) |
| No partitioning on large tables (full table scans) | Partition on date/time columns; require partition filters |
Using SELECT * in production queries and views | Explicit column lists to reduce scan volume and breakage risk |
| Storing all history in slowly changing dimension type 1 (overwrite) | Use SCD type 2 or Data Vault satellites for full history |
| Hard-deleting rows from raw/staging layers | Soft-delete with _deleted_at flag; keep raw data immutable |
| Running large transforms during peak BI query hours | Schedule heavy transforms during off-peak windows |
| Skipping clustering/sort keys on frequently filtered columns | Profile query patterns and cluster on top 2-3 filter columns |
| Loading many small files (small file problem) | Batch files to 100-250 MB compressed before loading |
| Granting broad warehouse access without resource monitors | Use resource monitors and per-team warehouse isolation |
CURRENT_TIMESTAMP() or non-deterministic functions in cached queries.EXPLAIN, query history views, and warehouse utilization dashboards to find bottlenecks before optimizing blindly.source: Cloud data warehouse documentation (Snowflake, BigQuery, Redshift), Data Vault 2.0 methodology, Apache Iceberg and Delta Lake specifications, dbt best practices