ワンクリックで
data-engineering-cloud-infrastructure
// Patterns for building and managing cloud data infrastructure on AWS and GCP using Infrastructure as Code, data lake architectures, cost optimization, and security best practices.
// Patterns for building and managing cloud data infrastructure on AWS and GCP using Infrastructure as Code, data lake architectures, cost optimization, and security best practices.
Data quality validation, observability, and monitoring for data pipelines. Use this skill when implementing data quality checks with Great Expectations or Soda Core, designing schema contracts, building anomaly detection, or establishing data observability practices. Covers validation frameworks, quality metrics, SLAs, freshness monitoring, and lineage tracking.
Streaming data patterns for event-driven architectures and real-time processing. Use this skill when building Kafka pipelines, implementing CDC, designing event sourcing systems, or working with stream processing frameworks like Flink and Kafka Streams. Covers delivery guarantees, backpressure, dead letter queues, and production-grade streaming infrastructure.
Testing patterns for data engineering pipelines and transformations. Use this skill when writing tests for SQL transforms, dbt models, data contracts, pipeline integration tests, or managing test data. Covers pytest-sql, dbt testing, contract testing, regression testing, and synthetic data generation for reliable data infrastructure.
Patterns and best practices for cloud data warehouses (Snowflake, BigQuery, Redshift), lakehouse architectures, Data Vault 2.0, and ELT pipeline design
Production-ready patterns for continuous integration and continuous deployment pipelines across GitHub Actions, GitLab CI, and general pipeline design principles.
Docker containerization patterns including Dockerfile best practices, Compose orchestration, image optimization, networking, volumes, and security hardening for production workloads.
| name | Data Engineering Cloud Infrastructure |
| description | Patterns for building and managing cloud data infrastructure on AWS and GCP using Infrastructure as Code, data lake architectures, cost optimization, and security best practices. |
Infrastructure as Code (IaC) First — Every resource is defined in version-controlled Terraform or equivalent. No manual console changes. Reproducible environments across dev, staging, and production.
Separation of Storage and Compute — Store data in object storage (S3/GCS) and attach compute engines (Athena, BigQuery, Spark) independently. Scale each layer on its own schedule and budget.
Layered Data Architecture — Organize data into raw, cleaned, and curated layers with clear contracts between them. Each layer has its own schema validation, retention policy, and access controls.
Cost-Aware Design from Day One — Choose file formats, partitioning strategies, and compute tiers based on query patterns and budget. Monitor spend continuously with alerts and automated shutdowns.
Least-Privilege Security — Grant the minimum permissions necessary. Use service accounts with scoped IAM roles, encrypt data at rest and in transit, and isolate workloads with VPC boundaries.
AWS provides a mature ecosystem for data engineering: S3 for storage, Glue for cataloging and ETL, Athena for ad-hoc queries, and Redshift for warehousing.
# Glue ETL job reading partitioned Parquet from S3
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_path", "target_path"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
# Read from raw layer with push-down predicate
raw_df = spark.read.parquet(args["source_path"]).filter(
"event_date >= '2025-01-01' AND event_date < '2025-02-01'"
)
# Deduplicate and write to cleaned layer
cleaned_df = raw_df.dropDuplicates(["event_id"]).repartition("event_date")
cleaned_df.write.partitionBy("event_date").mode("overwrite").parquet(args["target_path"])
See aws-data-stack for: S3 lifecycle policies, Glue crawlers, Athena optimization, Redshift Serverless configuration, and Step Functions orchestration.
GCP centers on BigQuery as an integrated storage-plus-compute warehouse, with GCS for staging, Dataflow for streaming, and Cloud Composer for orchestration.
-- BigQuery partitioned and clustered table for event analytics
CREATE TABLE IF NOT EXISTS `project.analytics.user_events`
(
event_id STRING NOT NULL,
user_id STRING NOT NULL,
event_type STRING NOT NULL,
event_payload JSON,
event_timestamp TIMESTAMP NOT NULL,
processing_date DATE NOT NULL
)
PARTITION BY processing_date
CLUSTER BY user_id, event_type
OPTIONS (
partition_expiration_days = 365,
require_partition_filter = TRUE,
description = 'User events partitioned by date, clustered for fast user lookups'
);
See gcp-data-stack for: BigQuery slots and reservations, Dataflow pipeline patterns, Pub/Sub exactly-once delivery, and Cloud Composer DAGs.
Define all cloud resources declaratively. Use modules to encapsulate reusable patterns like data lake buckets, warehouse clusters, and IAM role bindings.
# Terraform module for a data lake S3 bucket with lifecycle rules
module "data_lake_raw" {
source = "./modules/data-lake-bucket"
bucket_name = "acme-data-lake-raw-${var.environment}"
environment = var.environment
lifecycle_rules = [
{
id = "archive-old-data"
prefix = "events/"
transition_days = 90
transition_class = "GLACIER"
expiration_days = 730
}
]
versioning_enabled = true
encryption_kms_key_id = var.kms_key_arn
tags = {
Team = "data-engineering"
CostCenter = "analytics"
DataLayer = "raw"
}
}
See terraform-data-infra for: module patterns for S3/GCS, warehouse provisioning, IAM role definitions, remote state management, and environment promotion.
Organize object storage into layers with strict naming conventions, partitioning by query-relevant columns, and optimized file formats.
# Write DataFrame to data lake with Hive-style partitioning and Parquet
from pyarrow import parquet as pq
import pyarrow as pa
# Target ~128 MB files for optimal Athena/BigQuery performance
TARGET_FILE_SIZE = 128 * 1024 * 1024 # 128 MB
def write_to_data_lake(df, base_path, partition_cols):
"""Write pandas DataFrame as partitioned Parquet to data lake."""
table = pa.Table.from_pandas(df)
pq.write_to_dataset(
table,
root_path=base_path,
partition_cols=partition_cols,
compression="snappy",
max_rows_per_file=1_000_000,
existing_data_behavior="overwrite_or_ignore",
)
# Usage: s3://acme-data-lake-raw/events/event_date=2025-01-15/part-0001.parquet
write_to_data_lake(events_df, "s3://acme-data-lake-raw/events", ["event_date"])
See data-lake-patterns for: partitioning strategies, Parquet vs ORC vs Avro trade-offs, small file compaction, schema evolution, and Delta Lake/Iceberg table formats.
Cloud data costs accumulate in storage, compute, and data transfer. Target each category with specific strategies.
# AWS cost monitoring with CloudWatch alarms
Resources:
DataPipelineBudgetAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: data-pipeline-daily-spend
AlarmDescription: Alert when daily data pipeline spend exceeds $500
Namespace: AWS/Billing
MetricName: EstimatedCharges
Statistic: Maximum
Period: 86400
EvaluationPeriods: 1
Threshold: 500
ComparisonOperator: GreaterThanThreshold
AlarmActions:
- !Ref DataEngineeringSlackSNSTopic
Dimensions:
- Name: ServiceName
Value: AmazonAthena
See cost-optimization for: storage tiering strategies, compute autoscaling, reserved and spot capacity, query cost controls, and data transfer optimization.
Protect data with layered defenses: IAM policies scoped to specific resources, encryption for data at rest and in transit, and network isolation.
# IAM role for a Glue ETL job with least-privilege S3 access
resource "aws_iam_role" "glue_etl_role" {
name = "glue-etl-pipeline-${var.environment}"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = { Service = "glue.amazonaws.com" }
}]
})
}
resource "aws_iam_role_policy" "glue_s3_access" {
name = "glue-s3-scoped-access"
role = aws_iam_role.glue_etl_role.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = ["s3:GetObject", "s3:ListBucket"]
Resource = [
"arn:aws:s3:::acme-data-lake-raw-${var.environment}",
"arn:aws:s3:::acme-data-lake-raw-${var.environment}/events/*"
]
},
{
Effect = "Allow"
Action = ["s3:PutObject", "s3:DeleteObject"]
Resource = "arn:aws:s3:::acme-data-lake-cleaned-${var.environment}/events/*"
}
]
})
}
See security-patterns for: cross-account access, encryption key management, VPC endpoints, column-level security in BigQuery and Redshift, and audit logging.
| Avoid | Use Instead |
|---|---|
| Creating resources manually in the cloud console | Define everything in Terraform with code review and CI/CD |
| Storing data as uncompressed CSV in object storage | Use columnar formats (Parquet/ORC) with Snappy compression |
| Single large unpartitioned table in the data lake | Partition by date or high-cardinality query column |
Granting s3:* or bigquery.admin to service accounts | Scope IAM policies to specific buckets, prefixes, and actions |
| Running full-table scans without partition pruning | Always include partition column in WHERE clauses |
| Leaving dev/test clusters running 24/7 | Auto-terminate non-production resources on schedule |
| Storing all data in one storage class forever | Apply lifecycle policies to transition to cold/archive tiers |
| Hardcoding credentials in ETL scripts | Use IAM roles, service accounts, and secrets managers |
| Writing thousands of small files to object storage | Compact small files into 128-256 MB Parquet files |
| Using a single AWS account for all environments | Separate accounts per environment with cross-account roles |
source: Cloud provider documentation (AWS, GCP), Terraform Registry, Apache Parquet specification, Delta Lake and Apache Iceberg documentation.