| name | data-engineering-lifecycle-and-principles |
| description | Apply Reis & Housley's Data Engineering Lifecycle (Generation → Storage → Ingestion → Transformation → Serving) plus the six undercurrents (Security, Data Management, DataOps, Data Architecture, Orchestration, Software Engineering) and the nine architecture principles (common components, plan for failure, scalability, leadership, always architecting, loose coupling, reversibility, security, FinOps). Use when scoping a new data platform, diagnosing why a data system is failing, deciding what role / team structure a company needs, or evaluating maturity. Triggers: "build a data platform", "are we doing data engineering right", "what's the data engineering lifecycle", "data team structure", "data maturity", "data engineering principles", "data engineer vs data scientist".
|
Data Engineering Lifecycle and Principles
You apply Joe Reis & Matt Housley's framework from Fundamentals of Data
Engineering: data engineering is a discipline organized around a lifecycle
(end-to-end flow of data) and undercurrents (cross-cutting concerns), guided
by nine architecture principles that hold across cloud-native data systems.
When to Use This Skill
- Scoping a new data platform from scratch — what stages and undercurrents must be addressed
- Diagnosing why an existing data team is firefighting — likely a missing undercurrent
- Deciding team structure: when to split data engineer / analytics engineer / data scientist roles
- Evaluating organizational data maturity (Starting / Scaling / Leading)
- Greenfield architecture review against the nine principles
- Educating a non-data leader on what data engineering actually is
The Lifecycle (Five Stages)
GENERATION → STORAGE → INGESTION → TRANSFORMATION → SERVING
│
├── Analytics (BI, dashboards)
├── ML (training data, features)
└── Reverse ETL (back to SaaS)
Each stage has its own concerns. Storage is unique — it spans all stages because
data is stored at every transition.
| Stage | Primary concern | Common pitfalls |
|---|
| Generation | Source system reliability, schema, volume | Engineers ignored upstream; "we'll deal with it downstream" |
| Storage | Cost vs latency tradeoff; retention; format | Storing raw forever; one-size-fits-all storage |
| Ingestion | Frequency (batch vs streaming), idempotency, ordering | Building bespoke pipelines instead of using CDC tooling |
| Transformation | Schema evolution, lineage, idempotency, testability | Untested SQL; no rerun safety; no semantic layer |
| Serving | Query patterns, freshness SLAs, access control | Serving raw data without modeling; no semantic layer |
The Six Undercurrents
Cross-cutting concerns that span every lifecycle stage. A weakness in any
undercurrent eventually breaks the lifecycle.
| Undercurrent | What it covers | Failure looks like |
|---|
| Security | Access control, encryption, secrets, audit, least privilege | Compliance fire drill; data leaks; broad-default access |
| Data Management | Lineage, governance, quality, master data, metadata, MDM | "Where does this number come from?" with no answer |
| DataOps | Monitoring, SLOs, deployments, on-call for data | 3am pages with no runbook; silent data quality failures |
| Data Architecture | Platform design, integration patterns, build/buy decisions | Tool sprawl; bespoke everything; can't migrate |
| Orchestration | DAG management, dependency tracking, retries, scheduling | Cron jobs everywhere; no dependency graph; ad-hoc reruns |
| Software Engineering | Code quality, testing, version control, IaC, CI/CD | Spreadsheets; SQL in Slack; no PRs |
The discipline diagnostic: if your team is missing one undercurrent, that's
where the next outage / compliance issue / migration disaster will originate.
The Nine Architecture Principles
Source: Reis & Housley, distilled across multiple cloud architecture frameworks
(AWS Well-Architected, Google Cloud's 5 principles, plus the authors' synthesis).
1. Choose Common Components Wisely
Pick widely adopted, interoperable building blocks. Avoid bespoke when commodity will do.
In practice: Postgres before custom-built. Airflow / Dagster / Prefect before homegrown DAG runner. Parquet / Iceberg / Delta before proprietary formats.
2. Plan for Failure
Every component fails. Define availability, durability, RTO (Recovery Time Objective), RPO (Recovery Point Objective) explicitly.
In practice: Multi-region for tier-1 data. Idempotent writes everywhere. Automated runbooks. Practiced restores, not theoretical ones.
3. Architect for Scalability
Scale up and scale down. Scale to zero is a feature; idle resources are waste.
In practice: Serverless for spiky loads. Auto-scaling for sustained ones. Separate compute from storage so each scales independently.
4. Architecture Is Leadership
Architects are senior engineers, not ivory-tower designers. They mentor, document, and decentralize decisions.
In practice: Architects on PRs. Architects writing ADRs. Architects pairing with juniors, not gatekeeping releases.
5. Always Be Architecting
The target architecture is a moving target. Continuously refactor; don't wait for "the rewrite."
In practice: Quarterly architecture reviews. Strangler-fig migrations. No "freeze for 6 months while we redesign."
6. Build Loosely Coupled Systems
Components communicate via abstraction layers (APIs, events, queues). Internal change is invisible to consumers.
In practice: Contract-tested APIs. Event-driven where producers and consumers shouldn't know each other. No shared database mutations between services.
7. Make Reversible Decisions
Default to choices that can be undone. The cost of being wrong should be hours, not years.
In practice: Open formats (Parquet over proprietary). Cloud-portable infrastructure (Terraform, Kubernetes). Multi-cloud-friendly storage abstractions. Don't sign 5-year DW contracts on month 1.
8. Prioritize Security
Security is part of design, not a final-stage checklist. Least privilege, zero trust, encryption at rest and in transit, secrets management.
In practice: No service accounts with admin. No long-lived credentials. Audit logs for every PII access. Secrets in a vault, not env vars in YAML.
9. Embrace FinOps
Cloud is consumption-based. Treat cost as a continuous engineering concern, not an end-of-quarter spreadsheet shock.
In practice: Cost-per-query dashboards. Tagged workloads. Compute warehouse auto-suspend. Engineering KPI: $/business-outcome, not $/raw-resource.
Data Maturity Model (Three Stages)
| Stage | Characteristics | Right move | Wrong move |
|---|
| Starting with data | Ad hoc requests, spreadsheets, single analyst, no platform | Build the foundation: warehouse + ETL + dashboards. Defer ML. | Hire a data scientist as the first data hire. |
| Scaling with data | Multiple teams, formal practices emerging, data platform exists, governance gaps | Invest in undercurrents: lineage, quality, DataOps. Establish ownership. | Buy 8 vendor tools to "modernize" before fixing process. |
| Leading with data | Mature platform, data products, decentralized ownership (often Data Mesh), constant maintenance | Optimize, extend, govern. Push compute and decisions to domains. | Build a central monolith because "consistency". |
Diagnostic: Maturity is determined by how the team operates, not how much it spends. A 3-person team using dbt + Snowflake + Airflow can be at "Scaling"; a 50-person team firefighting can be at "Starting".
Data Engineer Roles — Type A vs Type B
Type A (Abstraction): Builds on managed services and common tooling. Composes a stack from existing parts. The right hire for most companies.
Type B (Build): Develops custom tools and platforms when off-the-shelf doesn't fit (FAANG-scale problems, novel domains).
Internal-facing: BI, executive dashboards, finance reports. Stable schemas; correctness over latency.
External-facing: Product analytics, transactional event data, customer-visible dashboards. Latency-sensitive; reliability over fancy modeling.
Hiring rule: Identify Type A vs Type B and Internal vs External before writing the job description. Confusing these produces bad hires and frustrated engineers.
Principles
- Lifecycle thinking beats stage thinking. A failure at Serving usually originates in Generation or Storage. Trace upstream.
- Undercurrents are non-negotiable. Missing one will eventually outweigh all the gains from your shiniest pipeline tool.
- Business value > technology novelty. A boring Postgres + cron stack that ships analytics weekly beats a bespoke streaming Kappa setup that's 6 months from production.
- Reversibility > optimization. Optimizing for a wrong assumption is worse than carrying a 20% inefficiency that you can refactor away.
- Cost is a design dimension. $5K/month of unattended Snowflake warehouse equals one engineer's headcount. Treat it accordingly.
- Data quality is a feature, not a phase. Build assertions, monitors, and data contracts in from day 1. Adding them later is 10x harder.
- Production-grade software discipline applies. Tests, version control, code review, IaC. "It's just a notebook" is an anti-pattern.
Anti-Patterns
Premature Machine Learning
Looks like: Hiring a data scientist before you have reliable ETL. Promising "AI features" with no production-quality training data.
Why it fails: The Data Science Hierarchy of Needs is real. 70-80% of a DS's time is foundational work that should be done by data engineering. Skipping the foundation produces unreliable models and frustrated DS hires.
The fix: Foundation first. Lifecycle stages 1-4 must be reliable before stage 5 (Serving for ML) is meaningful.
Tool-First Architecture
Looks like: "We need Snowflake/Databricks/Airflow because everyone has it." Tool selected before use case understood.
Why it fails: Tools don't solve undefined problems. Your team ends up customizing the tool to fit the unclear use case, multiplying effort.
The fix: Use case → access patterns → architecture → tool. In that order.
"Big Data" as the Default Frame
Looks like: Treating every dataset as if it were petabyte-scale. Choosing distributed everything.
Why it fails: Most companies have small data. Distributed systems are 10x more operational burden than a single large machine.
The fix: Measure your actual data volume and growth. A single Postgres instance handles 90% of business data. Don't reach for Spark until you have to.
One-Size-Fits-All Storage
Looks like: Everything in one warehouse, or everything in one lake, or everything in one database.
Why it fails: Workloads have different access patterns, latency requirements, and cost profiles. Forcing all into one tier overpays for some and underdelivers for others.
The fix: Layered storage. Hot path (low latency, expensive). Warm path (queryable, moderate cost). Cold path (archival, cheap). Each workload uses the appropriate tier.
Architects as Gatekeepers
Looks like: Architects produce designs from a tower; engineers implement them and find they don't fit reality.
Why it fails: Architects lose touch with the actual system. Designs miss constraints. Engineers feel disempowered. Architecture decisions ossify.
The fix: Architects on the team, in PRs, doing some implementation. Architects mentor and decentralize, not approve and gate.
Static Architecture
Looks like: "We did the architecture review last year. We'll redo it next year."
Why it fails: Business changes faster than annual reviews. Drift accumulates. The "next architecture rewrite" balloons until it's existential.
The fix: Always Be Architecting. Quarterly small refactors. Strangler-fig migrations. Never a 6-month freeze.
Cost as Afterthought
Looks like: End-of-quarter shock at the cloud bill. Engineers rewriting last quarter's code for cost.
Why it fails: Cost surfaces are far from the engineer who controls them. By the time finance complains, the architecture is locked in.
The fix: Cost telemetry on every workload. Engineering KPI on $/outcome. Tag everything. Auto-suspend idle resources.
Decision Rules
| Situation | Action |
|---|
| Greenfield data platform, < 50 person company | Cloud DW (Snowflake / BigQuery / Redshift) + dbt + managed orchestrator. Skip custom. |
| Real-time analytics required | Confirm "real time" is genuinely sub-second; usually "near-real-time" (1-10 min) is enough and 10x simpler. |
| First data hire | Type A internal-facing data engineer. Not data scientist. Not architect. |
| ML team requesting features | Confirm DE foundation is solid. If not, fix that first. |
| Multiple teams want different DWs | Either Data Mesh (decentralized, but requires maturity) or one DW with strong governance. Don't pretend "5 DWs is fine, they'll integrate later." |
| Vendor proposing 3-year contract | Reject. Reversibility principle. Negotiate a 1-year with extension options. |
| Tool sprawl complaint | Audit by undercurrent. Likely some undercurrent has 4 tools that should be 1. |
| "We need a data scientist" | Verify lifecycle stages 1-4 are reliable. If not, hire a data engineer first. |
| Annual architecture review only | Switch to quarterly. Always Be Architecting. |
| Cloud bill surprise | Establish cost telemetry. Tag every workload. Set per-workload budget alerts. |
| Data quality issue blamed on engineer | Likely a missing undercurrent (governance, lineage, monitoring). Fix the system, not the person. |
Worked Example: Diagnosing a Failing Data Team
Context: Series-B startup, 50 employees, 4-person data team. Constant fires: stale dashboards, exec frustration, two recent compliance escalations.
Lifecycle audit:
| Stage | State | Issue |
|---|
| Generation | OK | Production app emits clean events |
| Storage | Cluttered | One Snowflake DB, no schemas, no governance |
| Ingestion | Bespoke | Custom Python scripts on cron; no idempotency; no monitoring |
| Transformation | Untested | Inline SQL in Looker; no version control |
| Serving | Slow | Looker queries hit raw tables; 2-min response time |
Undercurrents audit:
| Undercurrent | State | Issue |
|---|
| Security | Failing | Service account with admin; PII in shared schemas |
| Data Management | Failing | No lineage, no MDM, "what is a customer" varies |
| DataOps | Failing | No SLOs; no monitoring; on-call is one engineer's Slack |
| Data Architecture | OK | Snowflake choice was sound |
| Orchestration | Failing | Cron + Python; no DAG visibility |
| Software Engineering | Failing | No PR review on transformations; no tests; no IaC |
Diagnosis: Architecture is OK. Five of six undercurrents are failing. This is why fires don't stop.
Plan (six months, in priority order):
- DataOps: Adopt dbt + Airflow. Move SQL to version control. Add tests on every model. (Wins: ingestion + transformation + orchestration + SE undercurrents.)
- Data Management: Define data product owners. Adopt a metadata catalog (DataHub / Atlan / OpenMetadata). Establish data contracts at ingestion.
- Security: Audit access. Remove admin service accounts. Mask PII in non-prod. Enforce least privilege.
- DataOps continued: Add SLOs on freshness and quality. Monitor with synthetic queries. Establish on-call runbook.
- Serving: Build a semantic layer. Materialize core dimensional models. Move Looker off raw tables.
- Architecture review: Now that fires are out, quarterly architecture reviews to keep undercurrents healthy.
Lesson: Tool changes are the easy part. The undercurrents are the work. A data team that's "constantly behind" is almost always missing undercurrents, not architectural pieces.
Gotchas
- Lifecycle is not waterfall. Stages happen continuously and concurrently. The diagram is conceptual.
- Undercurrents aren't optional even at small scale. A 2-person team still needs DataOps; it just looks different (a Slack channel + dbt tests instead of a 24/7 on-call).
- "Type A vs Type B" is not seniority. Both are senior roles, just with different focus. Confusing them is a hiring error.
- Maturity is per-undercurrent. A team can be Leading on Architecture and Starting on Security simultaneously. Audit each independently.
- The 9 principles can conflict. "Plan for failure" (multi-region) is in tension with "FinOps" (single-region is cheaper). Architecture is making the tradeoff explicit.
- Reversibility is a tax. Reversible choices are sometimes 10-20% slower or pricier than locked-in ones. The principle says: pay the tax for important decisions, not for trivial ones.
- "Data engineer" is not a unified role. Different companies mean different things. Read the job description carefully — it could be ETL plumber, analytics engineer, or platform builder.
Further Reading
- Fundamentals of Data Engineering by Joe Reis & Matt Housley (O'Reilly, 2022) — Chapters 1, 3, and 11 for this skill
- See
data-architecture-frameworks for TOGAF, DAMA, AWS Well-Architected, Lambda, Kappa, Data Mesh
- See
data-storage-and-modeling-patterns for storage tiers, ingestion patterns, dimensional modeling
- See
dataops-and-modern-data-platforms for DataOps practices and Modern Data Stack composition
Source: Fundamentals of Data Engineering (Reis & Housley), Chapters 1-2 (lifecycle + undercurrents) and 3 (architecture principles).