with one click
dev-data
// Data engineering and analysis guide for orchestrated sub-agents. Data pipelines, ETL/ELT design, data quality validation, SQL optimization, and analysis patterns. Injected when role=data.
// Data engineering and analysis guide for orchestrated sub-agents. Data pipelines, ETL/ELT design, data quality validation, SQL optimization, and analysis patterns. Injected when role=data.
Unified desktop + browser automation. Routes DOM targets to CDP (cli-jaw browser), desktop apps to Computer Use, hybrid combos to both. Codex desktop/CLI + macOS required for Computer Use.
Backend engineering guide for orchestrated sub-agents. Framework-agnostic API design, clean architecture, database optimization, security hardening, systematic debugging. Modular: SKILL.md orchestrator + references/ for deep guidance. Injected when role=backend.
Code review guide for all orchestrated sub-agents. Review process, quality thresholds, antipattern detection, giving/receiving feedback. Available to any agent regardless of role โ read this SKILL.md when performing or receiving code reviews.
Systematic debugging methodology for all orchestrated sub-agents. 4-phase root cause analysis: investigate โ analyze โ hypothesize โ implement. Injected when encountering errors or during debugging phase.
Production-grade frontend with distinctive aesthetics. Detects stack and applies specialized rules. Modular: SKILL.md orchestrator + references/ for deep guidance. Injected when role=frontend.
PABCD orchestration workflow. Structured 5-phase development with user checkpoints. Injected during orchestration mode.
| name | dev-data |
| description | Data engineering and analysis guide for orchestrated sub-agents. Data pipelines, ETL/ELT design, data quality validation, SQL optimization, and analysis patterns. Injected when role=data. |
| keywords | ["dashboard-db","sqlite","audit-log-schema","connector-data"] |
Production-grade data engineering patterns for building reliable data systems.
Five rules that apply to every data task:
| Principle | What It Means |
|---|---|
| Pipeline thinking | Every pipeline is Extract โ Transform โ Load. Keep each stage as an independent, testable function. |
| Schema-first | Define expected columns, types, and constraints BEFORE writing transformation logic. |
| Defensive parsing | External data will have nulls, wrong types, extra columns, missing columns, and encoding issues. Assume all of these. |
| Idempotent operations | Running the same pipeline twice on the same input must produce the same output. Use upsert patterns, not blind inserts. |
| Fail fast, fail loud | Raise errors on invalid data immediately โ silent failures produce wrong downstream results that are harder to debug. |
| Format | Best For | Watch Out For |
|---|---|---|
| CSV | Simple tabular data, human-readable | Encoding (UTF-8 BOM), delimiter ambiguity, multiline values, inconsistent quoting |
| JSON | Nested structures, API responses | Large files (stream, don't load all at once), deeply nested objects, encoding |
| Parquet | Large analytical datasets, columnar queries | Requires library support, not human-readable, schema evolution |
| Excel | Business user handoffs | Multiple sheets, merged cells, formulas vs. values, date formatting |
| Database | Production system access | Connection pooling, query timeouts, use read replicas for analytics |
For large or frequently updated data sources:
updated_at, id) to track the last processed record.loaded_rows should equal source_rows_since_watermark.Before any transformation, validate incoming data:
โ
Check: Expected columns exist
โ
Check: Data types match (string, number, date, boolean)
โ
Check: Required fields are not null
โ
Check: Values are within expected ranges
โ
Check: No unexpected duplicate keys
โ Fail: If any check fails, write to error log with row details. Don't silently drop.
Rules:
When using dbt for transformations, follow the staging โ intermediate โ mart layer architecture:
Rules:
schema.yml with tests (not_null, unique, relationships, custom SQL).dbt source freshness to monitor upstream data staleness| Scenario | Pattern |
|---|---|
| Invalid records | Write to dead-letter table/file for manual review. Preserve every record for debugging. |
| Source unavailable | Retry with exponential backoff (1s, 2s, 4s). Alert after 3 failures. |
| Schema mismatch | Halt pipeline. Log expected vs. actual schema. Don't attempt partial loads. |
| Duplicate records | Use upsert (INSERT ON CONFLICT UPDATE) or deduplicate with window functions. |
When pipelines have multiple steps with dependencies:
Run these after every pipeline step, not just at the end:
| Check | What It Validates | Example |
|---|---|---|
| Not null | Required fields have values | WHERE order_id IS NULL โ 0 rows |
| Unique | No duplicates on key columns | COUNT(*) = COUNT(DISTINCT id) |
| Range | Numeric values within bounds | amount BETWEEN 0 AND 1,000,000 |
| Categorical | Values in allowed set | status IN ('pending', 'active', 'closed') |
| Freshness | Data is recent enough | MAX(updated_at) > NOW() - INTERVAL '24 hours' |
| Row count | No unexpected data loss or explosion | Within ยฑ10% of previous run |
| Referential | Foreign keys point to existing records | customer_id EXISTS IN customers |
Use a layered quality strategy โ different tools at different pipeline stages:
| Stage | Tool | Purpose |
|---|---|---|
| Ingest | Great Expectations | Validate raw data against expectations before staging |
| Transform | dbt tests | Assert model-level quality (not_null, unique, relationships, custom SQL) |
| Production | Soda / Monte Carlo | Real-time monitoring, anomaly detection, SLA enforcement |
Validate data dimensions: completeness, uniqueness, range, format, referential integrity, freshness.
Rule: Run validation on every pipeline step โ skipping "because the data looks fine" leads to silent downstream corruption.
For datasets shared between teams, define a contract:
A data contract must include:
Changes to a contracted schema require versioning and consumer notification.
Before any deep analysis, provide:
| Metric | What to Report |
|---|---|
| Row count | Total records in dataset |
| Column inventory | Name, type, null count per column |
| Numeric summary | min, max, mean, median, std dev |
| Categorical summary | Unique values, top 5 most frequent |
| Time range | Earliest and latest timestamp |
| Data quality | Null percentage, duplicate percentage |
| Format | When to Use |
|---|---|
| Markdown tables | Inline reports, โค50 rows, quick summaries |
| JSON | Programmatic consumption, API responses |
| CSV export | Handoff to spreadsheet users, large datasets |
| HTML + charts | Dashboards, visual reports (Chart.js, Mermaid diagrams) |
When analysis involves statistics:
| Condition | Choose |
|---|---|
| Real-time insight required (sub-minute latency) | Streaming (Kafka + Flink, Spark Structured Streaming, or Kafka Streams depending on complexity) |
| Exactly-once semantics needed | Kafka transactional producers + Flink/Spark |
| Latency >1 min acceptable, volume >1TB/day | Distributed batch (Spark, Databricks) |
| Latency >1 min acceptable, volume <1TB/day | Single-node batch (SQL, Python, dbt) |
Default to batch. Streaming adds significant complexity in error handling, state management, and debugging. Only use streaming when latency requirements genuinely demand it.
| Latency Requirement | Framework | Complexity |
|---|---|---|
| Sub-100ms, complex stateful | Apache Flink | High (dedicated cluster) |
| Sub-second, existing Spark infra | Spark Structured Streaming | Medium |
| Sub-second, Kafka-centric | Kafka Streams (embedded library) | Low-Medium |
| Minutes acceptable | Batch with frequent scheduling | Low |
Kafka essentials for data engineers:
See references/streaming.md for Kafka configuration, CDC patterns, and windowing.
| Need | Choose |
|---|---|
| SQL analytics, BI dashboards, structured queries | Data warehouse (Snowflake, BigQuery, PostgreSQL) |
| ML training, unstructured data, large-scale storage | Data lake (S3/GCS + Parquet or Delta format) |
| Both SQL and ML needs | Lakehouse (Delta Lake, Apache Iceberg) |
| Real-time key-value lookups, caching | Redis, DynamoDB |
| Graph relationships | Neo4j, Neptune |
| Category | Options |
|---|---|
| Orchestration | Airflow, Prefect, Dagster |
| Transformation | dbt, Spark, plain SQL |
| Streaming | Kafka, Kinesis, Pub/Sub |
| Quality | Great Expectations, dbt tests, Soda, custom validators |
| Monitoring | Prometheus, Grafana, Datadog, Monte Carlo |
| Local analysis | DuckDB (in-process SQL), Polars (fast DataFrame), pandas (exploration/ML) |
| Factor | pandas | Polars | DuckDB |
|---|---|---|---|
| Best for | <1GB, exploration, ML prep | >1GB, batch ETL, performance | SQL analytics, ad-hoc queries |
| Execution | Single-threaded, eager | Multi-threaded Rust, lazy eval | Vectorized, auto disk spill |
| Speed (groupby/join) | Baseline | 5-10x faster | Matches Polars on SQL-native |
| Memory | Full load into RAM | Streaming, lazy chains | Spill-to-disk for out-of-core |
| API style | DataFrame (imperative) | DataFrame (expression-based) | SQL-first |
| ML interop | Excellent (scikit-learn, etc.) | Good (.to_pandas()) | Good (.fetchdf()) |
| File format | CSV, JSON, Excel | CSV, Parquet, Arrow-native | CSV, Parquet, JSON, S3 direct |
Decision rule:
| Data size / workflow | Recommended tool |
|---|---|
| Small (<100MB), interactive exploration | pandas |
| Medium (100MB-10GB), batch transforms | Polars |
| SQL-first analytics, any size | DuckDB |
| Blended workflow | Polars transforms, DuckDB aggregations (zero-copy via Arrow) |
See references/tools.md for full patterns and code examples.
| Level | Examples | Handling |
|---|---|---|
| Public | Aggregated metrics, public reports | No restrictions |
| Internal | Business KPIs, operational data | Access controls, no external sharing |
| Confidential | Customer data, financial records | Encryption at rest, column-level masking |
| Restricted | SSN, payment data, health records | Tokenization, row-level security, audit logging |
Before building any pipeline that touches PII:
| Requirement | Engineering Pattern |
|---|---|
| Right to erasure | Soft delete โ batch purge โ propagate to downstream stores including data lake |
| Data minimization | Collect only necessary fields; TTL on non-essential data |
| Consent tracking | Consent event store with versioned preferences; consent-aware pipeline branches |
| Data portability | Standardized export endpoint (JSON/CSV) per user request |
See references/governance.md for detailed implementation patterns, row-level security, and retention policies.
Data engineering does not exist in isolation. Cross-reference these skills when your pipeline connects to other systems:
| Companion | When to Consult | Key Sections |
|---|---|---|
dev-backend | Exposing data via API, response envelope shape, pagination | ยง5 API Response Contract, ยง2 Layered Architecture |
dev-security | PII handling, data classification, access controls, audit logging | ยง1 Input Validation, ยง4 Secrets, ยง8 Pre-Flight |
dev-testing | Pipeline validation, contract tests for data APIs, CI gates | ยง2 Backend & API Testing, ยง3 Contract Testing |
dev-frontend | Downstream reporting/dashboard consumers, data format expectations | ยง8 Backend Contract & Security Alignment |
Integration patterns:
dev-backend ยง5)dev-security guidance before this skill's ยง7 rules