
data-engineering

ETL/ELT pipelines, data warehousing (BigQuery, Snowflake, Redshift), stream processing (Kafka, Spark Streaming), orchestration (Airflow, Dagster, Prefect), dbt transformations, and data lake architecture. Use when building data pipelines, designing warehouse schemas, or implementing real-time data processing.


Install command

npx skills add https://github.com/travisjneuman/.claude --skill data-engineering

Copy this command and paste it into Claude Code to install the skill.

Source

travisjneuman/.claude

Stars: 60

Forks: 13

Updated: March 28, 2026, 02:26

SKILL.md


name	data-engineering
description	ETL/ELT pipelines, data warehousing (BigQuery, Snowflake, Redshift), stream processing (Kafka, Spark Streaming), orchestration (Airflow, Dagster, Prefect), dbt transformations, and data lake architecture. Use when building data pipelines, designing warehouse schemas, or implementing real-time data processing.

Data Engineering

Pipeline Architecture

ETL vs ELT

Pattern	When to Use	Tools
ETL	Transform before loading, data quality critical	Airflow + custom, Spark
ELT	Raw → warehouse → transform in-place	Fivetran + dbt, Airbyte + dbt
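
The difference between the two patterns can be sketched with an in-memory SQLite database standing in for the warehouse. This is a minimal illustration, not a production pipeline: the `RAW_ORDERS` rows, the table names, and the `run_etl` / `run_elt` helpers are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical source rows; names and values are illustrative only.
RAW_ORDERS = [
    {"order_id": 1, "amount": "19.99", "status": "PAID"},
    {"order_id": 2, "amount": "5.00", "status": "refunded"},
]


def run_etl(conn: sqlite3.Connection) -> None:
    """ETL: clean rows in application code, then load only the cleaned result."""
    conn.execute("CREATE TABLE orders_etl (order_id INTEGER, amount REAL, status TEXT)")
    cleaned = [
        (r["order_id"], float(r["amount"]), r["status"].lower()) for r in RAW_ORDERS
    ]
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def run_elt(conn: sqlite3.Connection) -> None:
    """ELT: load raw rows untouched, then transform with SQL inside the warehouse."""
    conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :amount, :status)", RAW_ORDERS
    )
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT order_id, CAST(amount AS REAL) AS amount, LOWER(status) AS status
        FROM raw_orders
        """
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
```

Both paths end with the same cleaned table. The ELT path also keeps `raw_orders` in the warehouse, so the transformation can be changed and rerun without going back to the source.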

Orchestration

Apache Airflow:

from airflow.decorators import dag, task
from datetime import datetime

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def my_pipeline():
    @task()
    def extract() -> dict:
        return {"data": "extracted"}

    @task()
    def transform(data: dict) -> dict:
        # Carry the extracted payload forward instead of discarding it
        return {**data, "transformed": True}

    @task()
    def load(data: dict):
        # Load to warehouse
        pass

    raw = extract()
    transformed = transform(raw)
    load(transformed)

my_pipeline()

Dagster (recommended for new projects):

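from dagster import Definitions, asset

def extract_from_source() -> list[dict]:
    return [{"id": 1, "email": " A@Example.com "}]  # placeholder source read

def clean_and_validate(rows: list[dict]) -> list[dict]:
    return [{**r, "email": r["email"].strip().lower()} for r in rows]  # placeholder cleanup

@asset
def raw_users():
    return extract_from_source()

@asset
def cleaned_users(raw_users):  # parameter name matches the upstream asset
    return clean_and_validate(raw_users)

# Register both assets so Dagster can load them
defs = Definitions(assets=[raw_users, cleaned_users])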

dbt Transformations

-- models/marts/dim_customers.sql
{{ config(materialized='table', schema='marts') }}

WITH source AS (
    SELECT * FROM {{ ref('stg_customers') }}
),
orders AS (
    SELECT customer_id, COUNT(*) as order_count, SUM(amount) as total_spent
    FROM {{ ref('stg_orders') }}
    GROUP BY customer_id
)
SELECT
    s.customer_id,
    s.name,
    s.email,
    COALESCE(o.order_count, 0) as lifetime_orders,
    COALESCE(o.total_spent, 0) as lifetime_value
FROM source s
LEFT JOIN orders o ON s.customer_id = o.customer_id

Stream Processing

Apache Kafka:

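import json

from confluent_kafka import Consumer, Producer

# Producer: serialize an example event and send it to the 'events' topic
event = {"user_id": "user_123", "action": "signup"}  # illustrative payload
producer = Producer({'bootstrap.servers': 'localhost:9092'})
producer.produce('events', key='user_123', value=json.dumps(event))
producer.flush()  # block until queued messages are delivered

# Consumer
consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'my-group',
    'auto.offset.reset': 'earliest'
})
consumer.subscribe(['events'])

# Poll once; a real consumer would loop
msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(json.loads(msg.value()))
consumer.close()  # leave the consumer group cleanly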

Data Warehouse Schema Design

Star Schema

Fact tables: Measurable events (orders, clicks, transactions)
Dimension tables: Descriptive context (customers, products, dates)
Slowly Changing Dimensions: Type 1 (overwrite), Type 2 (versioned rows), Type 3 (previous column)
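
As a rough illustration of the Type 2 approach, the sketch below keeps one row per version of a customer and closes the current row when an attribute changes. The `CustomerVersion` record, its `valid_from` / `valid_to` columns, and the `apply_scd2` helper are hypothetical names chosen for this example. A Type 1 change would overwrite `email` in place instead of adding a row.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CustomerVersion:
    customer_id: int
    email: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(
    history: list[CustomerVersion], customer_id: int, new_email: str, as_of: date
) -> list[CustomerVersion]:
    """Return a new history list with a Type 2 change applied."""
    updated = []
    found_current = False
    changed = False
    for row in history:
        if row.customer_id == customer_id and row.valid_to is None:
            found_current = True
            if row.email != new_email:
                # Close the old version rather than overwriting it.
                updated.append(replace(row, valid_to=as_of))
                changed = True
                continue
        updated.append(row)
    if changed or not found_current:
        # Open a new current version for the changed or brand-new customer.
        updated.append(CustomerVersion(customer_id, new_email, valid_from=as_of))
    return updated


history = [CustomerVersion(1, "a@example.com", date(2024, 1, 1))]
history = apply_scd2(history, 1, "b@example.com", date(2024, 6, 1))
# Reapplying the same value adds no new version.
history_again = apply_scd2(history, 1, "b@example.com", date(2024, 7, 1))
```

After the change, the history holds two rows for customer 1: the closed original and the open replacement. Fact rows can then join to whichever version was valid at the event date.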

Data Quality

Great Expectations: Schema validation, statistical tests, custom expectations
dbt tests: not_null, unique, accepted_values, relationships, custom SQL tests
Data contracts: Schema evolution policies, backward compatibility requirements
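
To make the four built-in dbt test names concrete, here is a plain-Python sketch of what each one checks. It does not use the dbt or Great Expectations APIs. The function names mirror the dbt test names only for readability, and the sample `customers` and `orders` rows are made up. Each function returns the failing rows or values, so an empty result means the check passed.

```python
from collections import Counter


def not_null(rows, column):
    """Rows where the column is missing or None."""
    return [r for r in rows if r.get(column) is None]


def unique(rows, column):
    """Values that appear more than once in the column."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


def accepted_values(rows, column, allowed):
    """Rows whose non-null value falls outside the allowed set."""
    return [r for r in rows if r.get(column) is not None and r[column] not in allowed]


def relationships(child_rows, column, parent_rows, parent_column):
    """Child rows whose non-null key has no matching parent row."""
    parent_keys = {p[parent_column] for p in parent_rows}
    return [
        r for r in child_rows
        if r.get(column) is not None and r[column] not in parent_keys
    ]


# Illustrative data with one failure per check.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "status": "paid"},
    {"order_id": 11, "customer_id": 3, "status": "unknown"},
    {"order_id": 11, "customer_id": None, "status": "paid"},
]

null_failures = not_null(orders, "customer_id")
duplicate_ids = unique(orders, "order_id")
bad_status = accepted_values(orders, "status", {"paid", "refunded"})
orphans = relationships(orders, "customer_id", customers, "customer_id")
```

Returning the offending rows instead of a boolean makes it easy to log them or route them to a quarantine table.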

Key Patterns

Idempotent pipelines: Same input always produces same output, safe to rerun
Incremental models: Process only new/changed data, use updated_at watermarks
Dead letter queues: Route failed records for inspection without blocking pipeline
Backfill strategy: Time-partitioned tables enable targeted historical reprocessing
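
Three of these patterns can be combined in one small loop, sketched below under simplifying assumptions. A dict stands in for the warehouse table, a list stands in for the dead letter queue, and `run_incremental` and the row fields (`id`, `amount`, `updated_at`) are hypothetical names for this example.

```python
from datetime import datetime


def run_incremental(source_rows, target, watermark, dead_letters):
    """Process rows newer than the watermark and return the new watermark.

    target: dict keyed by record id, standing in for a warehouse table.
    dead_letters: list collecting failed records together with the error.
    """
    new_watermark = watermark
    for row in source_rows:
        try:
            updated_at = datetime.fromisoformat(row["updated_at"])
            if updated_at <= watermark:
                continue  # already handled by an earlier run
            # Upsert by key: replaying the same input leaves the same state.
            target[row["id"]] = {
                "id": row["id"],
                "amount": float(row["amount"]),
                "updated_at": updated_at,
            }
            new_watermark = max(new_watermark, updated_at)
        except (KeyError, TypeError, ValueError) as exc:
            # Set the bad record aside and keep processing the rest.
            dead_letters.append({"record": row, "error": repr(exc)})
    return new_watermark


rows = [
    {"id": 1, "amount": "10.5", "updated_at": "2024-01-02T00:00:00"},
    {"id": 2, "amount": "oops", "updated_at": "2024-01-03T00:00:00"},
    {"id": 3, "amount": "7", "updated_at": "2023-12-31T00:00:00"},
]
target, dead_letters = {}, []
start = datetime(2024, 1, 1)
watermark = run_incremental(rows, target, start, dead_letters)

# Replaying from the original watermark produces the same target state.
replay_target = {}
run_incremental(rows, replay_target, start, [])
```

In this sketch a failed record does not advance the watermark, so it is retried on later runs until the source is fixed. Whether to retry or skip such records is a design choice that depends on the pipeline.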
