
data-engineering-pro

Expert Data Engineering development covering ETL/ELT pipelines, distributed processing (Spark, Flink), message queues (Kafka), and data warehouse architecture (Snowflake, BigQuery).


Install command

npx skills add https://github.com/truongnat/skills --skill data-engineering-pro

Copy and paste this command into Claude Code to install the skill

Source

truongnat/skills

Stars: 2

Forks: 0

Updated: May 9, 2026 at 03:10

SKILL.md


name	data-engineering-pro
description	Expert Data Engineering development covering ETL/ELT pipelines, distributed processing (Spark, Flink), message queues (Kafka), and data warehouse architecture (Snowflake, BigQuery).
metadata	{"short-description":"Data Engineering — ETL/ELT, Kafka, Spark, Airflow, Data Warehousing","content-language":"en","domain":"data-ai","level":"professional"}

Data Engineering Pro

Expert-level orchestration of scalable data pipelines and storage architecture. Focuses on turning raw data into structured, actionable insights at scale.

Boundary

data-engineering-pro covers Data Ingestion (Kafka, Kinesis), Orchestration (Airflow, Dagster), Processing (Spark, dbt), and Storage (Snowflake, BigQuery, Data Lakes). It does NOT cover machine learning model training (use machine-learning-pro) or advanced analytics (use data-science-pro).

When to use

Building robust ETL or ELT pipelines for analytical databases.
Designing a data warehouse or data lake architecture.
Implementing real-time event streaming architectures with Kafka.
Optimizing slow data transformation jobs (e.g., rewriting in PySpark or dbt).

Workflow

Architecture Planning: Choose between Batch (Airflow/dbt) and Streaming (Kafka/Flink).
Ingestion: Design connectors to extract data from operational databases or APIs.
Storage: Define the raw storage layer (S3, GCS) and the structured warehouse (Snowflake).
Transformation: Write scalable data transformations using SQL or PySpark.
Orchestration: Schedule and monitor pipelines using Airflow or Prefect.
Data Quality: Implement data validation tests (Great Expectations) to ensure reliability.
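
The Data Quality step can be prototyped without any framework. The sketch below is a plain-Python stand-in, not the Great Expectations API; the function names (`check_not_null`, `check_unique`, `run_checks`) and the sample rows are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Any

Row = dict[str, Any]


def check_not_null(rows: list[Row], column: str) -> list[int]:
    """Return the indexes of rows where `column` is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def check_unique(rows: list[Row], column: str) -> list[Any]:
    """Return values of `column` that appear more than once (None is ignored)."""
    counts = Counter(
        row.get(column) for row in rows if row.get(column) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


def run_checks(rows: list[Row], key: str, required: list[str]) -> dict[str, list]:
    """Run a uniqueness check on `key` and a not-null check on each required column.

    Returns a dict of failed checks; an empty dict means every check passed.
    """
    failures: dict[str, list] = {}
    duplicates = check_unique(rows, key)
    if duplicates:
        failures[f"unique:{key}"] = duplicates
    for column in required:
        null_rows = check_not_null(rows, column)
        if null_rows:
            failures[f"not_null:{column}"] = null_rows
    return failures


if __name__ == "__main__":
    # Hypothetical sample: one duplicated key and one missing timestamp.
    sample = [
        {"user_id": 1, "created_at": "2026-05-01"},
        {"user_id": 2, "created_at": None},
        {"user_id": 2, "created_at": "2026-05-02"},
    ]
    print(run_checks(sample, key="user_id", required=["user_id", "created_at"]))
```

In a real pipeline the same two rules would more likely live in the validation tool of choice; the point here is only that each check returns the offending rows or values so a failed run can be diagnosed.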

Operating principles

Idempotency: Data pipelines must be re-runnable without causing data duplication.
ELT over ETL: Extract and Load raw data first, then Transform it within the scalable data warehouse.
Data as a Product: Treat datasets with the same rigor as software (versioning, quality checks, SLAs).
Karpathy Principles: Think before coding, Simplicity first, Surgical changes, Goal-driven execution.
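
One common way to satisfy the idempotency principle is to replace a whole partition inside a single transaction, so a retried run overwrites rather than appends. The following is a minimal sketch using Python's built-in `sqlite3` as a stand-in for a warehouse; the `signups` table, its columns, and the helper names are assumptions made for illustration, not part of this skill.

```python
import sqlite3


def make_connection() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `signups` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE signups (load_date TEXT NOT NULL, user_id INTEGER NOT NULL)"
    )
    return conn


def load_partition(conn: sqlite3.Connection, load_date: str, user_ids: list[int]) -> None:
    """Replace one day's partition so that a retry cannot duplicate rows."""
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so the delete and the insert take effect together or not at all.
    with conn:
        conn.execute("DELETE FROM signups WHERE load_date = ?", (load_date,))
        conn.executemany(
            "INSERT INTO signups (load_date, user_id) VALUES (?, ?)",
            [(load_date, user_id) for user_id in user_ids],
        )


def row_count(conn: sqlite3.Connection) -> int:
    """Return the total number of rows in `signups`."""
    return conn.execute("SELECT COUNT(*) FROM signups").fetchone()[0]


if __name__ == "__main__":
    conn = make_connection()
    load_partition(conn, "2026-05-01", [1, 2, 3])
    load_partition(conn, "2026-05-01", [1, 2, 3])  # simulated retry of the same run
    print(row_count(conn))  # still 3 rows after the retry
```

Delete-then-insert by partition is only one option; a merge or upsert keyed on a unique identifier reaches the same goal when partitions are not a natural fit.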

Suggested response format (STRICT)

Your response MUST follow this structure:

<Role>
Senior Data Engineer.
</Role>

<Architecture>
[Data Pipeline or Storage Architecture Description]
</Architecture>

<Implementation>
[Data Engineering Artifact: Airflow DAG, PySpark script, or dbt model]
</Implementation>

<Verification>
[Step-by-step verification plan with Data Quality checks]
</Verification>

Resources in this skill

Topic	Reference
Data Engineer Roadmap	roadmap.sh/data-engineer
Apache Airflow Docs	airflow.apache.org/docs
dbt Documentation	docs.getdbt.com
Apache Kafka Docs	kafka.apache.org/documentation

Quick example

Architecture: A simple dbt model to aggregate daily user signups.

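-- models/marts/core/fct_daily_signups.sql
-- Materialized as a table: rebuilt in full on every run, so re-running does not duplicate rows.
{{ config(materialized='table') }}

WITH raw_users AS (
    SELECT * FROM {{ ref('stg_users') }}
),
daily_aggregated AS (
    SELECT
        -- Snowflake/Postgres syntax. BigQuery takes the date part second,
        -- e.g. TIMESTAMP_TRUNC(created_at, DAY) for a timestamp column.
        DATE_TRUNC('day', created_at) AS signup_date,
        -- COUNT(user_id) skips NULLs and assumes stg_users has one row per user.
        COUNT(user_id) AS total_signups
    FROM raw_users
    GROUP BY 1
)

SELECT * FROM daily_aggregated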
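
The aggregation logic in this model can be sanity-checked locally before it runs in a warehouse. The sketch below is an approximation under stated assumptions: it uses Python's built-in `sqlite3`, substitutes SQLite's `DATE()` for `DATE_TRUNC('day', ...)`, and creates a hypothetical `stg_users` table with only the two columns the model reads. It does not run dbt itself.

```python
import sqlite3

# SQLite version of the model's aggregation. DATE() stands in for
# DATE_TRUNC('day', ...), which SQLite does not provide.
AGGREGATION_SQL = """
    SELECT
        DATE(created_at) AS signup_date,
        COUNT(user_id)   AS total_signups
    FROM stg_users
    GROUP BY 1
    ORDER BY 1
"""


def daily_signups(users: list[tuple]) -> list[tuple]:
    """Load (user_id, created_at) rows into SQLite and return signups per day."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_users (user_id INTEGER, created_at TEXT)")
    conn.executemany("INSERT INTO stg_users VALUES (?, ?)", users)
    result = conn.execute(AGGREGATION_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    # Hypothetical fixture: two signups on May 1, one valid signup on May 2,
    # and one row with a NULL user_id that COUNT(user_id) should skip.
    fixture = [
        (1, "2026-05-01 09:00:00"),
        (2, "2026-05-01 17:30:00"),
        (3, "2026-05-02 08:15:00"),
        (None, "2026-05-02 10:00:00"),
    ]
    for signup_date, total in daily_signups(fixture):
        print(signup_date, total)
```

A fixture like this makes the NULL-handling behaviour of `COUNT(user_id)` explicit, which is easy to overlook when reading the SQL alone.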

Checklist before calling the skill done

Think Before Coding: Data volume, velocity, and schema evolution planned.
Simplicity First: Simple SQL/dbt used instead of complex distributed frameworks (Spark) when the data fits in memory or can be processed inside the warehouse.
Surgical Changes: Modified only necessary pipeline stages or transformations.
Goal-Driven Execution: Verified data output matches expected schema and quality rules.
Pipeline is fully idempotent (can be safely retried).
Data quality tests (e.g., null checks, uniqueness) are implemented.
PII and sensitive data are appropriately masked or secured.
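
For the PII item, one possible masking approach is keyed hashing, which replaces a value with a stable pseudonym so joins still work while the raw value is hidden. This is a minimal sketch using only the standard library; the function names, the column names, and the inline key are hypothetical, and a real pipeline would load the key from a secrets manager rather than source code.

```python
import hashlib
import hmac
from typing import Any


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of `value` as a 64-character hex string."""
    # Normalize first so "A@x.com " and "a@x.com" map to the same pseudonym.
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


def mask_row(row: dict[str, Any], pii_columns: set[str], secret_key: bytes) -> dict[str, Any]:
    """Return a copy of `row` with the listed PII columns pseudonymized."""
    return {
        column: pseudonymize(str(value), secret_key)
        if column in pii_columns and value is not None
        else value
        for column, value in row.items()
    }


if __name__ == "__main__":
    key = b"demo-key-do-not-use"  # placeholder only; never hard-code a real key
    row = {"user_id": 42, "email": "Ada@example.com", "plan": "pro"}
    print(mask_row(row, pii_columns={"email"}, secret_key=key))
```

Keyed hashing is pseudonymization rather than anonymization: anyone holding the key can recompute the mapping, so access to the key needs the same protection as the raw data.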

