data-eng-data-quality

Name: Data Eng Data Quality
Author: justanesta

Data quality validation, observability, and monitoring for data pipelines. Use this skill when implementing data quality checks with Great Expectations or Soda Core, designing schema contracts, building anomaly detection, or establishing data observability practices. Covers validation frameworks, quality metrics, SLAs, freshness monitoring, and lineage tracking.


updated: 2026-02-15 04:18

Repository: justanesta/claude-code-resources


Data Engineering: Data Quality

Comprehensive data quality validation, monitoring, and observability patterns for production data pipelines.

Core Principles

Validate early, validate often -- Catch data issues at ingestion before they propagate downstream. Every pipeline stage should have quality gates.
Schema contracts are APIs -- Treat your data schemas as versioned contracts between producers and consumers. Breaking changes require coordination.
Measure the six dimensions -- Track completeness, accuracy, consistency, timeliness, uniqueness, and validity as quantifiable metrics with thresholds.
Observability over monitoring -- Move beyond threshold alerts to understanding data behavior through freshness, volume, schema, and lineage tracking.
Quality is a pipeline, not a step -- Data quality is not a single validation checkpoint; it is a continuous process woven into every stage of your data lifecycle.
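
The first and last principles can be sketched as a small runner that validates each stage's output before handing it to the next stage. This is a minimal illustration using only the standard library; `run_with_gates`, `QualityGateError`, the stage names, and the check functions are hypothetical and not part of any framework mentioned here.

```python
class QualityGateError(Exception):
    """Raised when a stage's output fails one of its checks."""


def run_with_gates(rows, stages):
    """Run (name, transform, checks) stages in order, validating after each one."""
    for name, transform, checks in stages:
        rows = transform(rows)
        failed = [check.__name__ for check in checks if not check(rows)]
        if failed:
            # Fail fast so bad rows never reach the next stage.
            raise QualityGateError(f"{name}: failed {failed}")
    return rows


# Hypothetical checks: each takes the full batch and returns True or False.
def not_empty(rows):
    return len(rows) > 0


def ids_present(rows):
    return all(r.get("customer_id") is not None for r in rows)


def balances_non_negative(rows):
    return all(r["account_balance"] >= 0 for r in rows)


# Hypothetical stages.
def ingest(rows):
    return rows


def to_float_balance(rows):
    return [{**r, "account_balance": float(r["account_balance"])} for r in rows]


stages = [
    ("ingestion", ingest, [not_empty, ids_present]),
    ("transform", to_float_balance, [balances_non_negative]),
]
clean = run_with_gates([{"customer_id": 1, "account_balance": "12.5"}], stages)
```

A batch with a null `customer_id` would stop at the ingestion gate, before the transform stage ever runs.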

Great Expectations Fundamentals

Define expectations as declarative rules, organize them into suites, and run checkpoints in your pipeline.

import great_expectations as gx

context = gx.get_context()
ds = context.data_sources.add_pandas("customer_source")
asset = ds.add_dataframe_asset(name="customers")
batch_def = asset.add_batch_definition_whole_dataframe("full_batch")

suite = context.suites.add(gx.ExpectationSuite(name="customer_ingestion_suite"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="customer_id"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeUnique(column="customer_id"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToMatchRegex(
    column="email", regex=r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(
    column="account_balance", min_value=0, max_value=10_000_000
))

val_def = context.validation_definitions.add(
    gx.ValidationDefinition(name="customer_validation", data=batch_def, suite=suite)
)
checkpoint = context.checkpoints.add(
    gx.Checkpoint(name="customer_checkpoint", validation_definitions=[val_def])
)
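# `df` is the pandas DataFrame to validate, loaded elsewhere. A dataframe
# asset receives its data at run time through batch_parameters.
result = checkpoint.run(batch_parameters={"dataframe": df})
if not result.success:
    # run_results holds one entry per validation definition, not per expectation.
    failed = sum(1 for r in result.run_results.values() if not r.success)
    raise ValueError(f"Quality failed: {failed} validation(s) did not pass")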

See great-expectations-patterns.md for:

Custom expectation development
Data Docs generation and hosting
Multi-datasource checkpoint orchestration
Conditional and parameterized expectations

Soda Core Checks

Define data quality checks in YAML and run them with Soda Core for lightweight, configuration-driven validation.

checks for customers:
  - row_count > 0
  - missing_count(customer_id) = 0
  - duplicate_count(customer_id) = 0
  - invalid_count(email) = 0:
      valid regex: "^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+$"
  - freshness(updated_at) < 2d
  - schema:
      fail:
        when required column missing: [customer_id, email, created_at]
        when wrong type:
          customer_id: integer
  - anomaly detection for row_count:
      warn: only

See soda-checks.md for:

Freshness and schema checks in depth
Anomaly detection configuration
Cross-dataset reference checks
Soda Cloud integration and alerting

Schema Contracts and Evolution

Enforce schema contracts at pipeline boundaries to prevent breaking changes from propagating silently.

from dataclasses import dataclass
from enum import Enum

class Compat(Enum):
    BACKWARD = "backward"   # New schema can read old data
    FORWARD = "forward"     # Old schema can read new data
    FULL = "full"           # Both directions

@dataclass
class SchemaContract:
    name: str
    version: int
    required_columns: dict[str, str]    # column_name -> data_type
    optional_columns: dict[str, str]
    compatibility: Compat

    def validate_dataframe(self, df) -> list[str]:
        violations = []
        for col, dtype in self.required_columns.items():
            if col not in df.columns:
                violations.append(f"Missing required column: {col}")
            elif str(df[col].dtype) != dtype:
                violations.append(f"Column {col}: expected {dtype}, got {df[col].dtype}")
        return violations

    def check_evolution(self, new_contract: "SchemaContract") -> list[str]:
        issues = []
        removed = set(self.required_columns) - set(new_contract.required_columns)
        if removed and self.compatibility in (Compat.FORWARD, Compat.FULL):
            issues.append(f"Removing columns breaks forward compat: {removed}")
        added = set(new_contract.required_columns) - set(self.required_columns)
        if added and self.compatibility in (Compat.BACKWARD, Compat.FULL):
            issues.append(f"Adding required columns breaks backward compat: {added}")
        return issues

See schema-contracts.md for:

Avro and Protobuf schema registry integration
Automated migration generation
Contract testing in CI/CD pipelines
Handling nullable and default value evolution

Anomaly Detection Patterns

Detect unexpected changes in data volume, distributions, and value ranges using statistical methods.

import numpy as np

class VolumeAnomalyDetector:
    def __init__(self, z_threshold: float = 3.0):
        self.z_threshold = z_threshold

    def check(self, current: int, history: list[int]) -> dict:
        if len(history) < 7:
            return {"is_anomaly": False, "reason": "insufficient_history"}
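        # np.std defaults to the population standard deviation (ddof=0).
        # Cast to float so the result holds plain Python types.
        mean, std = float(np.mean(history)), float(np.std(history))
        if std == 0:
            # Flat history: any deviation at all counts as an anomaly.
            return {"is_anomaly": current != mean}
        z = (current - mean) / std
        return {"is_anomaly": abs(z) > self.z_threshold, "z_score": round(z, 3),
                "expected_range": (round(mean - self.z_threshold * std), round(mean + self.z_threshold * std))}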
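
To make the arithmetic concrete, the following sketch recomputes the same z-score with only the standard library. The row counts are invented for illustration, and `z_score` is a hypothetical helper, not part of the class above.

```python
from statistics import fmean, pstdev


def z_score(current, history):
    """Z-score of `current` against `history`, using the population std."""
    return (current - fmean(history)) / pstdev(history)


# Hypothetical daily row counts for the previous seven loads.
history = [1000, 1020, 980, 1010, 990, 1005, 995]

mean = fmean(history)   # 1000.0
std = pstdev(history)   # sqrt(150), about 12.25

spike = z_score(1200, history)      # about 16.33, far above a threshold of 3.0
ordinary = z_score(1030, history)   # about 2.45, inside the threshold

# The band of counts a threshold of 3.0 would accept.
expected_range = (round(mean - 3.0 * std), round(mean + 3.0 * std))
```

With this history, a threshold of 3.0 accepts counts between 963 and 1037, so a load of 1200 rows is flagged while 1030 is not.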

See anomaly-detection.md for:

Distribution shift detection with KL divergence
Seasonal adjustment for time-series metrics
Multi-metric correlation analysis
Alert fatigue reduction strategies

Data Observability

Track freshness, volume, schema changes, and lineage across your entire data platform.

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum

class Status(Enum):
    HEALTHY = "healthy"
    WARNING = "warning"
    CRITICAL = "critical"

@dataclass
class TableHealthCheck:
    table_name: str
    freshness_threshold: timedelta
    expected_schema: dict[str, str]

    def check_freshness(self, last_updated: datetime) -> Status:
        # last_updated must be timezone-aware. datetime.utcnow() is deprecated
        # since Python 3.12 and returns a naive value, so use an aware "now".
        age = datetime.now(timezone.utc) - last_updated
        if age > self.freshness_threshold * 2:
            return Status.CRITICAL
        return Status.WARNING if age > self.freshness_threshold else Status.HEALTHY

    def check_schema(self, actual: dict[str, str]) -> tuple[Status, list[str]]:
        diffs = [f"Missing: {c}" for c in self.expected_schema if c not in actual]
        diffs += [f"{c}: expected {t}, got {actual[c]}"
                  for c, t in self.expected_schema.items()
                  if c in actual and actual[c] != t]
        status = Status.CRITICAL if any("Missing" in d for d in diffs) \
            else Status.WARNING if diffs else Status.HEALTHY
        return status, diffs
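
The health check above covers freshness and schema but not lineage. As a rough sketch of what lineage tracking enables, the snippet below walks a hand-written dependency map to list every table affected when one table goes stale. The table names, the `LINEAGE` mapping, and `downstream` are all hypothetical; a real platform would load the graph from a lineage tool rather than hard-code it.

```python
from collections import deque

# Hypothetical lineage: each key feeds the tables listed in its value.
LINEAGE = {
    "raw.customers": ["staging.customers"],
    "staging.customers": ["marts.customer_360", "marts.churn_features"],
    "marts.churn_features": ["ml.churn_scores"],
}


def downstream(table, lineage):
    """Return every table reachable from `table`, in breadth-first order."""
    seen, order, queue = {table}, [], deque([table])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream("staging.customers", LINEAGE)
```

Here `impacted` lists the two marts first and then `ml.churn_scores`, which is the order in which an incident on `staging.customers` would spread.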

See observability-patterns.md for:

Lineage graph construction and querying
Automated freshness SLA enforcement
Incident response runbooks
Integration with Monte Carlo, Elementary, and OpenLineage

Quality Metrics and SLAs

Define measurable quality dimensions and enforce them through SLAs with automated scoring.

from dataclasses import dataclass

@dataclass
class QualityDimension:
    name: str                       # "completeness", "accuracy", "timeliness"
    weight: float                   # weights must sum to 1.0
    threshold_warn: float
    threshold_fail: float
    current_score: float = 0.0

def composite_score(dims: list[QualityDimension]) -> dict:
    score = sum(d.current_score * d.weight for d in dims)
    fails = [d.name for d in dims if d.current_score < d.threshold_fail]
    warns = [d.name for d in dims if d.threshold_fail <= d.current_score < d.threshold_warn]
    return {"score": round(score, 4), "status": "fail" if fails else "warn" if warns else "pass"}
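
The `current_score` values have to be measured somewhere. One possible way to derive two of the dimensions from a batch of rows is sketched below; the `completeness` and `uniqueness` helpers and the sample rows are assumptions made for illustration, and real pipelines would usually compute these in SQL or a validation framework.

```python
def completeness(rows, column):
    """Fraction of rows where `column` is present and not None."""
    if not rows:
        return 0.0
    return sum(r.get(column) is not None for r in rows) / len(rows)


def uniqueness(rows, column):
    """Distinct non-null values divided by all non-null values."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return len(set(values)) / len(values) if values else 0.0


# Hypothetical batch: one missing email and one duplicated customer_id.
rows = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 2, "email": "b@example.com"},
    {"customer_id": 3, "email": "c@example.com"},
]

scores = {
    "completeness": completeness(rows, "email"),
    "uniqueness": uniqueness(rows, "customer_id"),
}
```

Both scores come out at 0.75 for this batch, and each can be compared against its own warn and fail thresholds.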

See quality-metrics.md for:

Six dimensions of data quality measurement
SLA definition and enforcement patterns
Quality dashboards and trend reporting
Cost-of-poor-quality estimation

Anti-Patterns

Avoid	Use Instead
Validating only at pipeline end	Quality gates at each stage (ingestion, transform, load)
Hardcoded thresholds without history	Adaptive thresholds from rolling statistics
Silent failures writing bad data downstream	Fail-fast with circuit breakers and dead-letter queues
Schema-on-read with no contracts	Explicit schema contracts between producers and consumers
Single global quality score	Per-dimension scores with independent thresholds
Manual checks after incidents	Automated continuous validation in pipelines
Alerting on every anomaly	Tiered alerting (info/warn/critical) with escalation policies
Testing quality only in production	Contract tests in CI/CD before deployment
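
The fail-fast row of the table can be illustrated with a small router that diverts invalid rows to a dead-letter list and halts the load when too many rows are bad. This is a sketch under assumptions: `route_batch`, `CircuitOpen`, the 10% default ratio, and the validity rule are all hypothetical, and a production version would persist the dead-letter rows before raising.

```python
class CircuitOpen(Exception):
    """Raised when too many rows in a batch are invalid to continue safely."""


def route_batch(rows, is_valid, max_bad_ratio=0.1):
    """Split rows into (good, dead_letter); abort if the bad ratio is too high."""
    good, dead_letter = [], []
    for row in rows:
        (good if is_valid(row) else dead_letter).append(row)
    if rows and len(dead_letter) / len(rows) > max_bad_ratio:
        # Trip the breaker instead of writing a mostly-bad batch downstream.
        raise CircuitOpen(f"{len(dead_letter)}/{len(rows)} rows invalid; halting load")
    return good, dead_letter


def has_customer_id(row):
    return row.get("customer_id") is not None


# Hypothetical batch: 19 valid rows and 1 invalid row (5% bad).
batch = [{"customer_id": i} for i in range(19)] + [{"customer_id": None}]
good, dead_letter = route_batch(batch, has_customer_id)
```

With one bad row in twenty, the load proceeds and the bad row is set aside. A batch where half the rows lack a `customer_id` would raise `CircuitOpen` instead.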

Performance

Sample large datasets -- Use TABLESAMPLE or partition-level checks on multi-billion row tables.
Parallelize checks -- Run column-level expectations concurrently; Soda checks are independent by default.
Partition-aware validation -- Only validate changed partitions, not the entire table.
Cache historical metrics -- Store rolling statistics in a metrics table rather than recomputing each run.
Approximate aggregations -- HyperLogLog for cardinality, t-digest for percentiles on terabyte tables.
Schedule strategically -- Freshness/volume every 15 min; distribution checks daily off-peak.
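
Partition-aware validation combined with cached metrics might look like the sketch below: compare a cheap fingerprint of each partition against the one stored at the last validated run, and re-check only the partitions that differ. The fingerprint shape (row count plus latest update time), the partition keys, and `partitions_to_validate` are assumptions for illustration.

```python
def partitions_to_validate(current, validated):
    """Return partition keys whose fingerprint changed since the last validated run.

    Both arguments map partition key -> fingerprint.
    """
    return sorted(key for key, fp in current.items() if validated.get(key) != fp)


# Hypothetical fingerprints cached in a metrics table after the previous run.
validated = {
    "2026-02-13": (1000, "2026-02-13T23:59"),
    "2026-02-14": (1010, "2026-02-14T23:58"),
}

# Hypothetical fingerprints computed cheaply at the start of this run.
current = {
    "2026-02-13": (1000, "2026-02-13T23:59"),  # unchanged, skipped
    "2026-02-14": (1012, "2026-02-15T01:10"),  # late-arriving rows
    "2026-02-15": (990, "2026-02-15T03:00"),   # new partition
}

todo = partitions_to_validate(current, validated)
```

Only the changed `2026-02-14` partition and the new `2026-02-15` partition are selected, so the full checks never touch `2026-02-13` again.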

source: Great Expectations docs, Soda Core docs, Data Quality Fundamentals (O'Reilly), Monte Carlo documentation
