Exécutez n'importe quel Skill dans Manus
en un clic

Exécutez n'importe quel Skill dans Manus en un clic

preferences-data-modeling

Étoiles13

Forks0

Mis à jour22 juin 2026 à 03:42

Data modeling conventions including Arrow-based columnar interchange, DuckDB/DuckLake lakehouse patterns, SQLMesh transformations, and scientific data contracts. Load when designing database structures, data pipelines, cross-language analytics, or data relationships.

Installation

Installer avec Codex ou Claude Copiez ce prompt, collez-le dans Codex, Claude ou un autre assistant, puis laissez-le vérifier la page du skill et l'installer pour vous.

Exécuter dans Manus

Source

cameronraysmith

cameronraysmith/vanixiets

Ouvrir le dépôt GitHub Voir les dépôts du créateur

Téléchargement

Exécuter dans Manus

SKILL.md

readonly

Plus depuis ce dépôt

même dépôt

knowledge-graph

cameronraysmith/vanixiets

Index curated reference corpora into a searchable knowledge graph via the cognee engine, then query it to ground technical writing, review, and analysis. Use when ingesting reference documents into named datasets or retrieving grounding context for tasks like drafting or reviewing a manuscript. A reference-knowledge index, explicitly not agent session memory.

2026-06-2313

preferences-nix-development

cameronraysmith/vanixiets

Nix development conventions for flakes, derivations, modules, and code style. Use when authoring flake.nix files, writing derivations or builders, designing NixOS/nix-darwin/home-manager modules, or following nix formatting and naming conventions. For check architecture and CI integration, see preferences-nix-checks-architecture and preferences-nix-ci-cd-integration.

2026-06-2313

refinement-driven-development

cameronraysmith/vanixiets

Approximately-verifiable, refinement-driven development for type-driven domain-driven design. Use when modeling a domain as a dependently-typed Lean 4 specification, refining/lowering it to a Rust implementation, lifting the implementation back via Charon and Aeneas to check spec<->implementation correspondence (translation validation) — mechanically when tractable, otherwise via differential testing or LLM comparison — or when generating and diffing type-system diagrams of the model and implementation to track their evolution. Mechanical on-the-nose proof is the precise ideal, not a requirement; its absence is not a failure of the method.

2026-06-2213

meta-skill-creator

cameronraysmith/vanixiets

Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations.

2026-06-2213

preferences-algebraic-data-types

cameronraysmith/vanixiets

Algebraic data type patterns including sum types, product types, and pattern matching across languages. Load when designing type hierarchies or working with discriminated unions.

2026-06-2213

preferences-algebraic-laws

cameronraysmith/vanixiets

Algebraic laws including functor/monad laws and property-based testing strategies. Load when verifying algebraic properties or writing property tests.

2026-06-2213

name	preferences-data-modeling
description	Data modeling conventions including Arrow-based columnar interchange, DuckDB/DuckLake lakehouse patterns, SQLMesh transformations, and scientific data contracts. Load when designing database structures, data pipelines, cross-language analytics, or data relationships.

Data modeling

Scope and purpose

This document covers data modeling for data pipelines and analytics using Apache Arrow as the in-memory interchange format, DuckDB as the query engine, DuckLake for lakehouse catalog management, and SQLMesh for data transformation and build orchestration.

What this document covers:

Arrow columnar format as the interoperability substrate across language runtimes
Cross-language DuckDB bindings producing Arrow record batches
Data warehouse schemas and transformations via SQLMesh
DuckLake operational patterns and remote data access
Columnar format evolution (Parquet, Vortex, Lance)
Data quality checks, validation, and scientific data contracts
Semantic layers and metrics
Pipeline orchestration patterns

What this document does not cover:

Application domain modeling (see domain-modeling.md)
Business logic workflows
State machines for entity lifecycles
Smart constructors for domain types

The key distinction is that this document is about modeling data at rest (tables, schemas, metrics) and data in motion (Arrow record batches, IPC, cross-language interchange) while domain-modeling.md is about modeling domain logic (entities, workflows, state transitions).

For application domain modeling using functional programming patterns, see:

domain-modeling.md for core patterns (smart constructors, state machines, workflows, aggregates)
architectural-patterns.md for application structure (onion architecture, dependency injection)

Foundational principles

Type safety

Data schemas are contracts. Treat them as type signatures that make implicit assumptions explicit.

Use Pydantic schemas for data validation at ingestion boundaries
Define explicit column types in all model definitions (avoid SELECT *)
Document grain (primary keys) as part of the model contract
Implement audit checks (uniqueness, not-null constraints) as executable type constraints
Leverage schema evolution capabilities for non-breaking changes only
Test schema changes in isolated environments before applying to production

Functional patterns

Data transformations should behave as pure functions where practical.

Treat SQL views and models as lazily-evaluated pure functions
Use immutable data files (Parquet via DuckLake or similar formats)
Make state changes explicit through snapshots and time travel capabilities
Isolate side effects at pipeline boundaries (ingestion, egress)
Prefer composition over inheritance in metric and model definitions
Document data lineage as function composition chains

Immutability and append-only patterns

Once written, data files should never be modified in place.

Use append-only data files for all raw and processed data
Implement updates through delete-insert patterns at partition boundaries
Maintain historical snapshots for reproducibility and debugging
Version all schema and model definitions in source control
Treat data deletions as tombstone markers rather than physical removal where compliance allows

Schema management

Evolution principles

Schema changes should be transactional and backward-compatible when possible.

Use transactional DDL for all schema modifications
Document breaking vs non-breaking changes explicitly
Employ column-level lineage tracking for impact analysis
Test schema changes in virtual or development environments first
Plan migration paths for breaking changes
Maintain schema documentation alongside code

Validation at boundaries

Validate data shape and constraints at system boundaries.

Implement validation at ingestion points (before data enters the lake)
Use Pydantic or similar libraries for runtime type checking in Python
Define data quality checks as executable contracts
Fail fast on validation errors rather than propagating bad data
Log validation failures for debugging and monitoring
Consider partial validation for streaming scenarios (LLM outputs, etc.)

Type-driven design with ADTs

For comprehensive guidance on modeling domain types and error handling, see:

~/.claude/skills/preferences-algebraic-data-types/SKILL.md — sum types (discriminated unions), product types, newtype pattern, making illegal states unrepresentable
~/.claude/skills/preferences-railway-oriented-programming/SKILL.md — Result types, bind/apply composition, effect signatures, two-track model

Key principles from ADT design:

Use sum types (ENUMs, discriminated unions) to model "OR" relationships
Use product types (records, tables) to model "AND" relationships
Wrap primitives in newtypes to prevent semantic confusion (UserId vs OrderId)
Design schemas so illegal states cannot be represented
Use Result<T, E> for operations that can fail
Distinguish bind (sequential, short-circuit) from apply (parallel, collect errors)

Incremental processing patterns

Time-partitioned models

When processing event data, partition by time to enable efficient incremental updates.

Use INCREMENTAL_BY_TIME_RANGE or equivalent patterns for time-series data
Configure lookback windows to handle late-arriving data gracefully
Document time columns and partition granularity explicitly
Implement idempotent transformations (delete + insert within partition)
Track processing state explicitly (watermarks, snapshots)
Test backfill and reprocessing scenarios

State tracking

Make data pipeline state explicit and queryable.

Maintain processing metadata (last processed timestamp, record counts)
Use snapshots for point-in-time recovery
Implement time travel for debugging and rollback
Track data lineage and transformation history
Version control pipeline definitions and state schemas

For modeling pipeline execution states (Pending, Running, Completed, Failed), consider using state machine patterns from domain-modeling.md's state machines for entity lifecycles section. This applies to job status tracking, data quality checks, and orchestration workflows.

Semantic layer patterns

Logical vs physical separation

Separate business logic (metrics, dimensions) from physical storage schema.

Define metrics once in version-controlled configuration (YAML, SQL)
Abstract physical table structure behind logical views
Enable ad-hoc query composition without pre-aggregation
Use semantic definitions as documentation and contracts
Maintain metric lineage and dependencies explicitly
Test metrics in isolation from physical implementation

Metric composition

Build complex metrics from simpler, reusable components.

Define base metrics as atomic units
Compose derived metrics from base metrics with explicit dependencies
Document metric calculation logic alongside definitions
Version metrics and track breaking changes
Validate metric consistency across different tools and interfaces

Arrow as interoperability substrate

Apache Arrow is the in-memory columnar format that makes DuckDB zero-copy across language runtimes. This is the foundational architectural choice connecting the query engine (DuckDB), on-disk storage (Parquet and its successors), and computational consumers across Python, Julia, and Rust.

Arrow record batches serve as the universal in-memory interchange format. DuckDB query results are natively Arrow-backed through the C Data Interface, which means results cross language boundaries without serialization or copying. Parquet files memory-map to Arrow with predicate pushdown, so the transition from storage to compute avoids unnecessary materialization. Arrow IPC enables efficient cross-process and cross-language data transfer, and GPU streaming via Arrow IPC or GDS/KvikIO saturates GPU compute without CPU-side bottlenecks.

The practical consequence is that any consumer that speaks Arrow — whether a Python dataframe library, a Julia differential equations solver, or a Rust web server — can receive query results from DuckDB with zero intermediate representation. This makes Arrow the lingua franca of the analytics stack rather than a format choice for any single language ecosystem.

Cross-language DuckDB bindings

DuckDB provides first-class bindings that all produce Arrow record batches through the same C Data Interface. The bindings vary in maturity and ecosystem integration but share the same zero-copy Arrow output path.

Python: duckdb package, the most mature binding with tight pandas and polars integration
Julia: DuckDB.jl (source at ~/projects/duckdb/tools/juliapkg), enabling Julia's SciML ecosystem (DiffEq.jl, Lux.jl, KernelAbstractions.jl) to consume lakehouse data directly as Arrow record batches
Rust: duckdb-rs (~/projects/omicslake-workspace/duckdb-rs) for embedded analytics, and async-duckdb (~/projects/rust-workspace/async-duckdb) for async web application serving via axum

All three language ecosystems benefit equally from the DuckDB-to-Arrow pipeline. A single DuckLake catalog can serve Python notebooks running probabilistic inference, Julia processes solving differential equations, and Rust services delivering query results over HTTP — all consuming the same Arrow record batches without format translation.

DuckDB and DuckLake patterns

Read models and materialized views in analytics architectures form a Galois connection with the underlying event log or source data. The projection from events to views is information-lossy but composable; views are quotients of the source data under an equivalence relation. Query caching is memoization over the query profunctor, with cache invalidation governed by naturality conditions. DuckLake's time-travel capability implements temporal versioning as indexed types, connecting to the bitemporal semantics described in event sourcing. See preferences-theoretical-foundations references/abstraction-as-adjunction.md for the order-theoretic model in which a best abstraction, when one exists, is the left adjoint of a Galois connection.

Metadata management

Store catalog metadata in SQL databases, not file-based catalogs.

Use DuckLake or similar for lakehouse patterns with full ACID support
Store all table metadata (schemas, snapshots, statistics) in relational catalogs
Leverage transactional guarantees for atomic updates
Query metadata tables directly for lineage and governance
Avoid complex file-based metadata (JSON, Avro manifests) where possible

Query optimization

Leverage DuckDB's strengths for analytical workloads.

Use Parquet for columnar storage with compression
Push down predicates and projections to file scans
Leverage partition pruning through statistics
Query data where it lives to reduce data movement
Use inline data storage for small, frequently-changing tables
Profile queries to identify bottlenecks

Local development to production

Bridge local DuckDB development with production deployments.

Develop and test locally with DuckDB files or in-memory databases
Use consistent SQL dialect between local and production environments
Deploy to cloud environments (Fabric, MotherDuck) with minimal code changes
Maintain parity between local and production data samples
Test incremental logic locally before scheduling in production

DuckLake operational patterns

DuckLake provides ACID transactions without running a database server, implementing an embedded lakehouse pattern where the catalog is a DuckDB database and data lives in columnar files.

Zero-copy table registration via CREATE TABLE ... WITH NO DATA registers external schemas without materializing data, which is useful for schema-only integration with existing Parquet file collections. The DATA_PATH sentinel controls where materialized data lives, separating catalog state from data storage location. The ducklake:hf:// protocol enables direct catalog access to HuggingFace-hosted datasets.

For SQLMesh integration, use a separate DuckLake state connection to keep the transformation engine's internal state isolated from the catalog being built. This avoids contention between SQLMesh's plan/apply lifecycle and DuckLake's ACID transactions on the same catalog.

httpfs and remote data access

DuckDB's httpfs extension enables predicate pushdown over remote Parquet files, meaning only the relevant row groups and columns are transferred over the network. The hf:// protocol provides native access to HuggingFace Hub datasets without downloading full files locally.

This enables server-side DuckDB architectures where a web application queries remote DuckLake catalogs and streams Arrow record batches to clients, avoiding the need for browsers or client applications to download full datasets. Combined with DuckLake's catalog metadata, the query planner can prune partitions and push predicates before any data crosses the network boundary.

Format evolution

Parquet is the current standard on-disk columnar format and the default for DuckLake data storage. Two emerging formats extend the columnar ecosystem in complementary directions.

Vortex (~/projects/omicslake-workspace/vortex) is an emerging successor to Parquet with improved compression and query performance. DuckDB already supports native Vortex read and write, though DuckLake's catalog implementation does not yet support Vortex as a storage backend.

Lance (used by ~/projects/omicslake-workspace/slaf) is an Arrow-native format optimized for ML and vector workloads. DuckDB supports Lance via extension, and Lance files are lakehouse-compatible.

The practical stance is Parquet today, with Vortex and Lance as they mature. DuckDB's native support for all three formats means the transition is transparent to query consumers — the Arrow record batches they receive are identical regardless of which on-disk format backs the data.

SQLMesh integration

Model kinds and patterns

Choose appropriate model kinds for different use cases.

SEED - for static reference data loaded from files
INCREMENTAL_BY_TIME_RANGE - for time-partitioned event data
VIEW - for logical transformations without materialization
FULL - for complete refresh of derived tables
Document rationale for model kind selection

Virtual environments and testing

Test changes safely before affecting production.

Use virtual environments to preview changes
Run audits and tests before applying plans
Validate data quality in isolated environments
Enable column-level lineage for impact analysis
Roll back changes atomically if issues arise

Audit and governance

Build data quality checks into model definitions.

Define audits as executable contracts (unique, not-null, ranges)
Document grain (primary keys) explicitly in model metadata
Track model dependencies and lineage
Version models and track breaking changes
Implement continuous testing for data quality

Scientific data contract patterns

Data artifacts from scientific and computational workflows benefit from columnar storage and DuckDB queryability when they follow consistent structural conventions.

Simulation ensembles produce millions of (parameter, observation) pairs where observations may be variable-length time series. Arrow list and struct columns naturally represent these variable-length trajectories without flattening to a fixed grid, and DuckDB's nested type support means the data is directly queryable without denormalization.

Diagnostic tables — SBC ranks, z-scores, contractions, calibration metrics — are inherently tabular with one row per (dataset_id, parameter). These are the simplest case: flat columnar tables that DuckDB processes efficiently and that translate directly to diagnostic visualizations.

Posterior samples are tabular with (sample_id, parameter columns), directly queryable for summary statistics, quantiles, and convergence diagnostics. Storing these as Parquet partitioned by model version enables time-travel queries across model iterations.

Model provenance registries track which artifacts belong to which model version: (model_version, ensemble_path, approximator_path, timestamp, invalidated_by). This is the metadata layer that connects simulation outputs, trained approximators, and diagnostic results into a queryable lineage graph.

Experimental design sweep results follow the pattern (design_config, parameter, metric_value), enabling cross-configuration queries to identify which experimental designs yield the best inference performance.

See preferences-scalable-probabilistic-modeling-workflow section 06 for diagnostic threshold definitions and paradigm-specific applicability.

Current data hosting stack

The concrete operational stack consists of three components: SQLMesh for data build and transformation, DuckDB with DuckLake for catalog management and query, and HuggingFace Hub with git-xet/git-lfs for data hosting and distribution. This combination provides a serverless lakehouse where the catalog is a DuckDB file, data lives in Parquet on HuggingFace, and SQLMesh manages the transformation DAG.

Data ingestion (dlt) and workflow orchestration (Dagster) are plausible future additions that could become relevant under certain operational contexts. Neither is part of the current operational stack, and their adoption would depend on whether ingestion complexity or scheduling requirements exceed what SQLMesh provides natively.

Composability and modularity

Reusable transformations

Build libraries of reusable data transformations.

Extract common logic into views or utility models
Parameterize transformations where appropriate
Document inputs, outputs, and assumptions
Test transformations in isolation
Version transformation logic separately from business rules

Cross-tool consistency

Ensure metrics and logic remain consistent across tools.

Define metrics in a central semantic layer
Use the same SQL dialect across tools where possible
Share schema definitions between validation and transformation code
Test metric calculations across different interfaces
Document expected behavior and edge cases

Best practices summary

When modeling data, prioritize:

Explicit contracts — schemas, types, and constraints as documentation
Immutability — append-only patterns and versioning
Composability — building complex models from simple, reusable components
Testability — validating transformations in isolation
Lineage — tracking data flow from source to consumption
Type safety — leveraging type systems to catch errors early
Transactionality — ACID guarantees for consistency
Separation of concerns — logical models separate from physical storage
Arrow interchange — zero-copy columnar format as the universal data handoff between languages and processes

Remember that data pipelines should make side effects explicit in their signatures and isolate them at boundaries to preserve compositionality, discharging those effects through capability interfaces and handlers rather than committing to any single carrier such as a monad transformer stack. See preferences-theoretical-foundations (and its references/effects-handlers.md) for why a capability interface, not a transformer tower, is the primitive.

For concrete implementation patterns see ~/.claude/skills/preferences-railway-oriented-programming/SKILL.md (Result types, bind/apply, effect signatures) and ~/.claude/skills/preferences-algebraic-data-types/SKILL.md (domain modeling with sum/product types).