원클릭으로
dremio-lakehouse-modeling
// Teaches an AI agent data modeling best practices in a Dremio lakehouse — medallion architecture, views vs tables, reflections strategy, partitioning, dimensional modeling, and semantic layer design.
// Teaches an AI agent data modeling best practices in a Dremio lakehouse — medallion architecture, views vs tables, reflections strategy, partitioning, dimensional modeling, and semantic layer design.
[HINT] SKILL.md 및 모든 관련 파일을 포함한 전체 스킬 디렉토리를 다운로드합니다
| name | Dremio Lakehouse Modeling |
| description | Teaches an AI agent data modeling best practices in a Dremio lakehouse — medallion architecture, views vs tables, reflections strategy, partitioning, dimensional modeling, and semantic layer design. |
This skill provides opinionated guidance on how to model data in a Dremio lakehouse. It covers architecture patterns, when to use views vs tables, how to design reflections, partitioning strategies, and semantic layer design. Use this when helping users design or refactor their data pipelines and models.
The most common pattern for organizing a Dremio lakehouse. Each layer has a specific purpose and is implemented using different Dremio objects.
What: Raw data landed from sources with minimal transformation. Data types should match the source.
Dremio implementation: Iceberg tables in a lakehouse catalog (Nessie, Arctic, Glue).
-- Ingest raw data from object storage
COPY INTO nessie.bronze.raw_transactions
FROM '@s3source/raw/transactions/'
FILE_FORMAT 'parquet';
-- Or ingest via API/database replication into Iceberg tables
INSERT INTO nessie.bronze.raw_customers
SELECT * FROM postgres_source.public.customers;
Conventions:
raw_stripe_payments, raw_salesforce_accounts_ingested_at timestamp column for lineagePARTITION BY (MONTH(_ingested_at))What: Deduplicated, typed, validated data. Business logic starts here: column renames, type casts, null handling, deduplication.
Dremio implementation: Iceberg tables (materialized from Bronze) or Views (for lightweight transformations).
-- Materialized as an Iceberg table (best for large, reused datasets)
CREATE TABLE nessie.silver.customers AS
SELECT DISTINCT
CAST(id AS INT) AS customer_id,
TRIM(name) AS customer_name,
LOWER(email) AS email,
CAST(created_at AS TIMESTAMP) AS created_at
FROM nessie.bronze.raw_customers
WHERE id IS NOT NULL;
-- Or lightweight as a view (good for simple transformations)
CREATE VIEW analytics.silver.clean_transactions AS
SELECT
transaction_id,
amount,
CASE WHEN status = 'completed' THEN true ELSE false END AS is_completed,
event_date
FROM nessie.bronze.raw_transactions;
Conventions:
What: Aggregated, joined, and ready-for-consumption datasets. Optimized for specific business questions, dashboards, or ML features.
Dremio implementation: Views (preferred) — allows the Dremio query engine and reflections to optimize automatically.
-- Gold view: revenue by region by month
CREATE VIEW analytics.gold.monthly_revenue AS
SELECT
c.region,
DATE_TRUNC('MONTH', t.event_date) AS month,
SUM(t.amount) AS total_revenue,
COUNT(DISTINCT t.customer_id) AS unique_customers
FROM analytics.silver.clean_transactions t
JOIN nessie.silver.customers c ON t.customer_id = c.customer_id
WHERE t.is_completed = true
GROUP BY c.region, DATE_TRUNC('MONTH', t.event_date);
Conventions:
| Criteria | Use a View | Use an Iceberg Table |
|---|---|---|
| Data changes frequently | ✅ Always shows latest source data | Must re-run INSERT/MERGE to refresh |
| Needs DML (UPDATE/DELETE) | ❌ Views are read-only | ✅ Full DML support |
| Large aggregation/join cost | Add a reflection on the view | Materialize once, query fast |
| Needs time travel | ❌ Views don't have snapshots | ✅ AT SNAPSHOT / AT TIMESTAMP |
| Used by many downstream consumers | ✅ Views are cheap to create | Table + OPTIMIZE for large reads |
| Needs partitioning/sorting control | ❌ Views inherit source layout | ✅ Full control over data layout |
| Transformation is simple (rename, cast) | ✅ Near-zero cost | Overhead of table creation |
| Data pipeline with incremental loads | Tables with MERGE for upserts | Views can't receive data |
Rule of thumb:
Reflections are Dremio's query acceleration layer. They're materialized subsets of your data that the query engine uses automatically — users don't need to know they exist.
| Scenario | Reflection Type |
|---|---|
| Dashboard with aggregations (SUM, COUNT, AVG) | Aggregation reflection |
| Frequently queried subset of columns | Raw reflection with selected columns |
| Join between large tables | Raw reflection on the fact table with join columns |
| Filter on high-cardinality column | Raw reflection sorted by that column |
Best for dashboards and reports that repeatedly aggregate along the same dimensions.
ALTER TABLE nessie.silver.transactions
CREATE AGGREGATE REFLECTION txn_agg
USING DIMENSIONS (region, DATE_TRUNC('MONTH', event_date))
MEASURES (amount (SUM, COUNT, AVG), customer_id (COUNT));
Best for speeding up point lookups, filtered scans, or joins.
ALTER TABLE nessie.silver.transactions
CREATE RAW REFLECTION txn_raw
USING DISPLAY (transaction_id, customer_id, amount, event_date)
DISTRIBUTE BY (customer_id)
LOCALSORT BY (event_date);
sys.reflections to see which reflections are being matched| Table Size | Recommended Partitioning |
|---|---|
| < 100 GB | No partitioning needed |
| 100 GB – 1 TB | Single partition column (e.g., MONTH(date)) |
| 1 TB – 10 TB | Two-level (e.g., event_type + MONTH(date)) |
| > 10 TB | Consider BUCKET(n, id) for high-cardinality or three-level |
MONTH(date) is almost always better than DAY(date) for event data.region, partition by region.user_id, partition by date and LOCALSORT by user_id.CREATE TABLE nessie.silver.events (
event_id BIGINT,
event_type VARCHAR,
user_id INT,
event_date DATE,
payload VARCHAR
) PARTITION BY (MONTH(event_date))
LOCALSORT BY (user_id);
Build fact and dimension tables as Iceberg tables in Silver, then create Gold views that join them.
-- Fact table (Iceberg, partitioned)
CREATE TABLE nessie.silver.fact_sales (
sale_id BIGINT,
product_id INT,
customer_id INT,
store_id INT,
sale_date DATE,
quantity INT,
amount DOUBLE
) PARTITION BY (MONTH(sale_date));
-- Dimension tables (Iceberg, small, no partitioning needed)
CREATE TABLE nessie.silver.dim_products (
product_id INT, name VARCHAR, category VARCHAR, price DOUBLE
);
CREATE TABLE nessie.silver.dim_customers (
customer_id INT, name VARCHAR, email VARCHAR, region VARCHAR
);
-- Gold star-schema view
CREATE VIEW analytics.gold.sales_analysis AS
SELECT
f.sale_date,
p.category AS product_category,
c.region AS customer_region,
SUM(f.quantity) AS total_quantity,
SUM(f.amount) AS total_revenue
FROM nessie.silver.fact_sales f
JOIN nessie.silver.dim_products p ON f.product_id = p.product_id
JOIN nessie.silver.dim_customers c ON f.customer_id = c.customer_id
GROUP BY f.sale_date, p.category, c.region;
For Type 2 SCD (maintain history), use MERGE with effective dating:
MERGE INTO nessie.silver.dim_customers AS target
USING staging.customer_updates AS source
ON target.customer_id = source.customer_id AND target.is_current = true
WHEN MATCHED AND (target.name <> source.name OR target.region <> source.region) THEN
UPDATE SET is_current = false, effective_end = CURRENT_DATE
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, region, effective_start, effective_end, is_current)
VALUES (source.customer_id, source.name, source.email, source.region, CURRENT_DATE, NULL, true);
Analytics (Space)
├── gold/
│ ├── sales_analysis (View)
│ ├── customer_360 (View)
│ └── monthly_revenue (View)
├── silver/
│ ├── clean_transactions (View)
│ └── clean_customers (View)
└── reports/
├── executive_dashboard (View)
└── ops_metrics (View)
Guidelines:
pii, certified, draft)-- Create space and folder structure
-- (typically done via UI, CLI, or REST API)
-- Document a Gold view
-- Use Wiki API or CLI: dremio wiki set <view-id> --file docs/monthly_revenue.md
-- Tag a view
-- Use Tags API or CLI: dremio tag set <view-id> --tags "certified,finance,gold"
| Task | Frequency | SQL |
|---|---|---|
| OPTIMIZE (compact files) | Daily or after large loads | OPTIMIZE TABLE catalog.schema.table |
| VACUUM (expire snapshots) | Weekly | VACUUM TABLE ... EXPIRE SNAPSHOTS OLDER_THAN '...' |
| Refresh Silver tables | Per pipeline schedule | MERGE INTO or INSERT INTO ... SELECT |
| Reflection refresh | Automatic (Dremio handles) | Monitor via sys.reflections |