| name | hologres-schema-generator |
| description | Hologres DDL schema design and table creation expert. Use for generating CREATE TABLE statements,
choosing storage formats (column/row/row-column), configuring table properties (distribution_key,
clustering_key, bitmap_columns, event_time_column), designing partition tables, selecting data types,
and optimizing table schemas for different query patterns.
Triggers: "hologres建表", "hologres create table", "hologres schema", "hologres DDL",
"distribution_key", "clustering_key", "bitmap_columns", "event_time_column", "列存", "行存",
"行列共存", "分区表", "hologres table design", "表设计", "hologres数据类型"
|
Prerequisites
This skill requires hologres-cli to be installed first:
pip install hologres-cli
export HOLOGRES_SKILL=hologres-schema-generator
All SQL execution depends on hologres-cli commands (hologres sql run --write, hologres table create).
Hologres Schema Design & Table Creation
Generate optimized Hologres DDL statements with proper storage format, indexing, and distribution strategies.
Information Gathering (IMPORTANT)
Before generating DDL, check whether the user has provided enough context. Many Hologres table properties are immutable after creation (orientation, distribution_key, clustering_key, event_time_column, primary key, partition column) — getting them wrong means recreating the table.
Required Information
If ANY of the following are unclear, ask the user before generating DDL:
| Dimension | Why Critical | Example Question |
|---|
| Query pattern | Determines orientation (column/row/row-column) | "This table is mainly used for OLAP analytics, KV point lookups, or both?" |
| Column definitions | Core schema structure | "Can you list the main columns and their rough types (text, number, timestamp, etc.)?" |
| Primary key / unique key | Determines UPSERT capability; immutable | "Does this table need deduplication or updates by a unique key? If so, which column(s)?" |
| Main query conditions | Determines distribution_key, clustering_key, bitmap | "What columns are most often used in WHERE, JOIN, or GROUP BY?" |
Conditionally Required
Ask these only when relevant signals appear:
| Signal | Follow-up Question |
|---|
| Table looks like time-series or log data | "What is the estimated daily data volume? (determines whether to use partitioning)" |
| User mentions JOIN with other tables | "Which column will this table JOIN on, and what is the distribution_key of the other table?" |
| Large data volume or mentions retention | "How long should data be retained? (recommend dynamic partition management for lifecycle)" |
| User mentions both analytics and serving | "What is the read pattern — batch scans, point lookups by ID, or both?" |
Can Use Defaults (No Need to Ask)
| Property | Default Behavior |
|---|
bitmap_columns | Auto-enabled for TEXT columns; can ALTER later |
dictionary_encoding_columns | Use :auto mode; can ALTER later |
storage_mode | Default hot; can ALTER later |
time_to_live_in_seconds | Not recommended. Deletion time is non-deterministic. Use dynamic partition management instead |
Gathering Flow
User request → Check available info
|
+-- Columns + query pattern + key info all clear?
| → Generate DDL directly
|
+-- Missing critical info?
| → Ask 1-3 focused questions (batch them, don't ask one at a time)
|
+-- Very vague request (e.g., "help me create a user table")?
→ Ask: 1) main columns 2) query pattern 3) key/dedup needs
Principle: Batch questions into a single round. Never ask more than 3-4 questions at once. If the user provides partial info, fill in reasonable defaults for the rest and explain your assumptions in comments.
Storage Formats
Choose the storage format based on your primary query pattern.
| Format | Syntax | Best For | Primary Key |
|---|
| Column store | orientation = 'column' | OLAP, aggregation, scan-heavy queries | Optional |
| Row store | orientation = 'row' | Point lookups, high-QPS KV queries | Required |
| Row-column store | orientation = 'row,column' | Mixed workloads (OLAP + point lookup) | Required |
Default recommendation: Use row,column (row-column store) when the workload is unclear. It handles both OLAP and point queries well.
Quick Start
CREATE TABLE orders (
order_id BIGINT NOT NULL,
user_id BIGINT,
amount DECIMAL(18,2),
status TEXT,
order_time TIMESTAMPTZ NOT NULL,
ds TEXT NOT NULL,
PRIMARY KEY (order_id)
)
PARTITION BY LIST (ds)
WITH (
orientation = 'column',
distribution_key = 'order_id',
clustering_key = 'order_time:asc',
event_time_column = 'order_time',
bitmap_columns = 'status,user_id'
);
CREATE TABLE user_profile (
user_id BIGINT NOT NULL,
name TEXT,
email TEXT,
tags JSONB,
updated_at TIMESTAMPTZ,
PRIMARY KEY (user_id)
)
WITH (
orientation = 'row',
distribution_key = 'user_id',
clustering_key = 'user_id'
);
CREATE TABLE Syntax (V2.1+)
CREATE TABLE [IF NOT EXISTS] [schema.]table_name (
column_name data_type [NOT NULL] [DEFAULT expr],
...
[PRIMARY KEY (col1 [, col2, ...])]
)
[PARTITION BY LIST (partition_column)]
WITH (
orientation = '{column | row | row,column}',
distribution_key = 'col1[,col2]',
clustering_key = 'col1[:asc|:desc][,col2[:asc|:desc]]',
event_time_column = 'col',
bitmap_columns = 'col1[,col2,...]',
dictionary_encoding_columns = 'col1[:auto|:on|:off][,...]',
time_to_live_in_seconds = 'N',
storage_mode = '{hot | cold}',
table_group = 'group_name'
);
Legacy Syntax (All Versions)
BEGIN;
CREATE TABLE table_name (...);
CALL set_table_property('table_name', 'orientation', 'column');
CALL set_table_property('table_name', 'distribution_key', 'col1');
CALL set_table_property('table_name', 'clustering_key', 'col1:asc');
CALL set_table_property('table_name', 'event_time_column', 'col1');
CALL set_table_property('table_name', 'bitmap_columns', 'col1,col2');
COMMIT;
Table Properties Reference
distribution_key (Distribution Key)
Controls how data is hash-distributed across shards. Critical for JOIN and GROUP BY performance.
| Rule | Description |
|---|
| Choose high-cardinality columns | Avoid data skew |
| Use JOIN/GROUP BY columns | Enable local computation, avoid shuffle |
| Max 2 columns | More columns reduce distribution effectiveness |
| Must be subset of PK | If table has a primary key |
| Immutable after creation | Cannot ALTER, must recreate table |
distribution_key = 'user_id'
distribution_key = 'order_id,user_id'
clustering_key (Clustering Key / Sorted Index)
Physically sorts data within files. Accelerates range queries and filters.
| Rule | Description |
|---|
| Use range-query columns | e.g., timestamp, date columns |
| Max 2-3 columns | More columns dilute sort benefit |
| Put high-selectivity column first | Most-filtered column goes first |
| Specify sort order | :asc (default) or :desc |
| Column store only | Row store uses PK as sort key |
clustering_key = 'order_time:asc'
clustering_key = 'ds:asc,order_time:asc'
event_time_column (Segment Key)
Organizes data files by time ranges. Enables file-level pruning for time-range queries.
| Rule | Description |
|---|
| Use time/date columns | Timestamp of data ingestion or event time |
| At most 1 column | Only one segment key per table |
| Column store only | Not applicable to row store |
| Combine with partition | Partition for coarse pruning, segment key for fine pruning |
event_time_column = 'event_time'
bitmap_columns (Bitmap Index)
Builds bitmap indexes for fast equality filtering on low-to-medium cardinality columns.
| Rule | Description |
|---|
| Use filter columns | Columns frequently in WHERE clause |
| Low-medium cardinality | status, type, region — NOT user_id |
| TEXT columns auto-enabled | Default bitmap for text columns |
| Can be added after creation | ALTER TABLE ... SET (bitmap_columns = ...) |
bitmap_columns = 'status,payment_type,region'
dictionary_encoding_columns (Dictionary Encoding)
Compresses text columns by mapping values to integers. Speeds up GROUP BY and aggregations.
| Rule | Description |
|---|
Use :auto mode | Let Hologres decide based on cardinality |
| Good for low-cardinality text | country, status, category |
| Avoid high-cardinality columns | user_id, order_id — no compression benefit |
| Don't set if unsure | Incorrect setting may hurt performance |
dictionary_encoding_columns = 'country:auto,status:auto'
time_to_live_in_seconds (TTL) — NOT RECOMMENDED
Do NOT use time_to_live_in_seconds. The actual deletion time is non-deterministic — data will be deleted at an arbitrary time after the specified TTL, not at a precise point. This makes it unreliable for data lifecycle management.
Recommended alternative: Use dynamic partition management — create daily/hourly partitions and drop old partitions on a schedule (via cron or scheduling system). This gives you precise, predictable data lifecycle control.
CREATE TABLE events (
event_id BIGINT NOT NULL,
ds TEXT NOT NULL,
PRIMARY KEY (ds, event_id)
) PARTITION BY LIST (ds)
WITH (orientation = 'column', distribution_key = 'event_id');
DROP TABLE IF EXISTS events_20251101;
Primary Key Design
| Storage Format | PK Requirement | Notes |
|---|
| Column store | Optional | Add PK only if needed for UPSERT |
| Row store | Required | PK drives the row-store index |
| Row-column store | Required | PK serves both point lookup and analytics |
Rules:
- Max 32 columns in composite PK
- PK columns must be NOT NULL and UNIQUE
- Prohibited types:
FLOAT, DOUBLE, NUMERIC, ARRAY, JSON, JSONB
- PK cannot be altered after creation — must recreate table
- Avoid
SERIAL as PK — causes table-level locks on write
Partition Table Design
Use partitions for large tables with time-based or categorical data.
CREATE TABLE events (
event_id BIGINT NOT NULL,
user_id BIGINT,
event_type TEXT,
ds TEXT NOT NULL,
PRIMARY KEY (ds, event_id)
)
PARTITION BY LIST (ds)
WITH (
orientation = 'column',
distribution_key = 'event_id',
clustering_key = 'ds:asc',
event_time_column = 'ds'
);
CREATE TABLE events_20251201 PARTITION OF events FOR VALUES IN ('20251201');
CREATE TABLE events_20251202 PARTITION OF events FOR VALUES IN ('20251202');
Rules:
- Only
LIST partitioning is supported
- Partition column must be part of PK (if PK exists)
- Supported partition column types:
TEXT, VARCHAR, INT, DATE (V1.3.22+)
- Skip daily partitions if daily data < 100M rows — use
event_time_column instead
- Always filter on partition column in queries for pruning
Scenario-Based Templates
1. High-QPS Point Lookup (KV)
CREATE TABLE user_kv (
user_id BIGINT NOT NULL PRIMARY KEY,
profile JSONB,
updated_at TIMESTAMPTZ
)
WITH (
orientation = 'row',
distribution_key = 'user_id'
);
2. Prefix Range Scan
CREATE TABLE order_lines (
order_id BIGINT NOT NULL,
line_no INT NOT NULL,
product_id BIGINT,
quantity INT,
PRIMARY KEY (order_id, line_no)
)
WITH (
orientation = 'row',
distribution_key = 'order_id',
clustering_key = 'order_id'
);
3. Time-Range Analytics (Partitioned)
CREATE TABLE page_views (
view_id BIGINT NOT NULL,
user_id BIGINT,
page_url TEXT,
view_time TIMESTAMPTZ,
ds TEXT NOT NULL,
PRIMARY KEY (ds, view_id)
)
PARTITION BY LIST (ds)
WITH (
orientation = 'column',
distribution_key = 'view_id',
clustering_key = 'view_time:asc',
event_time_column = 'view_time',
bitmap_columns = 'user_id,page_url'
);
4. Dimension Filtering (Non-Time)
CREATE TABLE product_stats (
product_id BIGINT NOT NULL PRIMARY KEY,
category TEXT,
brand TEXT,
sales_count BIGINT,
revenue DECIMAL(18,2)
)
WITH (
orientation = 'column',
distribution_key = 'product_id',
clustering_key = 'category:asc',
bitmap_columns = 'category,brand'
);
5. JOIN-Optimized Fact + Dimension
CREATE TABLE fact_orders (
order_id BIGINT NOT NULL PRIMARY KEY,
customer_id BIGINT,
amount DECIMAL(18,2),
order_date TEXT
)
WITH (
orientation = 'column',
distribution_key = 'customer_id'
);
CREATE TABLE dim_customers (
customer_id BIGINT NOT NULL PRIMARY KEY,
name TEXT,
region TEXT
)
WITH (
orientation = 'row,column',
distribution_key = 'customer_id'
);
6. Mixed Workload (OLAP + Serving)
CREATE TABLE realtime_metrics (
metric_id BIGINT NOT NULL,
device_id BIGINT NOT NULL,
value DOUBLE PRECISION,
ts TIMESTAMPTZ NOT NULL,
ds TEXT NOT NULL,
PRIMARY KEY (ds, device_id, metric_id)
)
PARTITION BY LIST (ds)
WITH (
orientation = 'row,column',
distribution_key = 'device_id',
clustering_key = 'ts:asc',
event_time_column = 'ts',
bitmap_columns = 'device_id'
);
References
Best Practices
- Choose storage format first — column for OLAP, row for KV, row-column when unsure
- Set
distribution_key to JOIN/GROUP BY columns — avoids cross-shard shuffle
- Set
clustering_key for range-query columns — improves time-range and filter queries
- Set
event_time_column for time-series data — enables file-level pruning
- Use
bitmap_columns for low-cardinality filter columns — status, type, region
- Partition by date only if daily data > 100M rows — otherwise use segment key alone
- PK must include partition column — required by Hologres for partition tables
- Align
distribution_key across JOINed tables — same column enables local join
- Use
WITH syntax (V2.1+) — cleaner than CALL set_table_property in transactions
- Avoid
SERIAL as primary key — causes table-level write locks