Table
A Table is the basic unit for storing data in the Lakehouse. Unlike row-oriented databases such as MySQL, Lakehouse tables use Parquet columnar storage — data is organized and compressed by column. Queries only need to read the relevant columns, dramatically reducing I/O. Unlike Hive's static partitioning, Lakehouse uses a hidden partitioning mechanism similar to Iceberg, where the partitioning strategy can be changed without affecting the data.
Core Features
Columnar storage: Uses Parquet format by default. Queries only read the columns they need, making it well-suited for large-scale analytics.
Primary key support: Once a primary key is defined, the system automatically deduplicates by primary key during real-time writes, which is ideal for CDC sync scenarios. The default behavior is ENABLE VALIDATE RELY, meaning SQL writes also enforce primary key checks.
Time Travel: Based on MVCC version management, historical versions of data are retained, allowing you to query the state of data at any past point in time.
Partitioning and bucketing: Supports hidden partitioning (similar to Iceberg) and hash bucketing (CLUSTER BY) to optimize query performance.
When to Use a Table?
Suitable Scenarios
| Scenario | Reason |
|---|---|
| ODS layer raw data ingestion | Requires precise control over write timing and preserving raw data |
| Dimension tables | Small data volume, infrequent updates, needs JOIN acceleration |
| CDC sync target table | Sync jobs write directly; pairs with Table Stream to consume changes |
| Temporary data staging | Fast writes, later consumed by Dynamic Tables or scheduled jobs |
Unsuitable Scenarios
| Scenario | Recommended Alternative | Reason |
|---|---|---|
| Aggregation results that need automatic refresh | Dynamic Table | Tables require manual data maintenance |
| Frequently queried intermediate results | Materialized View | Materialized views provide transparent acceleration without changing SQL |
| Federated queries over external data | External Table | No need to migrate data into the Lakehouse |
Quick Example
Creating a Partitioned Table with a Primary Key
Table Design Best Practices
1. Partition Design
Principle: Partition by the filter field most commonly used in queries — typically a time field. Use transform partitions (days/months/years) to reduce cardinality.
2. Bucket Design
Principle: Bucket by fields commonly used in JOINs or GROUP BYs to optimize shuffle operations.
3. Primary Key Design
Principle: Define a primary key for CDC sync scenarios so the system automatically deduplicates by primary key.
Common Issues
Issue 1: Partition Field Cardinality Too High
Problem: Partitioning by a high-cardinality field (such as user_id) causes partition explosion.
Symptom: Write fails with The count of dynamic partitions exceeds the maximum number 2048.
Solution:
- Keep the number of distinct values in the partition field within a reasonable range
- Prefer time fields with transform partitions (
days/months/years) - Avoid partitioning by highly dispersed fields like user IDs or order IDs
Issue 2: Time Travel Storage Cost
Problem: Long Time Travel retention is enabled on a frequently updated table, causing historical versions to consume large amounts of storage.
Solution:
- Keep the default 1-day retention for non-critical tables
- Set
PROPERTIES ('data_retention_days'='0')on temporary tables to disable Time Travel - Periodically check Time Travel usage:
DESC TABLE <table> EXTENDED
Issue 3: Primary Key Behavior Not as Expected
Problem: With the default primary key behavior, SQL writes also enforce primary key checks, causing bulk inserts to fail.
Solution:
- If you only need deduplication for real-time writes:
PRIMARY KEY DISABLE NOVALIDATE RELY - If you need strict uniqueness: keep the default
ENABLE VALIDATE RELY
Cost Implications
Storage Cost
- Data is compressed and stored in Parquet format, typically achieving a 3–10x compression ratio
- Time Travel retains historical versions, increasing storage (retention days × average daily change volume)
Compute Cost
- Writing data consumes VCluster CRU
- Query cost depends on the amount of data scanned (partitions/indexes can reduce scan volume)
Lifecycle Management
Data Archiving and Cleanup
Related Documentation
- CREATE TABLE — Table creation syntax
- ALTER TABLE — Modifying table structure
- Partitioning — Partition design guide
- Bucketing — Bucket design guide
- Time Travel — Querying historical data
- OPTIMIZE — Small file compaction
- Data Lifecycle — Automatically cleaning up expired data
