Clustered Key and Sorted Key
When creating a table, specifying data distribution and sorting methods using the CLUSTERED BY and SORTED BY clauses can significantly improve performance in specific query scenarios.
Syntax
Parameter Description
| Parameter | Required | Description |
|---|---|---|
CLUSTERED BY (col_name, ...) | No | Specifies the clustering key columns. The system computes a hash on these column values and distributes data across buckets. |
SORTED BY (col_name [ASC|DESC], ...) | No | Specifies the sorting key columns within each bucket. Data is stored in this order within each bucket. Ascending by default. |
INTO num BUCKETS | Required with CLUSTERED BY | Specifies the number of buckets. It is recommended that each bucket holds approximately 128 MB–1 GB of data. |
Clustered Key Description
The clustered key determines how data is distributed across buckets. The system computes a hash on the clustered key column values, and data with the same hash value falls into the same bucket.
Criteria for selecting clustered keys:
- Choose columns with a wide range of values and few duplicates to achieve even data distribution and avoid data skew.
- If a column is frequently used as a
JOINkey in queries, setting it as the clustered key enables Bucket Join, significantly improving JOIN performance. - When no clustered key is specified, the system defaults to 256 buckets.
- Too many buckets result in a large number of small files, impacting metadata management and I/O efficiency. Too few buckets fail to fully utilize parallel processing capabilities.
Sorted Key Description
The sorted key defines the physical sort order of data within each bucket. For queries that filter or sort by the sorted key, pre-sorted data reduces the scan range and improves query performance.
- Each sort column can be specified as
ASC(ascending, default) orDESC(descending). - Sorted keys have some impact on write performance. Sorting during large-scale data writes consumes additional resources.
Examples
- Create an orders table clustered by
customer_idand sorted byorder_date:
When querying by customer_id, the system only needs to scan the corresponding bucket without a full table scan:
- Create a table with a multi-column clustered key, suitable for multi-dimensional JOIN scenarios:
- Specify only the clustered key without a sorted key:
- Combine partitioning and clustering (data is clustered within each partition):
Notes
SORTED BYmust be used together withCLUSTERED BYand cannot be specified alone.- It is recommended to estimate the number of buckets based on data volume: aim for approximately 128 MB–1 GB of data per bucket.
- Clustered keys and sorted keys cannot be modified via
ALTER TABLEafter table creation. The table must be recreated. - For partitioned tables, clustering is performed independently within each partition, and the bucket count applies to each partition.
