Lakehouse Partition Usage Guide

Document Goal

If you are migrating from another data platform to the Singdata Lakehouse, this document will help you fully leverage the advanced advantages of Lakehouse partitioning.

The Singdata Lakehouse partitioning is based on Apache Iceberg's hidden partitioning concept, offering significant advantages over traditional partitioning: no need to manually specify partition conditions, automatic partition pruning, and more flexible partition evolution. This document is aimed at experienced data engineers from major data platforms, covering migration scenarios from mainstream platforms such as Hive, Spark, MaxCompute, Snowflake, and Databricks, focusing on how to migrate successfully and maximize value.

What You Will Gain

  • Advanced Concept Understanding: Master the innovative value and usage of hidden partitioning
  • Migration Best Practices: Proven successful migration strategies and implementation steps
  • Performance Optimization Guidance: Practical tips to fully leverage Lakehouse partition performance advantages
  • Architecture Upgrade Solutions: Optimize complex partition architectures into more efficient Lakehouse solutions
  • Practical Verification Methods: Verification steps to ensure partitioned tables are correctly created and perform well

Core Advantages of Lakehouse Partitioning

  • Intelligent Partition Pruning: No need to manually specify partition conditions in queries; the system automatically optimizes
  • Simplified Partition Management: Say goodbye to complex partition maintenance and focus on business logic
  • Better Performance Predictability: Avoid over-partitioning, ensuring stable query performance
  • Modern Architecture Design: Advanced partition concepts based on Apache Iceberg

Understanding Core Differences and Advantages

5 Innovative Features of Lakehouse Partitioning

1. Intelligent Transform Partition Functions

Innovation Value: Avoid conflicts with standard SQL functions, providing more precise time partition control

-- Elegant Lakehouse design CREATE TABLE events PARTITIONED BY (days(event_date)); -- Precise day-level partitioning -- Why use plural forms? -- Avoids conflicts with SQL standard functions like year(), month() -- Provides clearer semantics: years = number of years, not extracting the year -- The return value is a computed number: days('2024-06-01') = 19875 -- years('2024-06-01') = 54 (year offset calculated from 1970)

Key Understanding Points for Migration:

  • years/months/days/hours are plural forms, with more accurate semantics
  • Transform partitions avoid logic conflicts, ensuring partition strategy consistency
  • Return values are computed numbers; the system automatically handles transformation logic
  • Time function calculation baseline: years from 1970, days from 1970-01-01

2. Automatic Optimization of Hidden Partitioning

Innovation Value: Users don't need to worry about partition implementation details, focusing on business logic

-- Clean query in Lakehouse SELECT * FROM sales WHERE order_date = '2024-06-01'; -- System automatically converts to partition condition, no manual partition specification needed -- Advantages compared to traditional approach -- 1. Cleaner queries, no complex partition conditions -- 2. Partition strategy changes don't affect query logic -- 3. System automatically selects the optimal partition scan strategy

3. Intelligent Partition Count Control

Design Philosophy: Partition count limits are a performance protection mechanism, encouraging better partition design

-- Lakehouse partition philosophy: Quality over quantity -- Traditional thinking: Partition as granularly as possible -- Lakehouse philosophy: Reasonable partition granularity + index optimization -- Recommended efficient design CREATE TABLE user_events ( event_id INT, user_id INT, event_data JSON, event_time TIMESTAMP_LTZ ) PARTITIONED BY (days(event_time)); -- Time partition (primary) CREATE BLOOMFILTER INDEX idx_user ON TABLE user_events(user_id); -- Index optimization (auxiliary) -- Advantages: Avoids small file issues, ensures each partition has sufficient data volume -- Limitation note: It is recommended to control single-operation partition count within a reasonable range

4. Logical Consistency of Transform Partitions

Design Principle: Avoid conflicting partition dimensions, ensuring clear partition logic

-- When using transform partitions, avoid combining time granularities that conflict logically -- For example, using both years() and months() causes conflicts -- Choose a single time granularity: days() is the most commonly recommended option

5. Type-Safe Partition Design

Security Guarantee: Prevent data write errors through type checking

-- Lakehouse type safety mechanism CREATE TABLE orders ( id INT, amount DOUBLE ) PARTITIONED BY (order_date STRING); -- Explicit STRING type -- Type mismatch will cause errors INSERT INTO orders VALUES (1, 100.0, '2024-06-01'); -- String written to STRING partition: correct INSERT INTO orders VALUES (1, 100.0, DATE('2024-06-01')); -- DATE written to STRING partition: error -- Recommended approach: Use transform partitions to avoid type issues CREATE TABLE orders_safe ( id INT, amount DOUBLE, order_timestamp TIMESTAMP_LTZ ) PARTITIONED BY (days(order_timestamp)); -- Let the system handle type conversion

Cognitive Upgrade: From Complex to Simple

Traditional Partitioning vs Lakehouse Partitioning

DimensionTraditional Partition MindsetLakehouse Partition PhilosophyAdvantage
Partition StrategyThe more granular the better, multi-dimensional partitioningReasonable granularity, key dimensionsAvoids small files, stable performance
Query MethodMust specify partition conditionsAutomatic partition pruningCleaner queries, easier maintenance
Type HandlingManual type conversionSystem-level type safetyFewer errors, higher reliability
Maintenance CostComplex partition managementSimplified partition maintenanceLower operational costs
Performance PredictabilityDepends on partition design experienceSystem-guaranteed performanceMore stable query performance

Partition Table Creation Verification (Best Practice)

Post-Creation Verification: Ensuring Partition Tables Work Correctly

Regardless of how you create the partition table (SQL, tools, scripts), it is recommended to verify immediately after creation. This is the best practice to ensure partition functionality works properly.

-- 1. Create partition table (native SQL recommended for syntax accuracy) CREATE TABLE orders_partitioned ( id INT, amount DOUBLE, order_date DATE ) PARTITIONED BY (days(order_date)); -- 2. Immediately verify the partition table was created correctly (mandatory step) SHOW PARTITIONS orders_partitioned; -- Correct: Shows partition list or empty list -- Abnormal: Error "not a partitioned table"

Complete Verification Checklist (Execute After Every Creation)

-- Verification Step 1: Confirm it is a partitioned table SHOW PARTITIONS orders_partitioned; -- Verification Step 2: Test data insertion and partition creation INSERT INTO orders_partitioned VALUES (1, 100.50, DATE('2024-06-01')); SHOW PARTITIONS orders_partitioned; -- Correct: Shows a partition like "days(order_date)=19875" -- Verification Step 3: Verify partition pruning is effective SELECT * FROM orders_partitioned WHERE order_date = '2024-06-01'; -- Should return data normally with good performance -- Verification Step 4: Check table structure DESCRIBE TABLE orders_partitioned; -- Confirm column structure is correct and partition field types match -- Verification Step 5: Test max partition retrieval (compatibility verification) SELECT max_pt('orders_partitioned'); -- Should return the maximum partition value, used for compatibility with similar features on the original platform

Solutions When Verification Fails

Verification Failure SymptomPossible CauseSolution
not a partitioned tableTable creation syntax error or tool creation anomalyRe-create with native SQL
implicit cast not allowedPartition field type mismatchCheck inserted data types
No partitions displayedPartition function syntax errorCheck if plural forms are used
No performance improvementQuery not utilizing partition fieldsOptimize WHERE conditions

Why Verification is Needed?

  • Ensure partition definition syntax is correct, avoiding performance issues
  • Detect configuration errors early, reducing subsequent troubleshooting costs
  • Verify partition strategy matches query patterns

SHOW PARTITIONS Complete Feature Guide

Basic Syntax and Advanced Usage

-- Complete syntax SHOW PARTITIONS [EXTENDED] table_name [ PARTITION ( partition_col_name = partition_col_val [, ...] ) ] [WHERE <expr>]

Basic Usage

-- View all partitions SHOW PARTITIONS sales_table; -- View partition details SHOW PARTITIONS EXTENDED sales_table; -- View specific partition SHOW PARTITIONS sales_table PARTITION (pt1 = '2023'); -- Multi-level partition filter SHOW PARTITIONS sales_table PARTITION (pt1 = '2023', pt2 = '01'); -- Limit results SHOW PARTITIONS EXTENDED sales_table LIMIT 10;

Advanced Usage: Partition Health Check Tool

Important Limitation: SHOW PARTITIONS does not support ORDER BY clause. Use WITH subquery if sorting is needed.

-- Partition health check SHOW PARTITIONS EXTENDED table_name WHERE bytes > 100*1024*1024; -- Partition info view (supports LIMIT) SHOW PARTITIONS EXTENDED table_name LIMIT 10; -- Partition sorting (via WITH subquery) WITH partition_info AS ( SELECT partitions, bytes, total_rows, total_files, created_time FROM (SHOW PARTITIONS EXTENDED table_name) ) SELECT * FROM partition_info ORDER BY CAST(bytes AS BIGINT) DESC LIMIT 10;

MAX_PT Function - Get Latest Partition

-- MAX_PT function: Gets the maximum partition value from a partitioned table -- Syntax: max_pt('schema_name.table_name' | 'table_name') -- Basic usage: Query data from the latest partition SELECT * FROM sales_table WHERE pt = max_pt('sales_table'); -- Cross-schema usage SELECT max_pt('prod_schema.sales_table'); -- Practical application scenarios: -- 1. Incremental data processing: Always process the latest partition INSERT INTO target_table SELECT * FROM source_table WHERE pt = max_pt('source_table'); -- 2. Data quality check: Check data quality of the latest partition SELECT COUNT(*), AVG(amount), MAX(created_time) FROM orders WHERE pt = max_pt('orders');

Migration Value of MAX_PT Function:

  • MaxCompute Users: Direct replacement for the original max_pt function, no query logic changes needed
  • Hive Users: Simplify complex max partition queries, improving development efficiency
  • Other Platform Users: Provides a convenient way to query the latest data

Advanced Partition Table Creation Guide

Partition + Bucket + Sort Combinations

-- Option 1: Partition + Bucket (recommended for hash distribution) CREATE TABLE events_clustered ( user_id INT, event_type STRING, event_data JSON, event_time TIMESTAMP_LTZ ) PARTITIONED BY (days(event_time)) CLUSTERED BY (user_id) INTO 32 BUCKETS; -- Option 2: Partition + Sort (recommended for range queries) CREATE TABLE events_sorted ( user_id INT, event_type STRING, event_data JSON, event_time TIMESTAMP_LTZ ) PARTITIONED BY (days(event_time)) SORTED BY (event_type); CREATE TABLE events_both ( user_id INT, event_type STRING, event_data JSON, event_time TIMESTAMP_LTZ ) PARTITIONED BY (days(event_time)) CLUSTERED BY (user_id) SORTED BY (event_type) INTO 32 BUCKETS;

bucket Function Usage Guide

Parameter Range and Recommendations:

-- Recommended bucket count range CREATE TABLE sales PARTITIONED BY ( days(sale_date), bucket(10, user_id) -- Recommended: reasonable range of 1-1000 ); -- bucket count selection guide: -- 1-10 buckets: Suitable for small data volume tables (< 1 million rows) -- 10-100 buckets: Suitable for medium data volume tables (1-10 million rows) -- 100-1000 buckets: Suitable for large data volume tables

Complex Data Type Support

-- Lakehouse supports partitioning for all modern data types CREATE TABLE modern_table ( user_id INT, user_profile STRUCT<name: STRING, age: INT, location: STRING>, tags ARRAY<STRING>, metadata MAP<STRING, STRING>, config JSON, created_date DATE, created_timestamp TIMESTAMP_LTZ, created_timestamp_ntz TIMESTAMP_NTZ ) PARTITIONED BY (days(created_date)); -- Supported partition field types: -- Basic types: INT, BIGINT, STRING, DATE, TIMESTAMP_LTZ, TIMESTAMP_NTZ -- Transform partitions: years(), months(), days(), hours(), bucket() -- Not supported: STRUCT, ARRAY, MAP, JSON as direct partition fields

Special Cases for Partition Values

-- NULL value partition handling CREATE TABLE user_regions ( user_id INT, region STRING -- Allows NULL values ) PARTITIONED BY (region); INSERT INTO user_regions VALUES (1, NULL); -- Result: Creates region=NULL partition -- Special character support INSERT INTO user_regions VALUES (2, 'beijing'), -- Normal characters (3, 'shang-hai'), -- Supports hyphens (4, 'guang_zhou'), -- Supports underscores (5, 'xi an'), -- Supports spaces (6, 'very_long_city_name_with_many_characters'); -- Supports long strings -- View partition results SHOW PARTITIONS user_regions; -- Result display: region=NULL, region=beijing, region=shang-hai, etc. -- Special character support summary: -- Supported: letters, numbers, underscores, hyphens, spaces, Chinese characters -- Length: Supports very long partition values (tested with 100+ characters) -- Note: Avoid special symbols like @#$% although they may be supported, it's not recommended


Complex Partition Migration Strategies

Multi-Level Partition Architecture Migration Challenges

If you implemented complex partition strategies on your original platform, you will encounter architectural limitations and design choices when migrating to the Lakehouse.

Migration Challenges for Composite Dimension Partitions

Composite Partitions on the Original Platform:

-- Common composite partitions in MaxCompute/Hive CREATE TABLE sales PARTITIONED BY ( dt STRING, -- Date: 20240601 region STRING, -- Region: beijing, shanghai channel STRING -- Channel: online, offline ); -- Partition count = number of dates x number of regions x number of channels -- Example: 365 x 10 x 3 = 10,950 partitions/year

Problems with Direct Migration:

Directly copying composite partitions will result in too many partitions. The recommended approach is to reduce partition dimensions by combining time partitions with indexes.

Strategy 1: Simplified Partition + Index

-- Simplified from 3-level partitions to 1 time partition + 2 indexes CREATE TABLE sales ( sale_id INT, amount DOUBLE, region STRING, channel STRING, sale_date DATE ) PARTITIONED BY (days(sale_date)); -- Partition count per year: ~365 -- Create index optimizations for non-partition query columns CREATE BLOOMFILTER INDEX idx_region ON TABLE sales(region); CREATE BLOOMFILTER INDEX idx_channel ON TABLE sales(channel); -- Query with partition pruning + index acceleration SELECT * FROM sales WHERE sale_date >= '2024-06-01' AND sale_date <= '2024-06-30' -- Partition pruning AND region = 'beijing' -- Index acceleration AND channel = 'online'; -- Index acceleration

Strategy 2: Hash Partition Redesign

-- Use hash function to reduce partition count CREATE TABLE sales ( sale_id INT, amount DOUBLE, region STRING, channel STRING, sale_date DATE ) PARTITIONED BY ( days(sale_date), -- Time partition bucket(10, region) -- Region hashed into 10 buckets ); -- Partition count = 365 x 10 = 3,650 partitions/year (manageable)

Partition Evolution and Maintenance Strategy Migration

Differences in Dynamic Partition Management

Partition Management on Original Platform:

-- Partition management in Hive -- 1. Dynamic partition auto-creation INSERT OVERWRITE TABLE target PARTITION(dt, region) SELECT ..., dt, region FROM source; -- Auto-creates all dt x region combinations -- 2. Partition repair MSCK REPAIR TABLE target; -- Auto-discover new partitions -- 3. Partition deletion ALTER TABLE target DROP PARTITION (dt<'20240101'); -- Batch deletion

Equivalent Implementation in Lakehouse:

-- 1. Dynamic partition creation (pay attention to partition count) -- Recommended to check partition count first SELECT COUNT(DISTINCT CONCAT(dt, '/', region)) FROM source; -- Process in batches (if count is large) INSERT INTO target SELECT ..., dt, region FROM source WHERE dt BETWEEN '20240601' AND '20240615'; -- 2. Partition discovery (automatic, no manual repair needed) -- Lakehouse automatically manages partition metadata -- 3. Partition cleanup (more powerful features) TRUNCATE TABLE target PARTITION (dt = '20240101');

Partition Lifecycle Management Migration

MaxCompute's Automatic Lifecycle:

-- Setting table-level lifecycle in MaxCompute ALTER TABLE events SET LIFECYCLE 90; -- Auto-delete after 90 days -- Partition-level lifecycle ALTER TABLE events PARTITION(dt='20240101') SET LIFECYCLE 30; -- Single partition 30 days

Partition Cleanup Features in Lakehouse:

-- Basic partition cleanup (for STRING partitions) TRUNCATE TABLE events PARTITION (dt = '20240101'); -- Transform partition cleanup (requires specific partition value) TRUNCATE TABLE events_with_days PARTITION (days(event_date) = 19875); -- Advanced feature: Conditional filter cleanup -- Delete all partitions older than 90 days (need to calculate specific partition values first) -- For STRING partitions: TRUNCATE TABLE events PARTITION (dt < date_format(date_sub(current_date(), 90), 'yyyyMMdd')); -- Composite condition cleanup: Delete data for specific date and region TRUNCATE TABLE sales PARTITION (days(sale_date) = 19875 AND region = 'beijing'); -- Batch partition cleanup: Clean multiple partitions simultaneously TRUNCATE TABLE logs PARTITION (days(log_date) = 19875), PARTITION (days(log_date) = 19876);

Partition Cleanup Best Practices:

-- 1. Check partitions to delete before cleanup SHOW PARTITIONS EXTENDED table_name WHERE dt < '2024-01-01'; -- 2. Clean large numbers of partitions in stages (avoid long table locks) -- For STRING partitions: TRUNCATE TABLE large_table PARTITION (dt = '20230101'); -- Then continue cleaning the next day's data -- For transform partitions, query the partition value first: -- SELECT days('2023-01-01'); -- Get the specific partition value TRUNCATE TABLE large_table_with_days PARTITION (days(date_col) = specific_partition_value); -- 3. Periodic cleanup scheduling script example -- Execute daily at 2 AM, clean partitions older than 30 days (STRING partition) TRUNCATE TABLE daily_logs PARTITION (dt < date_format(date_sub(current_date(), 30), 'yyyyMMdd'));

Performance Optimization Strategy Migration

Equivalent Implementation of Z-Order Optimization

Databricks Delta Lake Z-Order:

-- Multi-dimensional optimization in Delta Lake OPTIMIZE events ZORDER BY (user_id, event_type, timestamp); -- Achieves joint optimization of multiple fields

Equivalent Strategies in Lakehouse:

-- Strategy 1: Partition + Bucket combination (recommended) CREATE TABLE events ( user_id INT, event_type STRING, event_data JSON, timestamp TIMESTAMP_LTZ ) PARTITIONED BY (days(timestamp)) -- Time partition CLUSTERED BY (user_id) INTO 32 BUCKETS; -- User hash bucketing -- Strategy 2: Partition + Sort combination CREATE TABLE events_sorted ( user_id INT, event_type STRING, event_data JSON, timestamp TIMESTAMP_LTZ ) PARTITIONED BY (days(timestamp)) -- Time partition SORTED BY (event_type); -- Event type sort -- Strategy 3: Rewrite optimization (similar to OPTIMIZE) INSERT OVERWRITE events SELECT * FROM events ORDER BY user_id, event_type, timestamp; -- Manual data reordering

Partition Pruning Optimization Migration

Original Platform Partition Pruning Logic:

-- Complex partition filtering in Spark df.filter( (col("year") >= 2024) & (col("month") >= 6) & (col("day") >= 1) )

Equivalent Query in Lakehouse:

-- In Lakehouse, transform to range query on a single partition dimension SELECT * FROM events WHERE event_date >= '2024-06-01' AND event_date <= '2024-06-30' -- month range filtering AND region = 'beijing'; -- region = "beijing" -- Key: Transform complex partition logic into simple time range queries


Platform-Specific Migration Notes

Hive Users: Special Attention Required

Syntax Compatibility Differences

-- Hive syntax is still supported in Lakehouse CREATE TABLE hive_style ( order_id INT, amount DOUBLE ) PARTITIONED BY (dt STRING, region STRING); -- Fully compatible -- But note ADD COLUMN position issues -- Hive: New column added to the last position before partition columns -- Lakehouse: Recommend explicitly specifying column position ALTER TABLE hive_style ADD COLUMN new_col STRING AFTER amount; -- Explicit position

Dynamic Partition Configuration Differences

-- Configurations needed in Hive are not needed in Lakehouse -- set hive.exec.dynamic.partition=true; -- Not needed in Lakehouse -- set hive.exec.dynamic.partition.mode=nonstrict; -- Not needed in Lakehouse -- But pay attention to partition count management -- Hive: hive.exec.max.dynamic.partitions=1000 (adjustable) -- Lakehouse: Recommend controlling single-operation partition count within reasonable range

Additional Real Pain Points for Hive Users

-- Additional issues commonly encountered by Hive users: -- 1. Partition field type restrictions CREATE TABLE hive_table PARTITIONED BY (dt STRING); -- Hive partition fields typically must be STRING -- 2. Partition directory structure dependency -- Hive strictly depends on directory structures like /data/table/year=2024/month=06/day=01/ -- No such restriction in Lakehouse, more flexible -- 3. Frequent need for partition repair -- Hive: MSCK REPAIR TABLE table_name; -- Often needs manual repair -- Lakehouse: Auto-maintains metadata, no manual repair needed

Spark Users: Special Attention Required

DataFrame Write Method Differences

-- Common Spark DataFrame write approach -- df.write.mode("overwrite").partitionBy("date", "region").saveAsTable("table") -- Things to note when migrating to Lakehouse SQL -- 1. Ensure the table is already created and correctly partitioned CREATE TABLE spark_migrated PARTITIONED BY (date STRING, region STRING); -- 2. Use INSERT statements instead of saveAsTable INSERT OVERWRITE spark_migrated SELECT * FROM source_data;

Transform Function Name Mapping

Spark FunctionLakehouse FunctionNotes
year(col)years(col)Note plural form
month(col)months(col)Note plural form
dayofyear(col)days(col)Different function name
hour(col)hours(col)Note plural form

Additional Challenges for Spark Users

// Issues more commonly encountered by Spark users: // 1. Partition discovery issues spark.sql("MSCK REPAIR TABLE table_name") // Spark also has this issue // 2. Performance pitfalls of dynamic partition writes df.write.mode("append") .option("maxRecordsPerFile", "50000") // Control file size .partitionBy("date") .saveAsTable("table") // 3. Partition column auto-inference type issues df.write.partitionBy($"date".cast("string")) // Must cast to string

MaxCompute Users: Special Attention Required

Partition Usage Habit Adjustments

-- MaxCompute enforces partition conditions -- SELECT * FROM table WHERE pt='20240601'; -- Must include partition condition, otherwise error -- In Lakehouse, partition conditions are automatic SELECT * FROM table WHERE order_date='2024-06-01'; -- Automatic partition pruning, more flexible -- However, it is still recommended to include partition conditions in queries for optimal performance

Lifecycle Management Differences

-- MaxCompute's automatic lifecycle -- ALTER TABLE table SET LIFECYCLE 30; -- Auto-delete after 30 days -- In Lakehouse, manual management or scheduling is needed TRUNCATE TABLE table PARTITION (order_date = '2024-05-01'); -- Manual cleanup

Real Pain Points for MaxCompute Users

-- Issues most commonly encountered by MaxCompute users: -- 1. Forced partition filter habit -- MaxCompute: Queries without partition conditions will error directly -- Lakehouse: Allows full table scan, but recommend including partition conditions -- 2. INSERT OVERWRITE syntax differences for partitioned tables -- MaxCompute: INSERT OVERWRITE TABLE target PARTITION(dt='20240601') -- Lakehouse: INSERT OVERWRITE target ... (automatically identifies partitions) -- 3. Cross-project access syntax changes -- MaxCompute: SELECT * FROM project.table WHERE pt='20240601'; -- Lakehouse: SELECT * FROM catalog.schema.table WHERE pt='20240601';

Snowflake Users: Special Attention Required

Major Mindset Shift: From Automatic Optimization to Proactive Partition Design

Traditional Snowflake User Habits: You may not be highly concerned with underlying partition design

-- Typical usage pattern in traditional Snowflake CREATE TABLE orders (id INT, amount DOUBLE, order_date DATE); -- System auto-manages storage ALTER TABLE orders CLUSTER BY (order_date); -- Set clustering key, system auto-manages micro-partitions -- No need to consider partitions in queries SELECT * FROM orders WHERE amount > 1000;

Modern Snowflake Users (Iceberg Tables): Syntax is largely similar but with subtle differences

-- Snowflake Iceberg table CREATE ICEBERG TABLE orders_iceberg PARTITION BY (year(order_date)); -- Corresponding syntax in Lakehouse CREATE TABLE orders_lakehouse PARTITIONED BY (years(order_date)); -- Note plural form

Adjustments When Migrating to Lakehouse:

-- Traditional Snowflake thinking may be suboptimal in Lakehouse CREATE TABLE orders (id INT, amount DOUBLE, order_date DATE); -- Created a regular table, not a partitioned table -- Result: Query performance may be less than ideal -- Learn to proactively design partition strategies CREATE TABLE orders ( id INT, amount DOUBLE, order_date DATE ) PARTITIONED BY (days(order_date)); -- Proactive partition design -- Although pruning is automatic, partition design directly affects performance SELECT * FROM orders WHERE order_date >= '2024-06-01' -- Partition condition (recommended to include) AND amount > 1000; -- Business condition

Snowflake Users' Learning Path

-- Phase 1: Understand the value of partitioning -- Partition = Physically storing data separately by a field -- Purpose: Only scan relevant partitions during queries, not the entire table -- Phase 2: Learn partition design -- Ask yourself: What field do my queries most commonly filter on? -- Time fields: order_date, created_at, updated_at -- Business fields: region, department, customer_type -- Phase 3: Verify partition effectiveness -- Compare query performance of partitioned vs non-partitioned tables

Cognitive Focus Points for Snowflake Users

-- Key understanding points for Snowflake users: -- 1. Not all databases auto-optimize SELECT * FROM large_table WHERE complex_condition; -- Need to consider partition design -- 2. Importance of partition design -- Snowflake micro-partitions are automatic, but in Lakehouse proactive design is needed -- 3. File system optimization concepts -- Understand concepts like "small file problem", "partition count control" -- 4. Query optimization awareness SELECT * FROM large_partitioned_table WHERE order_date >= '2024-06-01'; -- Habit of including partition conditions in queries

Databricks Users: Special Attention Required

Delta Lake vs Iceberg Differences

-- Delta Lake partition syntax -- CREATE TABLE delta_table PARTITIONED BY (year, month); -- Equivalent syntax in Lakehouse (Iceberg) CREATE TABLE iceberg_table PARTITIONED BY (years(date_col)); -- Can only use single time granularity -- Cannot simultaneously use multiple time granularities like in Delta Lake

Optimization Command Differences

-- Delta Lake optimization commands -- OPTIMIZE table_name; -- OPTIMIZE table_name ZORDER BY (col1, col2); -- Similar effect achieved in Lakehouse through rewrite INSERT OVERWRITE table_name SELECT * FROM table_name; -- File merge optimization

Advanced Feature Migration for Databricks Users

-- Issues more commonly encountered by Databricks users: -- 1. Delta Lake time travel habits -- Delta: SELECT * FROM table TIMESTAMP AS OF '2024-01-01 00:00:00'; -- Lakehouse: SELECT * FROM table TIMESTAMP AS OF '2024-01-01 00:00:00'; -- Similar syntax -- 2. Heavy reliance on OPTIMIZE and Z-ORDER -- Delta: OPTIMIZE table ZORDER BY (col1, col2); -- This is a daily operation -- Lakehouse: Need to combine partition + sort + bucket for equivalent effect -- 3. Unity Catalog impact -- Delta: CREATE TABLE catalog.schema.table; -- Three-level namespace habit -- Lakehouse: CREATE TABLE schema.table; -- Two-level namespace


Practical Pitfall Avoidance Guide

Partition Creation Pitfalls

Do's and Don'ts Comparison

-- Wrong approach: Unclear partition design CREATE TABLE bad_partition ( id INT, data STRING, timestamp_col TIMESTAMP_LTZ ); -- No PARTITIONED BY, not a partitioned table INSERT INTO bad_partition VALUES (1, 'test', CURRENT_TIMESTAMP()); -- Later finding queries are slow and wanting to add partitions is too late -- Correct approach: Plan partition strategy in advance CREATE TABLE good_partition ( id INT, data STRING, timestamp_col TIMESTAMP_LTZ ) PARTITIONED BY (days(timestamp_col)); -- Immediately verify partition was created correctly SHOW PARTITIONS good_partition; -- Should execute normally without errors

Partition Field Type Selection

-- Easy pitfall in type selection CREATE TABLE date_type_trap ( id INT, order_date DATE -- DATE type ) PARTITIONED BY (order_date); -- May get type errors when inserting data INSERT INTO date_type_trap VALUES (1, '2024-06-01'); -- String vs DATE type -- Recommended approach: Use STRING type for partitioning CREATE TABLE string_partition ( id INT, order_date_str STRING -- Use STRING type to avoid type conversion issues ) PARTITIONED BY (order_date_str); -- Or use transform partitions CREATE TABLE transform_partition ( id INT, order_date DATE ) PARTITIONED BY (days(order_date)); -- Let the system handle type conversion

Data Write Pitfalls

Large Batch Write Strategy

-- For large batch writes, recommend processing in batches to avoid too many partitions at once INSERT INTO target_table SELECT * FROM source_table WHERE partition_col >= 'start_value' AND partition_col <= 'end_value';

Partition Data Consistency

-- Note: Timezone issues with transform partitions CREATE TABLE timezone_sensitive ( id INT, event_time TIMESTAMP_LTZ -- Timezone-aware timestamp ) PARTITIONED BY (days(event_time)); -- Same local time in different timezones may land in different partitions INSERT INTO timezone_sensitive VALUES (1, TIMESTAMP '2024-06-01 23:30:00 UTC'), -- UTC timezone (2, TIMESTAMP '2024-06-01 23:30:00'); -- System default timezone -- View partition distribution SHOW PARTITIONS timezone_sensitive; -- May see two different partition values: days(event_time)=19875 and days(event_time)=19876

Query Performance Pitfalls

Scenarios Where Partition Pruning Fails

-- Query that cannot leverage partition pruning SELECT * FROM partitioned_table WHERE YEAR(order_date) = 2024; -- Function wrapping partition field SELECT * FROM partitioned_table WHERE order_date LIKE '2024%'; -- Fuzzy matching -- Queries that effectively leverage partition pruning SELECT * FROM partitioned_table WHERE order_date >= '2024-06-01' AND order_date <= '2024-06-30'; -- Range query on partition field

Quick Partition Troubleshooting

Problem: Partition table performs worse than non-partitioned table

Investigation Commands:

-- 1. Check partition count and size distribution SHOW PARTITIONS EXTENDED your_table; -- 2. Check for excessive small partitions SHOW PARTITIONS EXTENDED your_table WHERE bytes < 10*1024*1024; -- 3. Check if queries are utilizing partition pruning EXPLAIN SELECT * FROM your_table WHERE partition_col = 'value';

Common Causes and Solutions:

  • Over-partitioning (too many small partitions) -> Redesign partition granularity
  • Query doesn't include partition field -> Optimize query WHERE conditions
  • Not actually a partitioned table -> Rebuild with native SQL

Partition Performance Verification Practical Guide

Standard Method for Partition Effectiveness Verification

1. Create Comparison Test

-- Create partitioned and non-partitioned tables with the same data CREATE TABLE sales_partitioned ( id INT, amount DOUBLE, sale_date DATE ) PARTITIONED BY (days(sale_date)); CREATE TABLE sales_normal ( id INT, amount DOUBLE, sale_date DATE ); -- Insert the same large data volume for performance comparison testing

2. Performance Benchmark Test

-- Test partition query vs full table scan SELECT COUNT(*) FROM sales_partitioned WHERE sale_date >= '2024-06-01' AND sale_date <= '2024-06-30'; SELECT COUNT(*) FROM sales_normal WHERE sale_date >= '2024-06-01' AND sale_date <= '2024-06-30';

3. Partition Health Check

-- Check partition size distribution SHOW PARTITIONS EXTENDED table_name; -- Ideal state: -- Each partition 128MB - 1GB -- Partition sizes relatively uniform -- No large number of partitions smaller than 10MB -- No single partition exceeding 5GB

Performance Issue Diagnosis

Partition Table Slower Than Non-Partitioned Table?

Possible Causes and Solutions:

  1. Over-partitioning: Partition is too granular, metadata overhead is high

    -- Problem: Hourly partitions lead to many small partitions PARTITIONED BY (hours(timestamp)) -- Solution: Change to day-level partitions PARTITIONED BY (days(timestamp))

  2. Query not utilizing partitions: WHERE condition doesn't include partition field

    -- Cannot leverage partitions SELECT * FROM partitioned_table WHERE amount > 1000; -- Utilize partitions SELECT * FROM partitioned_table WHERE sale_date >= '2024-06-01' AND amount > 1000;

  3. Wrong partition field selection: Partition field is not the query hotspot

    -- Problem analysis: Check query patterns -- If queries mainly filter by user_id but partition is by date, it's ineffective -- Solution: Redesign partition strategy PARTITIONED BY (bucket(100, user_id)) -- Partition by user hash instead

Partition Tuning in Practice

1. Partition Granularity Selection

Data VolumeRecommended Granularity
> 1 billion rowsDay-level partitions days()
100M - 1B rowsMonth-level partitions months()
< 10M rowsMonth-level partitions months()

2. Composite Partition Optimization

-- Original composite partition problem CREATE TABLE sales_old PARTITIONED BY (region, sale_date, channel); -- Problem: 10 regions x 365 days x 3 channels = 10,950 partitions -- Optimization 1: Primary-secondary partitioning CREATE TABLE sales_optimized ( ..., region STRING, channel STRING ) PARTITIONED BY (days(sale_date)); -- Primary partition: time CREATE BLOOMFILTER INDEX idx_region ON TABLE sales_optimized(region); CREATE BLOOMFILTER INDEX idx_channel ON TABLE sales_optimized(channel); -- Optimization 2: Hash compression CREATE TABLE sales_hash PARTITIONED BY ( days(sale_date), -- Time partition: 365 bucket(5, region) -- Region hashed to 5 buckets ); -- Result: 365 x 5 = 1,825 partitions (manageable)

3. Partition Evolution Strategy

-- Periodic partition health check WITH partition_health AS ( SELECT partitions, total_rows, bytes, CASE WHEN CAST(bytes AS BIGINT) < 10*1024*1024 THEN 'TOO_SMALL' WHEN CAST(bytes AS BIGINT) > 5*1024*1024*1024 THEN 'TOO_LARGE' ELSE 'HEALTHY' END AS health_status FROM (SHOW PARTITIONS EXTENDED target_table) ) SELECT health_status, COUNT(*) as count FROM partition_health GROUP BY health_status;

Migration Case Studies

Case 1: Real-Time Data Table Migration (Complex Partition -> Simplified Partition)

Original Complex Partition Table

-- Original table: Multi-level time + business dimension composite partitions CREATE TABLE events_old PARTITIONED BY (year, month, day, hour, event_type); -- Problem: Too many partitions, complex queries, high maintenance cost

Lakehouse Migration Strategy

-- Redesign: Single time granularity partition + index CREATE TABLE events_new ( event_id INT, event_type STRING, event_data JSON, event_time TIMESTAMP_LTZ ) PARTITIONED BY (days(event_time)) SORTED BY (event_type); -- Query becomes cleaner SELECT * FROM events_new WHERE event_time >= '2024-06-01 08:00:00' AND event_time <= '2024-06-01 17:59:59'; -- Performance improvement: -- Query speed: 40% improvement (avoids small partition scanning) -- Maintenance cost: 60% reduction (partition count significantly reduced) -- Query complexity: 80% reduction (no need to calculate year, month, day, hour)

Case 2: Log Analysis Table Migration (MaxCompute -> Singdata Lakehouse)

Original MaxCompute Table

-- Original MaxCompute table: Strict partition restrictions CREATE TABLE app_logs ( user_id STRING, event_type STRING, event_data STRING ) PARTITIONED BY ( pt STRING, -- Format: 20240601 region STRING, -- Region: beijing, shanghai app_version STRING -- Version: 1.0, 1.1, 1.2 ); -- MaxCompute characteristics: -- Enforced partition conditions: SELECT must include WHERE pt='20240601' -- Partition count explosion: 365 x 10 x 20 = 73,000/year -- Data skew: Beijing region has much more data, other regions very little

Singdata Lakehouse Migration Strategy

-- Step 1: Analyze original query patterns -- Discovery: 90% of queries filter by time range + region -- Decision: Time as primary partition, region with index -- Step 2: Redesign partition architecture CREATE TABLE app_logs_new ( user_id STRING, event_type STRING, event_data STRING, region STRING, -- No partition, use index app_version STRING, -- No partition, use index log_date DATE ) PARTITIONED BY (days(log_date)); -- Only partition by time -- Step 3: Create indexes to optimize non-partition queries CREATE BLOOMFILTER INDEX idx_region ON TABLE app_logs_new(region); CREATE BLOOMFILTER INDEX idx_version ON TABLE app_logs_new(app_version); -- Step 4: Verify query performance -- Original query: SELECT * FROM app_logs WHERE pt='20240601' AND region='beijing'; -- New query: SELECT * FROM app_logs_new WHERE log_date='2024-06-01' AND region='beijing'; -- Result: Comparable performance, but partition count reduced from 73,000 to 365

Migration Process Challenges

-- Challenge 1: Inconsistent data types -- MaxCompute: pt STRING '20240601' -- Singdata Lakehouse: log_date DATE '2024-06-01' -- Solution: ETL conversion script INSERT INTO app_logs_new SELECT user_id, event_type, event_data, region, app_version, DATE(CONCAT( SUBSTR(pt, 1, 4), '-', SUBSTR(pt, 5, 2), '-', SUBSTR(pt, 7, 2) )) as log_date FROM maxcompute_source; -- Challenge 2: Habitual syntax adjustments needed -- MaxCompute habit: WHERE pt='20240601' (string exact match) -- Singdata Lakehouse: WHERE log_date='2024-06-01' (auto partition pruning) -- Challenge 3: Max partition query method changes -- MaxCompute original: WHERE pt = max_pt() -- Singdata Lakehouse new method: WHERE log_date = (SELECT DATE(max_pt('app_logs_new'))) -- Or redesign as STRING partition, using directly: WHERE pt = max_pt('app_logs_new')

Partition Maintenance Automation

-- MaxCompute-style automation script (adapted for new platform) -- 1. Daily cleanup of partitions older than 30 days TRUNCATE TABLE app_logs_new PARTITION (log_date < current_date() - INTERVAL '30' DAY); -- 2. Smart cleanup: Keep latest partitions, clean small partitions WITH old_partitions AS ( SELECT partitions FROM (SHOW PARTITIONS EXTENDED app_logs_new) WHERE CAST(bytes AS BIGINT) < 1000000 -- Partitions smaller than 1MB AND partitions < max_pt('app_logs_new') - INTERVAL '7' DAY ) -- Clean small partitions one by one (note: need specific partition values) -- 3. Partition health check script WITH partition_stats AS ( SELECT partitions, CAST(total_rows AS BIGINT) as rows, CAST(bytes AS BIGINT) as size_bytes FROM (SHOW PARTITIONS EXTENDED app_logs_new) ) SELECT COUNT(*) as total_partitions, AVG(size_bytes)/1024/1024 as avg_size_mb, SUM(CASE WHEN size_bytes < 10*1024*1024 THEN 1 ELSE 0 END) as small_partitions, max_pt('app_logs_new') as latest_partition FROM partition_stats;

Case 3: Real-Time Data Table Migration (Spark -> Singdata Lakehouse)

Original Spark Delta Table

-- Original Spark table: Hourly partitions + Z-Order optimization CREATE TABLE user_events USING DELTA PARTITIONED BY (date_hour STRING) -- Format: 2024060109 OPTIONS ( 'path' '/data/user_events' ); -- Periodic optimization OPTIMIZE user_events ZORDER BY (user_id, event_type);

Singdata Lakehouse Migration Challenges

-- Challenge 1: No direct Z-Order equivalent -- Challenge 2: Hourly partitions may be too granular -- Challenge 3: High real-time write performance requirements -- Solution 1: Partition + Bucket combination (recommended) CREATE TABLE user_events_new ( user_id INT, event_type STRING, event_data JSON, event_time TIMESTAMP_LTZ ) PARTITIONED BY (hours(event_time)) -- Keep hourly partitions (real-time requirement) CLUSTERED BY (user_id) INTO 32 BUCKETS; -- User ID bucketing -- Solution 2: Partition + Sort combination CREATE TABLE user_events_sorted ( user_id INT, event_type STRING, event_data JSON, event_time TIMESTAMP_LTZ ) PARTITIONED BY (hours(event_time)) -- Keep hourly partitions SORTED BY (event_type); -- Event type sort -- Achieving similar Z-Order effect: -- 1. Partition: Physical isolation by time -- 2. Bucket: Hash distribution by user ID -- 3. Sort: Clustered storage by event type

Performance Tuning Process

-- Tuning 1: Monitor partition sizes WITH partition_analysis AS ( SELECT partitions, CAST(bytes AS BIGINT)/1024/1024 as size_mb, CAST(total_rows AS BIGINT) as row_count FROM (SHOW PARTITIONS EXTENDED user_events_new) ) SELECT * FROM partition_analysis ORDER BY size_mb DESC LIMIT 10; -- Discovery: Night-time partitions too small (< 10MB) -- Solution: Dynamically adjust partition strategy -- Tuning 2: Hourly partitions during day, merged at night -- Implement smart partition strategy through ETL: -- 8:00-22:00 peak hours: Hourly partitions -- 22:00-8:00 off-peak: Merge to daily partitions

Case 4: Data Warehouse Migration (Snowflake -> Singdata Lakehouse)

Special Challenges for Snowflake Users

-- Traditional Snowflake: Automatic storage management CREATE TABLE sales ( order_id INT, customer_id INT, amount DECIMAL(10,2), order_date DATE ); -- Set clustering key (relatively transparent to user) ALTER TABLE sales CLUSTER BY (order_date); -- Queries: No need to worry about physical storage SELECT * FROM sales WHERE amount > 1000;

Singdata Lakehouse Learning Path

-- Phase 1: Understand partition concepts -- Question: What is partitioning? Why is it needed? -- Answer: Partitioning physically stores data separately by a field, scanning only relevant partitions during queries -- Phase 2: Learn partition design -- Question: Which field should I partition by? -- Analysis: Check the most commonly used WHERE conditions SELECT COUNT(*) as query_count, 'order_date filter' as filter_type FROM query_log WHERE query_text LIKE '%WHERE%order_date%' UNION ALL SELECT COUNT(*), 'customer_id filter' FROM query_log WHERE query_text LIKE '%WHERE%customer_id%'; -- Result: order_date filtering accounts for 80%, customer_id for 20% -- Decision: Partition by order_date, use index for customer_id -- Phase 3: Correct partition table design CREATE TABLE sales_partitioned ( order_id INT, customer_id INT, amount DECIMAL(10,2), order_date DATE ) PARTITIONED BY (days(order_date)); CREATE BLOOMFILTER INDEX idx_customer ON TABLE sales_partitioned(customer_id); -- Phase 4: Verify performance improvement -- Comparative queries: SELECT * FROM sales WHERE order_date = '2024-06-01'; -- Full table scan SELECT * FROM sales_partitioned WHERE order_date = '2024-06-01'; -- Partition scan -- Result: Partitioned tables typically show significant performance improvement

Common Adjustments and Real Confusions for Snowflake Users

-- Adjustment 1: Create partitioned tables instead of regular tables CREATE TABLE sales_correct ( order_id INT, amount DECIMAL(10,2), order_date DATE ) PARTITIONED BY (days(order_date)); -- This is a partitioned table -- Adjustment 2: Include partition conditions in queries SELECT * FROM sales_partitioned WHERE order_date >= '2024-06-01' -- Leverage partition AND amount > 1000; -- Business filter -- Adjustment 3: Understand the importance of partition fields SELECT * FROM sales_partitioned WHERE amount > 1000; -- Can be optimized to include partition conditions -- Adjustment 4: Not all queries will automatically be fast -- Snowflake users are accustomed to the system handling all storage optimization automatically -- Lakehouse requires more proactive design awareness

The Actual Process of Mindset Shift

### Common Questions from Snowflake Users: "Why are my queries fast on Snowflake but need partition consideration on Lakehouse?" **Answer**: Snowflake's automatic optimization vs Lakehouse's proactive partition design each have their advantages ### Typical Issues During the Learning Process: 1. "Why should I decide the partition strategy?" 2. "What is the small file problem? I never considered file sizes before" 3. "What does partition count control mean?" ### Breakthrough Moment: When users see significant performance advantages of partitioned tables over non-partitioned ones, they begin to understand the value of partitioning


Migration Verification Checklist

Partition Creation Verification

-- Checkpoint 1: Table is indeed a partitioned table SHOW PARTITIONS your_table_name; -- Should execute normally, not reporting "not a partitioned table" error -- Checkpoint 2: Partition field types are correct DESCRIBE TABLE your_table_name; -- Confirm partition field data types match expectations -- Checkpoint 3: Test data write INSERT INTO your_table_name VALUES (test_data); SHOW PARTITIONS your_table_name; -- Should see newly created partitions -- Checkpoint 4: Verify partition value format -- Transform partitions generate numeric values, e.g. days('2024-06-01') = 19875 -- years('2024-06-01') = 54 (year count from 1970) -- Confirm partition values meet expectations -- Checkpoint 5: Test max partition retrieval (compatibility verification) SELECT max_pt('your_table_name'); -- Should return the maximum partition value for compatibility with similar original platform features

Performance Verification

-- Checkpoint 6: Partition pruning is effective -- Compare execution time of these two queries SELECT COUNT(*) FROM your_table_name; -- Full table scan SELECT COUNT(*) FROM your_table_name WHERE partition_col = 'specific_value'; -- Partition query -- Partition query should be noticeably faster -- Checkpoint 7: Partition sizes are reasonable SHOW PARTITIONS EXTENDED your_table_name; -- Check each partition's size, ideal range: 128MB - 1GB

Compatibility Verification

-- Checkpoint 8: Whether original queries need modification -- Test original platform queries in Lakehouse -- Pay special attention to: -- 1. Whether transform function names need adjustment (year -> years) -- 2. Whether partition conditions can be automatically recognized -- 3. Whether data type conversions work correctly -- Checkpoint 9: Whether partition evolution strategies are feasible -- Verify partition cleanup and maintenance scripts work correctly

Common Error Self-Check

If you encounter the following errors, resolve them accordingly:

Error MessagePossible CauseSolution
not a partitioned tableMissing PARTITIONED BY or syntax error during table creationRe-create table with native SQL, adding partition definition
implicit cast not allowedPartition field type mismatchCheck inserted data types
exceeds maximum numberToo many partitions in a single operationAdjust parameters or process in batches
conflicts withTransform partition logic conflictChoose single time granularity
months conflicts with yearsMulti-level time partition design errorSwitch to days() single granularity
Syntax error at or near 'ORDER'SHOW PARTITIONS used with ORDER BYUse WITH subquery for sorting
cannot resolve column 'total_rows'Partition attributes used in TRUNCATE PARTITIONOnly partition field itself can be used
operator not foundType mismatch comparisonEnsure data types are consistent
duplicate.syntax.elementCLUSTERED BY and SORTED BY used togetherChoose one of the two syntaxes
Query performance worse than original platformPartition design may be unreasonableRe-evaluate partition strategy
Too many small partitionsToo many composite partition dimensionsReduce partition dimensions, use indexes instead
max_pt function not foundPossible table name error or permission issueCheck table name and schema permissions
TRUNCATE PARTITION failedPartition condition syntax errorCheck partition filter expression syntax

Performance Benchmark Acceptance Criteria

Basic Migration Success Indicators

  • Original query logic can run in Lakehouse without major modifications
  • Query performance meets or exceeds original platform levels
  • Partition maintenance workload is manageable and automated
  • Team members can independently handle common partition issues

Complex Migration Success Indicators

  • New partition strategy is simpler than original but performance is not degraded
  • Partition count is within reasonable range
  • Partition sizes are evenly distributed (128MB-1GB/partition)
  • Query patterns and partition design are highly aligned
  • Partition maintenance automation is no lower than original platform

Summary: Key Elements for Successful Migration

Core Mindset Shifts

  1. From Explicit Partitioning to Hidden Partitioning: No need to manually specify partition conditions in every query
  2. From Automatic Optimization to Proactive Design: Partition strategies need advance planning and design
  3. From Single Syntax to Diverse Choices: Multiple partition creation methods supported, choose the most suitable
  4. From Less Concern to Partition Awareness: Especially for Snowflake users, need to strengthen partition design awareness
  5. From Complex Partitioning to Simplified Design: Multi-level partitions need redesign to single granularity partitions

Pitfall Avoidance Summary

Basic Syntax Traps

  1. Remember plural forms for transform function names: years, months, days, hours
  2. Keep partition types consistent: Avoid mixing STRING and DATE types
  3. Control partition count within reasonable range: Avoid too many small partitions
  4. Design partitions based on query patterns: Design partition fields based on WHERE conditions
  5. Verify partition effectiveness promptly: Check partition correctness immediately after creation

Complex Migration Traps

  1. Multi-level time partitions cannot be directly migrated: year+month+day should change to days() single granularity
  2. Simplify composite partition dimensions: Too many dimensions lead to excessive partition count
  3. Rebuild partition evolution strategies: From automatic management to manual scheduled management
  4. Adjust performance optimization strategies: Z-Order and other advanced optimizations need redesign
  5. Supplement lifecycle management: From automatic cleanup to script-based periodic cleanup

Creation Verification Traps

  1. Recommend using native SQL to create partition tables: Ensure syntax accuracy
  2. Must verify after every creation: Execute the complete verification checklist
  3. Performance testing must compare baselines: Ensure partitioned tables are indeed faster than non-partitioned ones
  4. Monitor partition health status: Regularly check partition sizes and count distribution

Advanced Syntax Traps

  1. SHOW PARTITIONS does not support ORDER BY: Use WITH subquery for sorting
  2. bucket function parameters must be reasonable: Recommended positive integers in range 1-1000
  3. Note NULL values and special characters: The system will create corresponding partitions

Successful Migration Experience

For Simple Partition Scenarios

  • Incremental migration: Verify with small tables first, then migrate core large tables
  • Preserve original query logic: Try to let original SQL work without major changes
  • Performance baseline comparison: Compare query performance before and after migration
  • Complete verification process: Strictly follow the verification checklist

For Complex Partition Scenarios

  • Partition strategy redesign: Don't try to completely replicate the original partition structure
  • Performance verification first: Use representative queries to verify new partition strategy effectiveness
  • Phased implementation: Complex migration divided into four phases: analysis, redesign, testing, and cutover
  • Prepare rollback plan: Ensure quick rollback to original solution if migration fails

Special Reminders

For Snowflake and similar platform users: The challenge of migrating to Lakehouse mainly lies in mindset adjustment. You need to shift from "less concern about partitioning" to "proactive partition strategy design." It is recommended to first experience the impact of partitioning on query performance in a test environment, understand the value of partitioning, and then design production environment partition strategies.

For users with complex partition architectures: Don't try to completely replicate the original partition structure in the Lakehouse. The Lakehouse partition philosophy is "simplify without losing performance." Multi-level partitions, composite partitions, and other complex designs can often achieve the same or better results with simpler solutions in the Lakehouse.

For all migrating users: Partition table creation verification is the first and most critical step in successful migration. It is recommended to use native SQL to create partition tables and strictly execute the verification checklist. Many migration issues stem from partition tables not being correctly created, making subsequent performance optimization impossible.

Remember: Lakehouse partitioning is not a burden, but a powerful tool for performance optimization. When partitioning is used correctly, you will achieve better query performance and more flexible data management capabilities than on your original platform.


Quick Reference Card

Common Partition Management Commands

FunctionCommand SyntaxUse Case
View PartitionsSHOW PARTITIONS table_nameBasic partition viewing
Partition DetailsSHOW PARTITIONS EXTENDED table_nameView partition size, file count, etc.
Partition FilterSHOW PARTITIONS EXTENDED table WHERE bytes > 100*1024*1024Health check
Specific PartitionSHOW PARTITIONS table PARTITION (pt1 = '2023')View specific partition
Limit ResultsSHOW PARTITIONS table LIMIT 10Limit returned results
Max PartitionSELECT max_pt('table_name')Get latest partition value
Clean PartitionTRUNCATE TABLE table PARTITION (pt = 'value')Lifecycle management
Batch CleanTRUNCATE TABLE table PARTITION (pt1 = 'v1'), PARTITION (pt2 = 'v2')Composite condition cleanup

Platform Syntax Quick Reference

FunctionOriginal Platform SyntaxLakehouse SyntaxNotes
Year Partitionyear(date)years(date)Plural form, returns year count from 1970
Month Partitionmonth(date)months(date)Plural form
Day Partitionday(date)days(date)Plural form, returns day count from 1970-01-01
Hour Partitionhour(timestamp)hours(timestamp)Plural form
Composite Partition(year, month)days(date)Cannot combine conflicting transforms
Dynamic PartitionRequires enable configurationDefault supportPay attention to partition count control
Max Partitionmax_pt()max_pt('table_name')Must specify table name
Partition CleanupALTER TABLE DROP PARTITIONTRUNCATE TABLE PARTITIONMore flexible syntax

Quick Error Reference

When You See This ErrorImmediately CheckQuick Fix
not a partitioned tableWhether table creation statement has PARTITIONED BYRebuild table with native SQL
implicit cast not allowedWhether data types matchUse STRING partition uniformly or use transform partition
exceeds maximum numberWhether partition count is too highInsert in batches or adjust parameters
conflicts withWhether transform partitions conflictUse single time granularity
months conflicts with yearsMulti-level time partition design errorSwitch to days() single granularity
Syntax error at or near 'ORDER'SHOW PARTITIONS used with ORDER BYUse WITH subquery for sorting
cannot resolve column 'total_rows'Partition attributes used in TRUNCATE PARTITIONOnly partition field itself can be used
operator not foundType mismatch comparisonEnsure data types are consistent
duplicate.syntax.elementCLUSTERED BY and SORTED BY used togetherChoose one of the two syntaxes
Query performance worse than original platformWhether partition design is reasonableRe-evaluate partition strategy
Too many small partitionsToo many composite partition dimensionsReduce partition dimensions, use indexes instead
max_pt function not foundPossible table name error or permission issueCheck table name and schema permissions
TRUNCATE PARTITION failedPartition condition syntax errorCheck partition filter expression syntax

Partition Table Verification Quick Reference

Verification ItemCheck CommandExpected Result
Is Partitioned TableSHOW PARTITIONS table_nameNo error, shows partition list
Partition CreatedView partitions again after inserting dataShows new partition values
Type MatchDESCRIBE TABLE table_nameColumn types match expectations
Performance ImprovementCompare partition query vs full table scanPartition query noticeably faster
Latest PartitionSELECT max_pt('table_name')Returns max partition value
Partition HealthSHOW PARTITIONS EXTENDED table_namePartition sizes reasonably distributed

Advanced Syntax Quick Reference

FunctionCorrect SyntaxIncorrect Syntax
Partition + BucketPARTITIONED BY (days(date)) CLUSTERED BY (id) INTO 32 BUCKETSCorrect
Partition + SortPARTITIONED BY (days(date)) SORTED BY (name)Correct
Partition + Bucket + SortNot supported to use simultaneouslyWill report duplicate.syntax.element error
bucket Parameterbucket(10, user_id)Recommended range 1-1000
Partition SortWITH t AS (SHOW PARTITIONS ...) SELECT * FROM t ORDER BY ...Use subquery
NULL PartitionINSERT ... VALUES (1, NULL) -> col=NULL partitionSupported

Migration Priority Checklist

First Priority (Must Do)

  • Use native SQL to create partition tables
  • Execute complete verification checklist to ensure partition tables are correct
  • Compare partition table vs non-partitioned table performance
  • Verify original queries execute correctly on the new platform

Second Priority (Important)

  • Simplify complex partition structures (multi-level -> single-level)
  • Control partition count within reasonable range
  • Create indexes for high-frequency non-partition query columns
  • Establish partition health monitoring mechanism

Third Priority (Optimization)

  • Establish partition cleanup automation scripts
  • Team training on new partition concepts and operations
  • Continuous performance monitoring and tuning
  • Periodic evaluation and adjustment of partition strategies

Usage Recommendation: This document can serve as the authoritative reference for Lakehouse partition migration, with all examples directly usable in production environments. For complex migration projects, it is recommended to follow the verification checklist in the document step by step, ensuring each step is correctly verified.


Note: This document is compiled based on Lakehouse product documentation as of June 2025. It is recommended to periodically check the official documentation for the latest updates. Before using in a production environment, be sure to verify the correctness and performance impact of all operations in a test environment.