OPTIMIZE Command
Overview
OPTIMIZE is similar to a VACUUM operation in traditional databases — it merges multiple small files into larger files to accelerate subsequent queries without changing the actual data content in the table.
OPTIMIZE is the core operation command in Singdata Lakehouse for table data optimization and compaction. By consolidating small files, cleaning up delete markers, and reorganizing data layout, it can significantly improve query performance and storage efficiency. Although Lakehouse automatically performs file compaction periodically in the background by default, in scenarios with frequent updates or where fine-grained control over compaction frequency is needed, you can manually invoke this command to meet specific business requirements. This command supports both asynchronous and synchronous execution modes, providing flexible optimization solutions for different scenarios.
Syntax
- table_name (Required)
- The name of the target table to optimize, in the format
[schema_name.]table_name.
- WHERE predicate (Optional)
- Partition filter condition; must include a complete partition column match condition.
- Supported formats:
partition_column = 'value'or compound partitiondt='2023-01-01' AND region='us'.
- OPTIONS (Optional)
- Lakehouse reserved parameters for controlling optimization behavior.
Notes
- This feature can only run in a General Purpose Virtual Cluster (GENERAL PURPOSE VIRTUAL CLUSTER); it will not take effect in an analytical cluster.
Core Features
- Small File Compaction: Merges multiple small data files into larger files, reducing file metadata overhead.
- Delete Marker Cleanup: Cleans up delete markers generated by UPDATE/DELETE operations, reclaiming storage space.
- Data Reorganization: Rearranges data layout to improve query performance.
Execution Modes
1. Asynchronous Execution Mode (Default)
Asynchronous execution is the default behavior of OPTIMIZE. The operation runs in the background without blocking the current connection.
Characteristics
- Non-blocking: Returns a Job ID immediately; the operation executes in the background.
Syntax
2. Synchronous Execution Mode
Synchronous execution blocks the current connection until the optimization operation is fully completed before returning the result.
Characteristics
- Blocking: Only returns after the operation completes; the connection is occupied during this time.
- Real-time Feedback: Immediately obtains detailed execution statistics and success status.
- Applicable Scenarios: Development and testing, small table optimization, verifying optimization effects.
- Deterministic: Ensures the operation is fully completed.
Syntax
Usage Examples
Example 1: Asynchronous Optimization of an Entire Table (Default)
Asynchronous mode returns immediately; the optimization task runs in the background. If the table files are already compact enough, the response is:
If a compaction job is triggered, a Job ID is returned. You can check progress via SHOW JOBS.
Example 2: Synchronous Optimization of a Specific Partition
doc_test.orders is partitioned by order_date. You can run synchronous optimization on a single partition:
Synchronous mode blocks until the operation completes. Example response:
Example 3: Synchronous Optimization of an Entire Table
Applicable Scenarios
- Storage cleanup after extensive UPDATE/DELETE operations
- Regular maintenance tasks to free temporary storage space
- Optimizing the overall storage layout of tables
Related Guides
- Small File Compaction Optimization: when to run OPTIMIZE, compaction strategy selection, and coordination with automatic optimization
