Singdata Lakehouse Primary Key Usage Guide
Primary Key Constraint Overview
PRIMARY KEY is used to declare the unique identifier column(s) for records in a table. In big data scenarios, validating uniqueness for every write operation is extremely expensive, so the Lakehouse primary key constraint provides flexible behavior configuration. In the Lakehouse architecture, when a table with a defined primary key undergoes real-time data writes, the system automatically deduplicates data based on the primary key value. This is especially important in Change Data Capture (CDC) scenarios. For example, you can synchronize MySQL database binlog logs to the Lakehouse in real-time to ensure data consistency. After setting up the primary key, you need to handle data through the real-time data interface.
Lakehouse Primary Key Support and Default Behavior
Lakehouse supports two ways to directly specify a primary key, and by default, its behavior is set to ENABLE VALIDATE RELY. This means that when you specify a primary key when creating a table and do not explicitly specify other behaviors, the system automatically enables primary key validation and dependency.
Under this default behavior, whether it is a real-time write operation or a data write via SQL, the system will deduplicate based on the defined primary key. If an attempt is made to insert a record with a duplicate primary key value, the system will reject the insert operation to ensure primary key uniqueness. For example:
From the above example, it can be seen that in the default ENABLE VALIDATE RELY mode, the system strictly enforces primary key uniqueness, and primary key conflict checks are performed for both single-record inserts and batch inserts.
Customizing Primary Key Behavior
If, based on actual business requirements, you want primary key deduplication to be performed only by the real-time write mechanism and not by SQL writes, you can achieve this by setting the primary key behavior to DISABLE NOVALIDATE RELY. Here is a specific example:
Note that in DISABLE NOVALIDATE RELY mode, while real-time writes still deduplicate based on the primary key, SQL write operations are not constrained by primary key uniqueness, which may result in duplicate primary keys in the data. Therefore, when choosing this mode, you need to consider and manage the source and method of data writes to avoid potential data quality issues.
Primary Key Table Creation Syntax
- Primary key table without partitioning or bucketing
- Primary key table with bucketing.
-
When DISABLE NOVALIDATE RELY is not specified:
- Cluster Key must include the primary key column(s)
- Sort Key must include the primary key
-
When DISABLE NOVALIDATE RELY is specified:
- Cluster Key does not need to include the primary key column(s)
- Sort Key does not need to include the primary key
-
- Primary key table with partitioning.
- When DISABLE NOVALIDATE RELY is not specified:
- Partition Key must include the primary key column(s)
- When DISABLE NOVALIDATE RELY is specified:
- Partition Key does not need to include the primary key column(s)
- When DISABLE NOVALIDATE RELY is not specified:
Usage Notes
Type Selection Recommendations
- Prefer numeric types (such as int/bigint)
- Avoid variable-length types (such as string/varchar) to reduce index space usage
- Do not use floating-point types (float/double)
- Do not use nested types
Constraints
- Primary key columns must be defined as NOT NULL
- Schema Evolution is not supported (primary key cannot be modified after table schema changes)
Usage Examples
Create Table
When creating a Lakehouse table, you can specify a primary key. Primary key tables support two write modes:
ENABLE VALIDATE RELY(default): Primary key uniqueness is enforced during SQL writes (INSERT), and inserts with duplicate primary keys are rejected.DISABLE NOVALIDATE RELY: Primary key uniqueness is not validated during SQL writes, allowing duplicate primary key writes; however, real-time write interface (CDC) writes still deduplicate by primary key.
Below is an example of creating a table:
SDK Real-time Write Stream
Create Real-time Write Stream
When creating a real-time write stream using the SDK, you need to specify the operation type (CDC) and related options. Below is an example of creating a real-time write stream:
Specify Operation Type
Depending on requirements, different operation types can be specified:
Stream.Operator.UPSERT: Insert or update a row. If the target row does not exist, insert it; if it already exists, update it.Stream.Operator.DELETE_IGNORE: Delete a row. If the target row does not exist, it is automatically ignored.
Write Using the Native Java SDK
Write Using Lakehouse Real-time Sync
Refer to the documentation Multi-table Real-time Sync
Write Using FLINK CONNECTOR
Flink connector is built on the RealtimeStream SDK and used for real-time data synchronization. See Flink Connector
References
Java Real-time Programming Interface
Using Java SDK to Read Kafka Data for Real-time Upload
Singdata Lakehouse Multi-table Real-time Sync Implementing CDC
