Singdata Lakehouse Table Stream Best Practices Guide
The Role of Table Stream in Enterprise Data Organization
In modern data-driven enterprises, real-time capture and processing of data changes has become a critical capability. Enterprise data organizations typically face the following challenges:
- Decision delays caused by cross-system data synchronization latency
- Difficulties with incremental updates in complex ETL processes
- Complexity in tracking and auditing data change history
- Challenges in implementing real-time data integration and event-driven architectures
Singdata Lakehouse's Table Stream feature is a core component designed to address these challenges. It plays a key role in enterprise data organization:
- Data Integration Hub: Serves as the core mechanism for Change Data Capture (CDC), facilitating real-time data flow between different systems
- Data Quality Assurance: Provides traceability of data changes, supporting data lineage and impact analysis
- Real-Time Analytics Foundation: Delivers data change streams for real-time data warehouses, instant reports, and dashboards
- Event-Driven Triggers: Acts as an event source to drive downstream business processes and automated operations
- Data Governance Pillar: Supports compliance requirements by recording change history of sensitive data
In the data architecture, Table Stream connects OLTP systems and analytical systems, enabling enterprises to build a modern data platform with unified batch and streaming, improving data timeliness and business responsiveness.
Table of Contents
- Introduction
- Preparation
- Creation and Configuration
- Using Different Modes
- Consuming and Processing Data
- Using Metadata Fields
- Real-World Application Scenarios
- Performance Optimization
- Common Issues and Solutions
- Best Practices Summary
1. Introduction
1.1 What Is Table Stream
Table Stream is a core feature of the Singdata Lakehouse architecture, providing Change Data Capture (CDC) capabilities to record insert, update, and delete operations on table data. It creates a "change table" that enables users to query and consume row-level change records between two transaction time points.
1.2 Core Features
- Change Capture: Records table-level DML operations (INSERT, UPDATE, DELETE)
- Metadata Recording: Provides metadata such as version and timestamp for each change
- Incremental Processing: Supports incremental reading and processing of data changes
- Consumption Mechanism: Supports consuming change data and advancing the offset through DML operations
1.3 Applicable Scenarios
- Data synchronization and replication
- Real-time data integration
- Incremental ETL/ELT processes
- Auditing and data governance
- Event-driven architectures
2. Preparation
2.1 Table Configuration Requirements
Before using Table Stream, you must ensure the source table is properly configured:
2.2 Creating a Table Stream
A Table Stream can be created directly on any regular table without additional configuration:
2.3 Preparing the Target Table
If you plan to write Stream data to a target table, create a target table with a compatible structure in advance:
3. Creation and Configuration
3.1 Basic Syntax
Basic syntax for creating a Table Stream:
3.2 Important Parameters
3.2.1 TABLE_STREAM_MODE
- STANDARD: Captures all DML operations (INSERT, UPDATE, DELETE), reflecting the current state of the table
- APPEND_ONLY: Captures only INSERT operations, preserving original INSERT records even if rows are updated or deleted
3.2.2 SHOW_INITIAL_ROWS
- TRUE: Returns all existing rows from the table when the Stream was created on first consumption
- FALSE (default): Returns only new changes after the Stream was created on first consumption
3.3 Time Point Configuration
You can specify the time point at which the Stream starts capturing changes:
Best Practice: Use current_timestamp() or a specific timestamp string, avoiding complex time expressions.
3.4 Adding Comments
Add descriptive comments to the Stream:
Note: Use the correct syntax COMMENT 'comment content', not COMMENT = 'comment content'.
3.5 Multiple Consumer Pattern
A single Stream can only be fully consumed by one consumer. Once task A consumes the Stream via a DML operation, the offset advances, and when task B queries the same Stream, that batch of change data is already gone.
If multiple downstream tasks (or different downstream systems) all need to consume changes from the same table, create a separate Stream for each consumer:
Each Stream maintains its own offset independently, so A consuming does not affect B or C. A Stream only stores the offset and does not replicate table data, so the additional storage cost of creating multiple Streams is minimal.
4. Using Different Modes
4.1 STANDARD Mode
Recommended Use: When you need the complete current state of the table, including update and delete operations.
Characteristics:
- Accurately reflects the current state of the table
- Updates show the latest values
- Deleted rows do not appear in the results
4.2 APPEND_ONLY Mode
Recommended Use: When you need to preserve all insert records, including those subsequently updated or deleted.
Characteristics:
- Records all INSERT operations
- Does not reflect UPDATE and DELETE operations
- Original INSERT records are preserved even if rows are deleted
4.3 Mode Selection Guide
| Requirement | Recommended Mode |
|---|---|
| Data synchronization (keep target consistent with source) | STANDARD |
| Auditing all insert records | APPEND_ONLY |
| Incremental ETL processes | STANDARD |
| Historical record preservation | APPEND_ONLY |
5. Consuming and Processing Data
5.1 Offset Advancement Rules
Understanding when the offset advances is fundamental to using Table Stream correctly.
Core rule: the offset only advances after a DML transaction that includes the Stream is successfully committed.
| Operation | Offset Advances? |
|---|---|
SELECT * FROM stream | No |
INSERT INTO t SELECT ... FROM stream (successfully committed) | Yes |
INSERT INTO t SELECT ... FROM stream WHERE ... (successfully committed) | Yes (even if WHERE filters out some data) |
| Transaction containing Stream is rolled back | No |
| Transaction containing Stream fails | No |
WHERE conditions do not affect offset advancement: even if a WHERE clause filters out most of the data, as long as the DML transaction is successfully committed, all data in the Stream is consumed and the offset advances to the current position.
If you want to process only part of the data without losing the rest, first consume everything into a staging table, then filter from there:
5.2 Querying Stream Data
Important: Using only SELECT queries will not advance the Stream's offset. You can view the same batch of change data repeatedly.
5.3 Consuming and Advancing Offset
To advance the Stream's offset (consume data), you must use DML operations:
5.4 Consumption Modes
5.4.1 Full Consumption
5.4.2 Conditional Consumption (note risk of data loss)
Note: Even when using a WHERE condition, the offset for all Stream data will still advance. If you need to retain filtered-out data, first consume everything into a staging table.
5.5 Verifying Consumption Status
Verify whether data has been consumed by querying the Stream again:
If consumption was successful, COUNT should be 0 or contain only new change data.
6. Using Metadata Fields
6.1 Available Metadata Fields
The results returned by Table Stream include the following metadata fields:
__change_type: Change type__commit_version: Commit version__commit_timestamp: Commit timestamp
6.2 Change Type Reference
In STANDARD mode, the __change_type field takes the following four values:
__change_type | Meaning | Notes |
|---|---|---|
INSERT | New row | Source table executed an INSERT |
UPDATE_BEFORE | Old value before update | Paired with UPDATE_AFTER; __commit_version is the old version number |
UPDATE_AFTER | New value after update | Paired with UPDATE_BEFORE; __commit_version is the new version number |
DELETE | Deleted row | Source table executed a DELETE; field values of the deleted row are preserved |
An UPDATE operation produces two rows: UPDATE_BEFORE (old value) and UPDATE_AFTER (new value). Both rows share the same id but have different __commit_version values. This behavior is unrelated to the SHOW_INITIAL_ROWS parameter — it is consistent under both settings.
SHOW_INITIAL_ROWS controls whether data already in the table when the Stream was created is visible, and does not affect the values of __change_type:
FALSE(default): Data already in the table when the Stream was created is not visible; only changes that occur after the Stream is created are captured.TRUE: Data already in the table when the Stream was created is exposed asINSERTrecords. After the initial snapshot is consumed, subsequent changes produceUPDATE_BEFORE/UPDATE_AFTER/DELETEnormally.
Standard pattern for using a Stream in a MERGE statement:
In APPEND_ONLY mode, __change_type is always INSERT. UPDATE and DELETE operations produce no records.
6.3 Using Metadata for Incremental Processing
6.4 Metadata Field Best Practices
__change_typeis unrelated toSHOW_INITIAL_ROWS: Regardless of the default orTRUEsetting, in STANDARD mode UPDATE always producesUPDATE_BEFORE/UPDATE_AFTERand DELETE always producesDELETE- Always use the complete pattern in MERGE: distinguish
UPDATE_AFTER/DELETE/INSERTby__change_type;UPDATE_BEFORErows can be ignored - Use
__commit_versionand__commit_timestampto track the order of changes - Save the maximum version number consumed for disaster recovery
7. Real-World Application Scenarios
7.1 Real-Time Data Synchronization
This can be combined with scheduled tasks or triggers for automated synchronization.
7.2 Incremental ETL Process
7.3 Event-Driven Processing
7.4 Audit Trail
8. Performance Optimization
8.1 Reducing Data Volume
- Select only necessary columns rather than
SELECT * - Set appropriate retention periods on source tables
- Regularly consume Stream data to avoid accumulation
8.2 Batch Processing
8.3 Parallel Processing
Split large Streams into multiple smaller parts for parallel processing:
8.4 Frequency Optimization
- High change rate tables: Consume Streams more frequently
- Low change rate tables: Reduce consumption frequency
- Critical tables: Real-time or near-real-time consumption
- Non-critical tables: Batch periodic consumption
9. Common Issues and Solutions
9.1 Stream Not Capturing Changes
Issue: Stream fails to capture table changes after creation.
Solution:
- Confirm DML operations were executed after Stream creation
- Verify you have sufficient permissions
- Confirm that streaming write data has been committed (real-time writes may require waiting approximately 1 minute)
9.2 UPDATE_BEFORE/UPDATE_AFTER/DELETE Not Appearing in Stream
Issue: Querying a STANDARD mode Stream shows only INSERT records, with no UPDATE or DELETE records.
Root cause: The UPDATE/DELETE operations occurred before the Stream was created. A Stream can only capture changes that happen after it is created. The values of __change_type (INSERT/UPDATE_BEFORE/UPDATE_AFTER/DELETE) are unrelated to the SHOW_INITIAL_ROWS parameter.
Solution:
- Confirm that UPDATE/DELETE operations were executed after the Stream was created.
- If you need to capture historical changes, use
TIMESTAMP AS OFto set the Stream's starting offset to a point in time before the operations occurred. - Use
DESC TABLE STREAMto checkcurrent_offset_timeand confirm whether the Stream's current offset is earlier than the time of the changes you expect to capture.
9.3 Duplicate Data Consumption
Issue: Repeatedly running consumption logic causes duplicate data in the target table.
Solution:
- Use MERGE statements instead of INSERT
- Implement idempotent processing
- Record the last consumed version and timestamp
9.4 Offset Not Advancing After Consumption
Issue: Querying again after consumption still returns the same data.
Solution:
- Ensure data is consumed using DML operations (INSERT, UPDATE, MERGE)
- Do not use only SELECT queries, which do not advance the offset
- Check whether the DML operation was successfully committed
10. Best Practices Summary
10.1 Design Principles
- Create directly: Table Stream can be created directly on any regular table without additional configuration
- Choose the right mode: Select STANDARD or APPEND_ONLY mode based on requirements
- Create a separate Stream for each consumer: Different downstream tasks cannot share the same Stream — the first consumer to consume it will make the data invisible to subsequent consumers
- Only DML advances the offset: SELECT does not consume data; WHERE conditions do not prevent the offset from advancing, and filtered-out data is discarded
- Consume regularly: Do not let Streams accumulate too much data; the consumption frequency should be well within the source table's
DATA_RETENTION_DAYS(default 1 day) - Understand the effect of
SHOW_INITIAL_ROWS: This parameter controls whether data already in the table when the Stream was created is visible; it does not affect the values of__change_type. Regardless of the setting, in STANDARD mode UPDATE always producesUPDATE_BEFORE/UPDATE_AFTERand DELETE always producesDELETE
10.2 Usage Checklist
- Confirm the source table exists and has the correct structure
- Select the appropriate Stream mode (STANDARD / APPEND_ONLY)
- Create a separate Stream for each downstream consumer
- Choose
SHOW_INITIAL_ROWSbased on requirements:'FALSE'(default): Data already in the table when the Stream was created is not visible; only changes after Stream creation are captured'TRUE': Data already in the table when the Stream was created is exposed asINSERTrecords in the first consumption- Under both settings, the
__change_typebehavior in STANDARD mode is identical (UPDATE producesUPDATE_BEFORE/UPDATE_AFTER, DELETE producesDELETE)
- Use DML operations to consume data (do not use SELECT only)
- Confirm whether data filtered by WHERE conditions can be discarded (if not, first consume everything into a staging table)
- Implement idempotent consumption mechanism (MERGE instead of INSERT)
- Consumption frequency < source table DATA_RETENTION_DAYS (default 1 day) to avoid Stream expiry
- Monitor Stream backlog and consumption latency
- Implement error handling and retry logic
10.3 Keys to Successful Implementation
- Understand the Mechanism: Master how Stream works and its limitations
- Test Properly: Fully test before deploying to production
- Maintain Regularly: Monitor and optimize Stream performance
- Record State: Track consumption status to ensure data consistency
- Design for Fault Tolerance: Consider disaster recovery and edge cases
By following these best practices, you will be able to fully leverage the Singdata Lakehouse Table Stream feature to build efficient and reliable data change capture and processing pipelines.
References
- Singdata Table Stream Documentation - Feature description and syntax reference
- Singdata Table Stream Creation Syntax - Detailed creation syntax and parameter descriptions
- Change Data Capture (CDC) Best Practices - General best practices related to change data capture
- Singdata SQL Reference Manual - Complete SQL syntax reference, including Table Stream related operations
Note: This guide is based on testing results from the Singdata Lakehouse version as of May 2025. Subsequent versions may change. Please regularly check the official documentation for the latest information.
