Develop Dynamic Tables for Near Real-Time Incremental Processing

Tutorial Overview

Through this tutorial, you will learn how to use Lakehouse Dynamic Table to build a complete real-time data ETL processing flow. The scenario design is as follows:

This tutorial will implement a streaming data processing pipeline case with minute-level latency through data import tasks, data cleaning and transformation based on dynamic tables, and data aggregation tasks.

Prerequisites

Dynamic Table is an object type that supports processing only incremental change data, similar to materialized views. When creating a dynamic table, you need to define the data computation logic. By performing a refresh action, you can trigger the computation logic and update the dynamic table. Compared to traditional ETL tasks, dynamic tables only require the declaration of full semantic computation logic at the time of definition, and dynamic tables can automatically perform incremental computation optimization during refresh. With dynamic tables, Lakehouse can perform near real-time incremental processing for streaming data, providing data preparation for real-time analysis.

Tutorial Steps

  1. Data Import: Create the raw table and continuously write user behavior data;
  2. Develop Dynamic Table Model: Construct data cleaning and data aggregation processing flows through Dynamic Table;
  3. Verify Processing Results: Monitor the refresh execution status of the data pipeline constructed by the dynamic table, and check the changes in the consumption layer data table to verify the real-time data processing results.

Through the above steps, you will learn how to develop streaming processing tasks in Lakehouse using dynamic tables.

Preparation

First, create a behavior log table and write test data through the INSERT INTO task.

This tutorial provides SQL scripts for creating data tables and inserting test data in the "Development" module. Open the [Tutorial_Working_With_Dynamic_Table->Step01.Preparation] script file, configure a 1-minute interval scheduling strategy, and submit the deployment to simulate real-time data import.

Transform Layer Dynamic Table Model Development

In the [Development] module, open the [Tutorial_Working_With_Dynamic_Table->Step02.Transformation_With_Dynamic_Table] sample dynamic table task file. This dynamic table defines the ETL logic for cleaning and transforming the raw table.

Refer to the figure below to configure the "Run Cluster" and "Scheduling Parameters":

Submit Deployment

Click the [Submit] button to deploy the dynamic table model to the Lakehouse target environment. The system will automatically refresh according to the refresh frequency set by the dynamic table.

Aggregate Layer Dynamic Table Model Development

In the [Development] module, open the [Tutorial_Working_With_Dynamic_Table->Step03.Aggregation_With_Dynamic_Table] sample dynamic table task file. This dynamic table defines the ETL logic for aggregating and analyzing the intermediate data model.

Please refer to the following image for scheduling configuration and submitting the run:

Verify Incremental Update Results

In the [Development] module, open the Step04.Check_Data_Freshness file. Check the data freshness of the automatically refreshed dynamic table by executing the Query. As shown in the image above, with the help of the dynamic table's automatic incremental refresh, real-time data processing can be completed with a delay of about 1 minute.

Environment Cleanup

In the [Task Operations] module, take the data import task offline from the periodic task list; take the two dynamic table model tasks offline from the dynamic table task list.

FAQ

Q1: Data is not updated after the Dynamic Table refreshes. How do I troubleshoot?

  • Run SHOW DYNAMIC TABLE REFRESH HISTORY to check the recent refresh status and confirm whether the refresh executed successfully
  • Confirm whether new data has been written to the source table (compare row counts before and after the refresh using SELECT COUNT(*))
  • Check whether the refresh interval is set appropriately — a long INTERVAL will cause data delays
  • If refresh_mode shows FULL instead of INCREMENTAL, the SQL contains operators that do not support incremental computation

Q2: Refresh fails with "incremental refresh not supported". What should I do?

  • Some SQL operators do not support incremental computation, such as window functions with ORDER BY or certain subqueries
  • Solution: simplify the SQL logic, or accept full refresh by using CREATE OR REPLACE DYNAMIC TABLE ... REFRESH_MODE = FULL to force full mode

Q3: Which VCluster should I use for refresh execution?

  • Refresh is a compute-intensive operation; it is recommended to use a GENERAL type cluster
  • Do not share the same cluster with query workloads to avoid refresh tasks affecting query response times
  • For high-frequency refreshes (< 1 minute), it is recommended to use a dedicated small-spec cluster (XSMALL)

Q4: When should I use Dynamic Tables vs. regular Studio scheduled tasks?

ScenarioRecommended approach
Pure SQL incremental processing with simple dependenciesDynamic Table (automatically manages dependencies and refresh)
Requires Python processing or calling external APIsStudio task
Multiple upstream/downstream tasks with complex dependency orchestrationStudio task flow
Requires precise execution timing (e.g. on the hour)Studio scheduled task