Data Transformation

Singdata Lakehouse covers four data transformation patterns: auto-incremental refresh pipelines (Dynamic Table), scheduled batch ETL (Studio task scheduling), change data capture (Table Stream), and query acceleration (Materialized View). Choose based on your latency requirements and trigger model.


I want to build a data pipeline (ODS → DWD → ADS)

Recommended: Dynamic Table, define a SQL query and the system automatically computes incremental updates and maintains the result — no manual scheduling required.

ScenarioApproachReference
Multi-layer pipeline with automatic incremental refreshDynamic TableDynamic Table Overview · Create Dynamic Table
End-to-end real-time ETL exampleDynamic Table tutorialReal-time ETL with Dynamic Tables
Near-real-time incremental processing pipelineDynamic Table + Table StreamBuild a Near-Real-Time Incremental Pipeline with Dynamic Tables
CDC data processing (real-time database change processing)Multi-table real-time sync + Dynamic TableComplete CDC and Data Processing Example
Implement SCD (Slowly Changing Dimensions)Table Stream + taskSCD Implementation Guide

I want to run scheduled batch ETL (T+1 / hourly)

Recommended: Studio SQL task + scheduling, write SQL in Studio, configure a Cron schedule, and monitor run status visually.

ScenarioApproachReference
SQL transformation task + periodic schedulingStudio SQL taskTask Development and Scheduling · Quick ETL Setup
Multi-task orchestration (with dependencies)Composite task / task groupComposite Task · Task Group
Python processing required (pandas / custom logic)Studio Python taskPython Task Development
Data modeling with dbtdbt + LakehouseIncremental Development with dbt on Lakehouse

I want to detect table data changes (CDC / incremental-driven)

Recommended: Table Stream, captures INSERT / UPDATE / DELETE changes on a table to drive downstream incremental processing.

ScenarioApproachReference
Capture table changes to drive downstream processingTable StreamTable Stream Overview · Create Table Stream
Table Stream best practicesTable Stream Best Practices

I want to accelerate queries (pre-computation / cached results)

Recommended: Materialized View, pre-computes and stores query results with automatic query rewriting for transparent acceleration.

ScenarioApproachReference
Pre-compute high-frequency complex queriesMaterialized ViewMaterialized View Overview · Create Materialized View
Query rewriting (transparent acceleration)Materialized View + query rewritingQuery Rewriting

I want to do SQL data transformation (cleansing / aggregation / joining)

ScenarioReference
SQL transformation basicsSQL Data Transformation Basics
Window functions (YoY / MoM / ranking)Data Transformation with Window Functions
Complex queries with CTEsData Transformation with CTEs
Nested data types (Array / Map / Struct)Nested Data Type Transformation
JSON data processingJSON Processing Guide for Complex Business Cases
Practical tipsSQL Transformation Tips
Funnel analysis and user behaviorFunnel Analysis Guide
Session analysis (Sessionization)Session Analysis Guide
Retention and cohort analysisRetention and Cohort Analysis Guide
Marketing attribution analysisAttribution Analysis Guide
Hierarchical queries (org charts / BOM)Hierarchical Query Workaround
Data deduplicationData Deduplication Guide
Data pivoting (rows to columns / columns to rows)Data Pivot and Transpose Guide
Cumulative calculations and running totalsRunning Total Guide

I want to ensure data quality

Recommended: Studio Data Quality Rules (DQC), configure validation rules to automatically intercept anomalous data before processing.

ScenarioApproachReference
Configure data quality check rulesStudio DQCQuick Data Quality Rule Setup
Complete data quality guideDQC rule configurationData Quality

I want to monitor pipeline run status

Recommended: Studio Operations Monitoring, visually view task run status, logs, and alerts.

ScenarioApproachReference
View task run logsStudio Operations CenterQuick Monitoring and Alerting Setup
Troubleshoot task failuresJob Profile diagnosticsJob History Analysis
DataOps production practicesComplete operations guideDataOps Data Safety and Stability Practices

Not sure which tool to use?

What is your transformation requirement? ├── Need continuously auto-refreshed results (data pipeline) │ ├── Data freshness is not critical (minutes acceptable) → Dynamic Table │ └── Data must always be current to support query rewriting → Materialized View ├── Periodic batch runs (T+1 / hourly) → Studio SQL task + scheduling ├── Need to detect row-level changes (INSERT/UPDATE/DELETE) → Table Stream └── One-time data cleansing / transformation → Write SQL directly (INSERT INTO ... SELECT)

For a complete tool selection guide, see: Real-Time Pipeline Selection Guide