Data Transformation with CTE

Let's first understand the basic concepts, advantages, and common use cases of using Common Table Expression (CTE) with Lakehouse SQL for data transformation.

Basic Concepts

A Common Table Expression (CTE) is an expression that defines a temporary result set that exists within the execution scope of a SQL query. It is typically used to simplify complex queries, perform recursive queries, or break down query steps. CTEs are usually introduced by the WITH keyword and can be used in subsequent SELECT, INSERT, UPDATE, and DELETE statements.

Basic syntax:

WITH cte_name AS ( SELECT column1, column2, ... FROM table_name WHERE condition ) SELECT * FROM cte_name;

Advantages

  1. High Readability: CTE can make complex queries more readable and maintainable. By breaking down complex query logic into multiple understandable parts, CTE enhances the structure and clarity of the query. CTE makes testing complex queries simpler.

    • CTE is a SELECT statement that can be reused within a single query.
    • Complex SQL queries often involve multiple subqueries. Multiple subqueries can make the code difficult to read.
    • Using Common Table Expressions (CTE) can make your queries more readable.
  2. Strong Reusability: CTE can be referenced multiple times within the same query, thus avoiding code duplication and improving query reusability.

  3. Step-by-Step Data Transformation: CTE allows the data transformation process to be implemented step by step, with each step defined as an independent CTE, making it easier to debug and test.

  4. Support for Recursive Queries: CTE supports recursive queries, making it very suitable for handling recursive data structures with hierarchical relationships, such as tree-structured data.

Usage Scenarios

Here are some common scenarios for data transformation using CTE:

1. Data Cleaning and Transformation

CTE can be used to clean and transform data step by step, such as removing duplicates and correcting data formats:

WITH cleaned_data AS ( SELECT DISTINCT column1, column2 FROM raw_table WHERE column1 IS NOT NULL ), transformed_data AS ( SELECT column1, UPPER(column2) AS transformed_column2 FROM cleaned_data ) SELECT * FROM transformed_data;

2. Grouping and Aggregation

CTE can be used for complex grouping and aggregation operations, and the aggregation results can be referenced in subsequent queries:

WITH total_sales AS ( SELECT customer_id, SUM(order_amount) AS total_spent FROM orders GROUP BY customer_id ) SELECT customer_id, total_spent FROM total_sales WHERE total_spent > 1000;

3. Data Merging and Joining

Using CTE can simplify multi-table joins and data merging operations:

WITH customer_orders AS ( SELECT c.customer_id, c.customer_name, o.order_id, o.order_date FROM customers c JOIN orders o ON c.customer_id = o.customer_id ) SELECT * FROM customer_orders WHERE order_date > '2024-01-01';

Data Model

TPC-H data represents a data warehouse for an auto parts dealer, recording orders, items that make up the orders (lineitem), suppliers, customers, parts sold (part), regions, countries, and parts suppliers (partsupp).

Singdata Lakehouse has built-in shared TPC-H data, which each user can directly use by adding the data context, for example:

SELECT * FROM clickzetta_sample_data.tpch_100g.customer LIMIT 10;

Data Transformation Using Singdata Lakehouse SQL CTE

How to Define CTE

-- CTE Definition WITH supplier_nation_metrics AS ( -- Define CTE 1 using WITH keyword SELECT n.n_nationkey, SUM(l.l_QUANTITY) AS num_supplied_parts FROM clickzetta_sample_data.tpch_100g.lineitem l JOIN clickzetta_sample_data.tpch_100g.supplier s ON l.l_suppkey = s.s_suppkey JOIN clickzetta_sample_data.tpch_100g.nation n ON s.s_nationkey = n.n_nationkey GROUP BY n.n_nationkey ), buyer_nation_metrics AS ( -- Define CTE 2 SELECT n.n_nationkey, SUM(l.l_QUANTITY) AS num_purchased_parts FROM clickzetta_sample_data.tpch_100g.lineitem l JOIN clickzetta_sample_data.tpch_100g.orders o ON l.l_orderkey = o.o_orderkey JOIN clickzetta_sample_data.tpch_100g.customer c ON o.o_custkey = c.c_custkey JOIN clickzetta_sample_data.tpch_100g.nation n ON c.c_nationkey = n.n_nationkey GROUP BY n.n_nationkey ) SELECT -- No comma needed before the final SELECT statement n.n_name AS nation_name, s.num_supplied_parts, b.num_purchased_parts FROM clickzetta_sample_data.tpch_100g.nation n LEFT JOIN supplier_nation_metrics s ON n.n_nationkey = s.n_nationkey LEFT JOIN buyer_nation_metrics b ON n.n_nationkey = b.n_nationkey LIMIT 10;

Calculate the Amount Lost Due to Discounts

Use the lineitem table to get the price of items in the order (excluding discounts) and compare it with the order. First, determine the granularity of the comparison. Think step by step, first get the price of all items in the order excluding discounts, and then compare it with the order data of totalprice that has already calculated the discount.

WITH lineitem_agg AS ( SELECT l_orderkey, SUM(l_extendedprice) AS total_price_without_discount FROM clickzetta_sample_data.tpch_100g.lineitem GROUP BY l_orderkey ) SELECT o.o_orderkey, o.o_totalprice, l.total_price_without_discount - o.o_totalprice AS amount_lost_to_discount FROM clickzetta_sample_data.tpch_100g.orders o JOIN lineitem_agg l ON o.o_orderkey = l.l_orderkey ORDER BY o.o_orderkey;

Do not overuse CTE. Pay attention to code readability.

  1. CTE can help improve the readability and reusability of queries.

  2. Do not overuse CTE; pay attention to the size of the query.

    • SQL queries with multiple temporary tables are better than 1000-line SQL queries with numerous CTEs.
    • Keep the number of CTEs in each query small (depending on the size of the query, but usually < 5).
  3. If you have doubts about the performance of CTE, check your query plan.

Resources

Common Table Expression (CTE)