SQL Data Transformation Basics
SQL data transformation is the process of converting data from one format or structure to another. The goal is to clean, integrate, and shape data so it can be effectively stored, analyzed, and utilized.
Basic Concepts
- Extract: Pulling data from various data sources such as databases and file systems. Common operations include data querying and data export.
- Transform: Performing various operations on the extracted data to meet the requirements of the target storage or analysis system. Common transformation operations include data cleansing, type conversion, and data aggregation.
- Load: Loading the transformed data into the target storage system, such as a data warehouse or data lake.
Common Data Transformation Operations
-
Data cleansing: Removing or correcting noise, duplicates, and errors in the data. For example:
- Delete null values:
DELETE FROM table_name WHERE column_name IS NULL; - Correct erroneous data:
UPDATE table_name SET column_name = 'Correct Value' WHERE column_name = 'Incorrect Value';
- Delete null values:
-
Type conversion: Converting data from one data type to another. For example:
- Convert a string to a date:
CAST(column_name AS DATE); - Convert an integer to a string:
CAST(column_name AS STRING);
- Convert a string to a date:
-
Data aggregation: Summarizing and computing statistics on data, such as sum, average, and count. For example:
- Calculate the sum:
SELECT SUM(column_name) FROM table_name; - Calculate the average:
SELECT AVG(column_name) FROM table_name;
- Calculate the sum:
-
Data merging: Combining data from different tables or data sources. For example:
- Merge data using a join:
SELECT a.*, b.* FROM table_a a JOIN table_b b ON a.id = b.id; - Merge data using a union:
SELECT column_name FROM table_a UNION SELECT column_name FROM table_b;
- Merge data using a join:
-
Data filtering: Selecting data that meets certain conditions. For example:
- Retrieve data matching a specific condition:
SELECT * FROM table_name WHERE column_name = 'value';
- Retrieve data matching a specific condition:
Data Model
TPC-H data represents the data warehouse of an automotive parts supplier, recording orders, items that make up the orders (lineitem), suppliers, customers, parts sold (part), regions, countries, and parts suppliers (partsupp).
Singdata Lakehouse has built-in shared TPC-H data that every user can query directly by adding the data context, for example:
Use Cases
- Data warehouse construction: When building a data warehouse, you need to extract, transform, and load data from different sources into the warehouse to ensure consistency and high quality.
- Business intelligence and data analysis: For BI and data analysis, raw data needs to be transformed so that analysis tools can efficiently parse and display it.
- Data migration: During data migration, data needs to be transformed to ensure structural and quality consistency between the old and new systems.
- Compliance and data governance: Ensuring data meets industry standards and regulatory requirements often requires data cleansing and transformation.
- Real-time data processing: In real-time and stream data processing, data needs to be transformed to support real-time analysis and decision-making.
SQL data transformation is a core aspect of data processing that helps organizations improve data quality and utilization efficiency, supporting better decision-making and business operations.
Basic Data Transformation with Singdata Lakehouse SQL
Retrieving Data
Use SELECT...FROM, LIMIT, WHERE, and ORDER BY to read the required data from a table.
The most common use of queries is to read data from a table. You can use the SELECT ... FROM statement as shown below.
However, running SELECT ... FROM without a limit can be problematic when the dataset is large.
To retrieve rows that meet specific conditions, use the WHERE clause. You can specify one or more filter conditions in the WHERE clause.
The WHERE clause can combine multiple filter conditions using AND and OR, as shown below.
You can combine multiple filter conditions as shown above. The equal (=) and greater than (>) conditional operators have already been demonstrated. There are 6 conditional operators in total:
- < Less than
- > Greater than
- <= Less than or equal to
- >= Greater than or equal to
- = Equal to
- <> and != both mean not equal to (some databases only support one of them)
For string types, you can use the LIKE condition for pattern matching. In a LIKE condition, _ represents any single character and % represents zero or more characters, for example:
You can also use IN and NOT IN to filter multiple values.
Use count(*) to get the number of rows in a table, as shown below.
To sort rows by the values of a specific column, use ORDER BY, for example:
Join
Use JOINs to combine data from multiple tables (there are different types of JOINs).
You can use joins to combine data from multiple tables. The format for writing a join query is as follows.
The first table specified (table_a) is the left table, and the second is the right table. When joining multiple tables, the joined dataset of the first two tables is treated as the left table and the third table as the right table (the database optimizes the join order for performance).
There are five main join types:
1. Inner Join (default): Only returns rows that exist in both tables
Note: The join defaults to an inner join.
The output includes records where at least one matching row is found in both orders and line items (same o_orderkey and order date within 5 days before or after the ship date).
You can also see that there are 24,792,743 matching rows across the orders and line items tables.
2. Left Outer Join (also known as Left Join): Returns all rows from the left table and matching rows from the right table
The output includes all rows from the orders table and records from the line items table that have at least one matching row (same o_orderkey and order date within 5 days before or after the ship date).
You can also see that the orders table has 151,947,677 rows and the line items table has 24,792,743 rows. The orders table has 1,500,000 rows, but the join condition produces 151,947,677 rows because some orders match multiple line items.
3. Right Outer Join (also known as Right Join): Returns matching rows from the left table and all rows from the right table
The output includes records from the orders table where at least one matching row is found (same o_orderkey and order date within 5 days before or after the ship date) and all rows from the line items table.
You can also see that the orders table has 24,792,743 rows and the line items table has 600,037,902 rows.
4. Full Outer Join: Returns all rows from both the left and right tables
The output includes records from the orders table where at least one matching row is found (same o_orderkey and order date within 5 days before or after the ship date) and all rows from the line items table.
You can also see that the orders table has 151,947,677 rows and the line items table has 600,037,902 rows.
5. Cross Join: Returns the Cartesian product of all rows
The output includes every row from the nation table joined with every row from the region table. There are 25 nations and 5 regions, so the cross join produces 125 rows.
Sometimes you need to join a table with itself, which is called a self-join.
Example:
- For each customer order, get the orders placed earlier in the same week (Sunday through Saturday, not the previous seven days). Only show customer orders that have at least one such earlier order.
Most analytical queries require computing metrics that involve multiple rows. GROUP BY lets you perform aggregate calculations on sets of rows grouped by specified column values.
Example:
- Create a report showing the number of orders for each order priority segment.
In the above query, rows are grouped by orderpriority and count(*) is applied to each group. The output contains one row per unique orderpriority value along with the count(*) result.
Allowed aggregate functions are typically SUM/MIN/MAX/AVG/COUNT. Some databases offer more complex aggregate functions; refer to your database documentation for details.
Subqueries
Use subqueries to use the result of one query inside another query.
When you want to use the result of one query as a table in another query, use a subquery. Example:
- Create a report showing each country, the number of items supplied by suppliers from that country, and the number of items purchased by customers from that country.
In the above query, there are two subqueries: one calculates the number of items supplied by each country's suppliers, and the other calculates the number of items purchased by each country's customers.
CASE WHEN
Use the CASE statement to replicate IF/ELSE logic.
You can apply conditional logic in the SELECT ... FROM part of a query, as shown below.
You can see how different values are displayed based on the totalprice column. You can also use multiple conditions as criteria (for example, totalprice > 100000 AND orderpriority = '2-HIGH').
Standard Functions
Use standard built-in database functions for common string, time, and numeric data operations.
When processing data, you often need to modify column values. Here are some standard functions to know:
-
String functions
- LENGTH calculates the length of a string. For example,
SELECT LENGTH('hi');outputs 2. - CONCAT combines multiple string columns into one. For example,
SELECT CONCAT(o_orderstatus, '-', o_orderpriority) FROM clickzetta_sample_data.tpch_100g.orders LIMIT 5;concatenates the o_orderstatus and o_orderpriority columns with a hyphen separator. - SPLIT splits a value into an array based on a given delimiter. For example,
SELECT SPLIT(o_orderpriority, '-') FROM clickzetta_sample_data.tpch_100g.orders LIMIT 5;outputs an array formed by splitting o_orderpriority at-. - SUBSTRING extracts a substring from a value given start and end character indices. For example,
SELECT o_orderpriority, SUBSTRING(o_orderpriority, 1, 5) FROM clickzetta_sample_data.tpch_100g.orders LIMIT 5;returns the first five characters (positions 1–5) of o_orderpriority. - TRIM removes spaces from both sides of a value. For example,
SELECT TRIM(' hi ');outputshiwithout surrounding spaces. LTRIM and RTRIM are similar but only remove spaces from the beginning or end of the string, respectively.
- LENGTH calculates the length of a string. For example,
-
Date and time functions
-
Adding and subtracting dates: Used to add and subtract time periods; the exact format depends on the database. In Lakehouse,
datedifftakes 3 parameters — the output unit (day, month, year) and date/time values a and b — and returns a - b.+ INTERVAL n UNIT(DAY/MONTH/YEAR)adds the specified unit value to a timestamp.
This displays the difference between two dates in the specified time unit. You can also add or subtract any time period from a date/time column. For example,
SELECT DATE '2022-11-05' + INTERVAL '10' DAY;outputs2022-11-15(try date subtraction as well). -
-
String <=> Date/Time conversion
When you want to change the data type of a string to date/time, use the DATE 'YYYY-MM-DD' or TIMESTAMP 'YYYY-MM-DD HH:mm:SS' functions. If the data is in a non-standard date/time format such as MM/DD/YYYY, you need to specify the input structure using date_format, for example:
You can also use date_format to convert a timestamp/date into a string of the desired format. For example:
See this page for how to set the correct date and time format.
- Time frame functions (YEAR/MONTH/DAY): When you want to extract a specific time component from a date/time column, use these functions. For example,
SELECT year(DATE '2023-11-05');returns 2023. Similarly, there are functions for month, day, hour, minute, and more.
See this page for more time functions.
-
Numeric functions
- ROUND specifies the number of decimal places. For example,
SELECT ROUND(100.102345, 2); - ABS returns the absolute value of a number. For example,
SELECT ABS(-100), ABS(100); - Mathematical operators such as +, -, *, /.
- CEIL/FLOOR returns the next higher or nearest lower integer for a given decimal. For example,
SELECT CEIL(100.1), FLOOR(100.1);
- ROUND specifies the number of decimal places. For example,
