Start using SQL CTEs today!
10 minutes
“You spend more time thinking than typing when you work with SQL.”
Every data professional should understand SQL databases and how to query data. A very important and useful feature of SQL is the use of SQL CTEs.
Common Table Expressions (CTE) are temporary results created during the processing of SQL scripts. They help simplify the code by splitting a code into smaller blocks.
Each CTE in an SQL script is executed sequentially, and the results of these are not stored in your database, but are used by other CTEs in the same SQL script or by the result of the SQL script.
The CTE syntax will make it easier to understand visually.
Let’s take an example : we have a data table, and we want to perform the following transformations on this data :
Select the latest data from this table based on date
Rename all columns names
Create new columns
Deduplicate the data rows based on the primary key column
Filter all rows using a condition
An amateur SQL programmer will create one long and complex query to complete all these tasks; something like this :
SELECT LOWER(SAFE_CAST(UserName AS STRING) AS user_name , PrimaryKey AS primary_key , EventTimestamp AS event_timestamp , Email AS email , IF '@' IN Email THEN 'valid email' ELSE 'invalid email' END AS is_email_valid , SPLIT(UserName, '')[0] AS first_name FROM data_table WHERE DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) AND LOWER(SAFE_CAST(UserName AS STRING) IS NOT NULL QUALIFY row_number() OVER(PARTITION BY PrimaryKey ORDER BY EventTimestamp DESC) = 1
It is not easy to understand such code, and as organisations deal with more complex cases than this simple example, writing and maintaining such SQL scripts is not a good practice. This SQL code can be made simpler using CTEs.
Syntax :
WITH cte_1 AS ( (SQL code for CTE 1) ) , cte_2 AS ( (SQL code for CTE 2) ) , cte_3 AS ( (SQL code for CTE 3) ) SELECT * FROM cte_2 LEFT JOIN SELECT * FROM cte_3 USING (joinin_key)
In this syntax, a long SQL script is divided into 3 smaller blocks. This makes it easier to assign each block (CTE), a specific task. Let’s create a better SQL query for our previous example.
The 5 tasks we want our SQL script to perform are :
Select the latest data from this table based on date
Rename all columns names
Create new columns
Deduplicate the data rows based on the primary key column
Filter all rows using a condition
WITH latest_data_cte AS ( SELECT * FROM data_table WHERE DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) ) , renamed_columns_cte AS ( SELECT LOWER(SAFE_CAST(UserName AS STRING) AS user_name , PrimaryKey AS primary_key , EventTimestamp AS event_timestamp , Email AS email FROM latest_data_cte ) , new_columns_cte AS ( SELECT * , IF '@' IN email THEN 'valid email' ELSE 'invalid email' END AS is_email_valid , SPLIT(user_name, '')[0] AS first_name FROM renamed_columns_cte ) , deduplicated_data_cte AS ( SELECT * FROM new_columns_cte QUALIFY row_number() OVER(PARTITION BY primary_key ORDER BY event_timestamp DESC) = 1 ) SELECT * FROM deduplicated_data_cte WHERE user_name IS NOT NULL
By using CTEs, we created a block for each task. Now if we see duplicate rows in our table, we know which part of the script to debug!
To follow the implementation with a live coding example, I recommend this tutorial by Patrick from Guy in a Cube: