Start using SQL CTEs today!

SQL

10 minutes

“You spend more time thinking than typing when you work with SQL.”

Every data professional should understand SQL databases and how to query data. A very important and useful feature of SQL is the use of SQL CTEs.

Common Table Expressions (CTE) are temporary results created during the processing of SQL scripts. They help simplify the code by splitting a code into smaller blocks.

Each CTE in an SQL script is executed sequentially, and the results of these are not stored in your database, but are used by other CTEs in the same SQL script or by the result of the SQL script.

The CTE syntax will make it easier to understand visually.

Let’s take an example : we have a data table, and we want to perform the following transformations on this data :

  • Select the latest data from this table based on date

  • Rename all columns names

  • Create new columns

  • Deduplicate the data rows based on the primary key column

  • Filter all rows using a condition

An amateur SQL programmer will create one long and complex query to complete all these tasks; something like this :

SELECT
  LOWER(SAFE_CAST(UserName AS STRING) AS user_name
  , PrimaryKey AS primary_key
  , EventTimestamp AS event_timestamp
  , Email AS email
  , IF '@' IN Email THEN 'valid email'
      ELSE 'invalid email'
    END AS is_email_valid
  , SPLIT(UserName, '')[0] AS first_name
FROM data_table
WHERE DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
AND LOWER(SAFE_CAST(UserName AS STRING) IS NOT NULL
QUALIFY row_number() OVER(PARTITION BY PrimaryKey ORDER BY EventTimestamp DESC) = 1

It is not easy to understand such code, and as organisations deal with more complex cases than this simple example, writing and maintaining such SQL scripts is not a good practice. This SQL code can be made simpler using CTEs.

Syntax :

WITH cte_1 AS (

  (SQL code for CTE 1)

)

, cte_2 AS (

  (SQL code for CTE 2)

)

, cte_3 AS (

  (SQL code for CTE 3)

)

SELECT *
FROM cte_2
LEFT JOIN
SELECT *
FROM cte_3
USING (joinin_key)

In this syntax, a long SQL script is divided into 3 smaller blocks. This makes it easier to assign each block (CTE), a specific task. Let’s create a better SQL query for our previous example.

The 5 tasks we want our SQL script to perform are :

  • Select the latest data from this table based on date

  • Rename all columns names

  • Create new columns

  • Deduplicate the data rows based on the primary key column

  • Filter all rows using a condition

WITH latest_data_cte AS (

SELECT *
FROM data_table
WHERE DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)

)

, renamed_columns_cte AS (

SELECT LOWER(SAFE_CAST(UserName AS STRING) AS user_name
, PrimaryKey                               AS primary_key
, EventTimestamp                           AS event_timestamp
, Email                                    AS email
FROM latest_data_cte

)

, new_columns_cte AS (

SELECT *
, IF '@' IN email THEN 'valid email'
    ELSE 'invalid email'
  END                                  AS is_email_valid
, SPLIT(user_name, '')[0]              AS first_name
FROM renamed_columns_cte

)

, deduplicated_data_cte AS (

SELECT *
FROM new_columns_cte
QUALIFY row_number() OVER(PARTITION BY primary_key ORDER BY event_timestamp DESC) = 1

)

SELECT *
FROM deduplicated_data_cte
WHERE user_name IS NOT NULL

By using CTEs, we created a block for each task. Now if we see duplicate rows in our table, we know which part of the script to debug!

To follow the implementation with a live coding example, I recommend this tutorial by Patrick from Guy in a Cube:

Previous
Previous

Binging on content : Data driven

Next
Next

Data Engineering vs Data Science vs Data Analytics