Best practices in creating Analytics Pipelines
This post discusses ETL (Extract, Transform, Load) best practices for deriving value from data through analytics and insights engineering. I’ll cover best practices for building robust data engineering pipelines in a dedicated post soon.
With the decreasing cost of cloud storage, companies now store vast amounts of data, aiming to extract value from it as needed. This allows engineering teams to build complex analytics pipelines for reports, dashboards, AI, and business analysis. Data can answer very specific business questions, but doing so usually involves a trade-off between complexity and readability in the analytics pipelines and ETL processes.
Before embarking on any analytics project, a good analytics engineer should consider several key questions. For this discussion, we’ll assume that the raw data has already been ingested into a data warehouse or data lake.
1. Can I partition the source data on dates?
For event tables (transactions, user interactions, subscriptions or registrations with dates), it is always worth creating a date-partitioned table before using the data.
This is the first step in the data model / data pipeline.
Partitioning also lets us decide how much history to keep: for example, we can add new data as new partitions every day or week, and delete partitions older than two years.
For reference tables (such as tables with user or product data) that have no date column, you can do a “full refresh” every day (full refresh = recreate the table every day with all the data).
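As a rough sketch of a daily full refresh in BigQuery (the dataset and table names `analytics_raw.users_raw` and `crm.users` are hypothetical, not from the original setup):

```sql
-- Full refresh: rebuild the reference table from scratch every day.
CREATE OR REPLACE TABLE `analytics_raw.users_raw` AS
SELECT
  user_id,
  country,
  created_at
FROM `crm.users`;
```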
As a result, the first step in a data pipeline or data model is to provide the partitioned or unpartitioned data sources in the data warehouse (e.g. BigQuery, Snowflake).
For event data, it is better to build incremental tables to which we append the latest data every day.
Nomenclature is important for telling the different tables and views apart.
TABLE_SOURCE or TABLE_RAW are good names for identifying tables as raw / source tables.
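To illustrate, here is a minimal BigQuery sketch of a date-partitioned raw event table with roughly two years of retention and a daily incremental load; all table and column names are assumptions for the example:

```sql
-- Date-partitioned raw table; partitions older than ~2 years expire automatically.
CREATE TABLE IF NOT EXISTS `analytics_raw.transactions_raw` (
  transaction_id STRING,
  user_id        STRING,
  amount         NUMERIC,
  event_date     DATE,
  loaded_at      TIMESTAMP   -- ingestion timestamp, used later for deduplication
)
PARTITION BY event_date
OPTIONS (partition_expiration_days = 730);

-- Daily incremental load: append only yesterday's partition.
INSERT INTO `analytics_raw.transactions_raw`
SELECT
  transaction_id,
  user_id,
  amount,
  event_date,
  CURRENT_TIMESTAMP() AS loaded_at
FROM `landing.transactions`
WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```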
2. Do I have duplicate rows in my table?
Now the source data is in the data warehouse, but it may contain duplicates.
The second step is to de-duplicate the source table.
This is also a good place to cast columns to the correct data types, so the deduplication compares values consistently.
The same transaction may appear in both today's and yesterday's data; in that case we keep today's latest row.
Or the same row may simply appear twice, in which case we keep only one of the two so that every row is unique.
Once the rows in the table are unique, we must define a primary key for the table.
For example, in an interactions table, each interaction must have an interaction_id.
Likewise, in a users table, each row must have a user_id.
For event tables, it is important to deduplicate only the latest partitions and append them to the deduplicated table, rather than reprocessing the full history.
Now we have de-duplicated the data and cast the columns to the correct data types.
Table nomenclature is important, for example:
DATA_DEDUP / SOURCE_DEDUP
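A minimal sketch of this deduplication step, continuing the hypothetical BigQuery example above: only yesterday's partition is rebuilt, columns are cast to their target types, and the most recently ingested row wins for each transaction_id (the primary key).

```sql
-- Deduplicated table, partitioned like the raw table.
CREATE TABLE IF NOT EXISTS `analytics_clean.transactions_dedup` (
  transaction_id STRING,   -- primary key: one row per transaction
  user_id        STRING,
  amount         NUMERIC,
  event_date     DATE
)
PARTITION BY event_date;

-- Rebuild only yesterday's partition instead of reprocessing all history.
DELETE FROM `analytics_clean.transactions_dedup`
WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);

INSERT INTO `analytics_clean.transactions_dedup`
SELECT
  CAST(transaction_id AS STRING) AS transaction_id,
  CAST(user_id AS STRING)        AS user_id,
  CAST(amount AS NUMERIC)        AS amount,
  event_date
FROM `analytics_raw.transactions_raw`
WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
-- Keep only the most recently ingested row for each transaction_id.
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY transaction_id
  ORDER BY loaded_at DESC
) = 1;
```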
3. Do I need to filter or unify the tables?
Sometimes we don't need all the rows, so we can filter out unwanted data now.
For example, if I only want users from a certain country, or only the purchases from the last month, I filter them in this step.
It is better to filter the data here than in the sources (steps 1 and 2): if we later need users from other countries or six months of purchase history, we don't have to recreate the sources or back-fill them (back-fill = add old data to a table), because that data is still available in the source and deduplicated tables.
If I need to merge several tables into a single table because they share the same columns or the same kind of data, I can also do it here.
For example, I can combine different user interactions (impressions, purchases and clicks) into a single interactions table, as in the sketch below.
Or I can combine user data from two different tables because they hold the same type of information.
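As a sketch, unifying several deduplicated interaction tables into one view could look like this (the three source tables and the event_type labels are hypothetical):

```sql
-- One unified interactions view on top of the deduplicated event tables.
CREATE OR REPLACE VIEW `analytics_clean.interactions_view` AS
SELECT 'impression' AS event_type, user_id, event_date FROM `analytics_clean.impressions_dedup`
UNION ALL
SELECT 'click'      AS event_type, user_id, event_date FROM `analytics_clean.clicks_dedup`
UNION ALL
SELECT 'purchase'   AS event_type, user_id, event_date FROM `analytics_clean.purchases_dedup`;
```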
We can also compute new columns in this step.
You can also rename columns if necessary.
It is better to create views here rather than tables, so that queries still take advantage of the partitioned tables we built in step 2.
The tables and views we have after this third step are the sources for all the analytics we calculate from now on.
So in this step we merge different tables into one where needed, compute new columns where needed, and rename columns according to the business and analytics requirements.
Nomenclature is important, for example:
DATA_VIEW / DATA_WAREHOUSE / DATA_CLEANED
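For example, a cleaned view over the hypothetical deduplicated transactions could filter, rename and derive columns in one place (a sketch; the country filter and the 20% VAT rate are illustrative assumptions):

```sql
-- Cleaned, filtered view used as the source for downstream analytics.
-- Because it is a view, queries on it still prune partitions of the underlying table.
CREATE OR REPLACE VIEW `analytics_clean.transactions_fr_view` AS
SELECT
  t.transaction_id,
  t.user_id,
  t.amount        AS transaction_amount,   -- renamed for business readability
  t.amount * 0.20 AS estimated_vat,        -- example of a computed column
  t.event_date    AS transaction_date
FROM `analytics_clean.transactions_dedup` AS t
JOIN `analytics_raw.users_raw` AS u USING (user_id)
WHERE u.country = 'FR';                    -- keep only the market we need
```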
4. Once we have created the source tables, deduplicated tables and ready-to-use views, we can build data models, data pipelines and more complex KPIs / insights / analytics.
Before using the data, we are sure that we have the right columns and data types, no duplicates, and incremental sources.
Now, for each new pipeline, KPI or analysis we need to calculate, the source data is ready.
Before creating new pipelines or data models, you must now ask:
Which views / tables should I use?
Is the final table incremental? If yes, I calculate the KPIs on the latest partitions and append them to the final table; otherwise I use all the data to create the final table or view.
Which KPIs should I calculate? Do I define them, or does the client?
How many steps / tasks / data models do I need to create before the final table or view? How do I explain each step in the pipeline?
Sometimes it is good to create a dedicated dataset for each analytics pipeline, so that all the tables and views created by the pipeline live together in the same dataset.
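A minimal sketch of such an incremental KPI step, in its own hypothetical dataset `analytics_revenue`: compute yesterday's KPI from the cleaned view and merge it into the final partitioned table.

```sql
-- Final KPI table: daily revenue, appended incrementally one partition at a time.
CREATE TABLE IF NOT EXISTS `analytics_revenue.daily_revenue` (
  transaction_date DATE,
  country          STRING,
  revenue          NUMERIC
)
PARTITION BY transaction_date;

MERGE `analytics_revenue.daily_revenue` AS target
USING (
  SELECT
    transaction_date,
    'FR' AS country,                      -- the view is already filtered to FR
    SUM(transaction_amount) AS revenue
  FROM `analytics_clean.transactions_fr_view`
  WHERE transaction_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  GROUP BY transaction_date
) AS source
ON  target.transaction_date = source.transaction_date
AND target.country = source.country
WHEN MATCHED THEN
  UPDATE SET revenue = source.revenue     -- reruns simply overwrite the same day
WHEN NOT MATCHED THEN
  INSERT (transaction_date, country, revenue)
  VALUES (source.transaction_date, source.country, source.revenue);
```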
The analytics are the last steps of the pipeline, after storing and cleaning the data. It is important to document and understand the business or product requirements before we start using the cleaned data warehouse to build analytics features.