Continuous Data Validation

Tinyclues

Introduction

Tinyclues migrated from an AWS-PySpark-Python stack to a GCP-dbt-SQL stack over six months for 100+ clients. The focus was on completing the migration quickly and enhancing the new stack afterwards. Initially, this speed came at the cost of data quality, exposing a crucial need for better data validation and monitoring.

Following the migration, data issues emerged:

  • Data Scientists: Reported reduced accuracy in ML models.

  • Product Managers: Noted inaccurate insights and analytics on product features.

  • Customer Success & Data Operations Teams: Communicated data issues faced by clients.

  • Overall decrease in product performance.

Example 1:

A data scientist found that users scored by the ML model were absent from the user database. The issue arose because the client had provided transaction data for users not listed in the user table. Consequently, the ingested data was unusable for experiments, and the client had to resend accurate data.

Example 2:

A client sent data through a BigQuery connection, but at 2:00AM, when our Airflow pipeline ran, the client's table was still empty, so the analytics pipelines consumed empty data. By 9:00AM, the client reported issues, prompting the tech team to investigate. An engineer re-triggered the pipeline at 9:30AM, retrieving the correct data and resolving the issue. The problem recurred for two nights until we rescheduled the pipeline to 8:00AM, since the client's table is updated by 6:00AM.

Problem Statement

Proper project scoping is crucial when developing new systems.

I set out to build a framework of tests, data monitoring dashboards, alerting, and data pipelines that would identify data issues, apply measures based on issue criticality, trigger alerts, and minimize manual intervention in issue resolution.

In the first two weeks, I met with stakeholders—data scientists, data operations managers, and product managers—to identify their pain points, recurring problems, and time-consuming tasks related to data issues. From these discussions, I grouped the issues into three priority categories:

  • Critical Issues: Problems that can break the platform.

  • Feature Issues: Problems that can break one or more product features.

  • Performance Issues: Problems that reduce performance without breaking features.

Based on the feedback, I scoped the following solutions:

  • Data Validity Tests: Implemented tests to verify data accuracy.

  • Enhanced Monitoring: Developed data pipelines and dashboards to improve data health monitoring.

  • Alerts: Set up Slack and JIRA notifications for data pipeline issues.

  • Blue-Green Deployment: Introduced a deployment process to load data into pre-production, perform checks, and then deploy to production.

  • Automation: Streamlined manual tasks through automated processes.

1. Data pipeline to store dbt artifacts in BigQuery

Each dbt execution generates JSON files with details about the execution process.

We require two key dbt artifacts:

  • run_results.json: Contains information on all dbt data models (.sql) executed by a dbt run command.

  • sources.json: Holds details about dbt sources and the freshness checks executed by a dbt source freshness command.

Architecture of the new pipeline:

After dbt execution, Airflow uploads the JSON files to a dedicated GCS bucket. The JSON is then converted to text and stored as a single row in a BigQuery table for various use cases.
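
How this might look inside the Airflow task: a minimal sketch, where the bucket name (dbt-artifacts) and destination table (project.monitoring.dbt_artifacts) are illustrative assumptions rather than the actual configuration.

```python
import json
from datetime import datetime, timezone

from google.cloud import bigquery, storage


def store_dbt_artifact(local_path: str, client_id: str) -> None:
    """Archive a dbt artifact JSON in GCS, then keep it as one BigQuery row."""
    run_ts = datetime.now(timezone.utc).isoformat()

    # 1. Upload the raw JSON file to a dedicated GCS bucket.
    bucket = storage.Client().bucket("dbt-artifacts")  # assumed bucket name
    bucket.blob(f"{client_id}/{run_ts}/{local_path}").upload_from_filename(
        local_path)

    # 2. Store the whole payload as text in a single BigQuery row, so it can
    #    later be unpacked with BigQuery's JSON functions in SQL.
    with open(local_path) as f:
        payload = json.dumps(json.load(f))

    errors = bigquery.Client().insert_rows_json(
        "project.monitoring.dbt_artifacts",  # assumed table
        [{"client_id": client_id, "loaded_at": run_ts,
          "artifact_name": local_path, "artifact_json": payload}],
    )
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```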

2. dbt tests

Based on stakeholder discussions, dynamic test cases were added to dbt to capture data quality issues effectively.

Two important test cases, both sketched after this list, were:

  • Event gaps test

    Designed to identify missing data in event records. For example, if a client misses sending purchase history for three days, the test alerts us about the gap on the fourth day. The test is customizable to account for different client data schedules (e.g., daily or weekday-only submissions) and can ignore historical gaps caused by exceptional circumstances, such as travel restrictions during COVID. It runs across all event tables.

  • Relationship ratio test

    This test ensures that user data is present for all user_ids involved in purchase transactions, email clicks, opens, and unsubscribes in the CRM data. It checks the intersection of users in the user database with other tables to prevent data loss and maintain the integrity of transaction and email event data. By confirming that user data is consistent across datasets, dbt tests help ensure that data-driven applications function correctly.
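
In production both checks live as dbt tests (SQL with Jinja). The sketch below renders the underlying SQL logic in Python against BigQuery; the table names (project.crm.purchases, project.crm.users) and the default thresholds are assumptions for illustration.

```python
from google.cloud import bigquery

EVENT_GAPS_SQL = """
WITH daily AS (
  SELECT DISTINCT DATE(event_ts) AS event_date
  FROM `project.crm.purchases`              -- assumed event table
)
SELECT
  event_date,
  DATE_DIFF(event_date,
            LAG(event_date) OVER (ORDER BY event_date), DAY) AS gap_days
FROM daily
QUALIFY gap_days > @max_gap_days            -- tolerated gap, per-client config
"""

RELATIONSHIP_RATIO_SQL = """
SELECT COUNTIF(u.user_id IS NOT NULL) / COUNT(*) AS match_ratio
FROM `project.crm.purchases` p
LEFT JOIN `project.crm.users` u USING (user_id)
"""


def run_data_checks(max_gap_days: int = 3, min_ratio: float = 0.99) -> None:
    """Fail loudly when an event gap or a broken user relationship appears."""
    bq = bigquery.Client()

    # Event gaps test: any pair of consecutive event dates further apart
    # than the tolerated gap comes back as a row.
    job_config = bigquery.QueryJobConfig(query_parameters=[
        bigquery.ScalarQueryParameter("max_gap_days", "INT64", max_gap_days)])
    gaps = list(bq.query(EVENT_GAPS_SQL, job_config=job_config))
    if gaps:
        raise AssertionError(f"Event gaps detected: {gaps}")

    # Relationship ratio test: share of purchase user_ids present in the
    # user table; below the threshold, downstream features would lose data.
    ratio = next(iter(bq.query(RELATIONSHIP_RATIO_SQL)))["match_ratio"]
    if ratio < min_ratio:
        raise AssertionError(
            f"Only {ratio:.2%} of purchasing users exist in the user table")
```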

3. Data Health Dashboards

Using dbt artifacts and the file ingestion stack, I developed a dashboard for internal teams to monitor data quality. The dashboard provides both an overall and detailed view of data health, including the ingestion status of raw files sent by clients.

Key features include:

  • Real-Time Monitoring: Tracks daily variations in relationship ratios, nulls, duplicates, event gaps, and other key column-level metrics.

  • Priority Scoring: Utilizes a rule-based algorithm to assign a priority score (0-3) and categorize each client as ‘HEALTHY,’ ‘WARNING,’ or ‘CRITICAL’ based on their data quality issues (a sketch of such a rule follows this list).

  • Issue Reporting: Includes links to email templates for different data issues, allowing the data operations team to contact clients with custom reports.

  • Product Integration: Enables the product team to set UI banners to alert users about deteriorating product performance linked to clients with ‘CRITICAL’ issues.
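
A minimal sketch of what such a rule-based scorer might look like, mapping the three issue categories from the problem statement to a score and label; the exact rules and thresholds here are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ClientDataHealth:
    critical_issues: int     # problems that can break the platform
    feature_issues: int      # problems that can break product features
    performance_issues: int  # problems that only degrade performance


def priority_score(health: ClientDataHealth) -> tuple[int, str]:
    """Map open issue counts to a 0-3 score and a health label."""
    if health.critical_issues > 0:
        return 3, "CRITICAL"
    if health.feature_issues > 0:
        return 2, "WARNING"
    if health.performance_issues > 0:
        return 1, "WARNING"
    return 0, "HEALTHY"


# e.g. priority_score(ClientDataHealth(0, 1, 4)) returns (2, "WARNING")
```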

4. Slack alerts

  • Airflow Pipeline Alerts: Integrated failure alerts into a dedicated Slack channel using a Slack webhook (sketched below).

  • dbt Test Failure Alerts: Routed test failure alerts to a JIRA board, prioritizing clients based on the number and severity of issues.
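
A minimal sketch of the Airflow side, assuming a standard Slack incoming webhook; the webhook URL is a placeholder.

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder


def notify_slack_on_failure(context: dict) -> None:
    """Airflow on_failure_callback: post task failure details to Slack."""
    ti = context["task_instance"]
    text = (f":red_circle: {ti.dag_id}.{ti.task_id} failed "
            f"(run {context['run_id']}) - logs: {ti.log_url}")
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)


# Attached DAG-wide via default_args:
# default_args = {"on_failure_callback": notify_slack_on_failure}
```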

5. Blue-green deployment architecture

Blue-Green deployment is a software engineering practice where code, data, or application changes are tested in a production-like environment before being released into production. The release occurs only after the pre-production environment passes all checks and tests.

In our dbt implementation, data was written directly to the production environment before running tests. This meant that any issues identified by the tests would have already affected production data.

To address this, a blue-green deployment approach was proposed: data is first written to a pre-production environment, where tests are performed before moving the data to production.

Implementation
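
A minimal sketch of how the flow can be orchestrated, assuming a dedicated preprod dbt target that writes to a pre-production dataset and a promote step that copies validated tables into production; the dataset names are illustrative.

```python
import subprocess

from google.cloud import bigquery


def dbt(*args: str) -> None:
    """Run a dbt CLI command; raises CalledProcessError on failure."""
    subprocess.run(["dbt", *args], check=True)


def promote(tables: list[str]) -> None:
    """Copy validated pre-production tables over their production twins."""
    bq = bigquery.Client()
    for table in tables:
        job = bq.copy_table(
            f"project.preprod.{table}",
            f"project.prod.{table}",
            job_config=bigquery.CopyJobConfig(
                write_disposition="WRITE_TRUNCATE"),
        )
        job.result()  # block until the copy finishes


def blue_green_release(tables: list[str]) -> None:
    dbt("run", "--target", "preprod")   # build everything in pre-production
    dbt("test", "--target", "preprod")  # any failure aborts before release
    promote(tables)                     # reached only if all checks pass
```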

Results

  • Reduced Downtime: Downtime for clients decreased by 70%, with product issues dropping from three per month to about one, thanks to enhanced data visibility and proactive problem detection via the data health dashboards.

  • Improved DataOps Efficiency: Manual tasks were reduced by 60% thanks to automated data pipelines and dashboards that included email templates with all necessary information. The need for ad-hoc querying decreased.

  • Enhanced Product Features: The product team utilized the new framework to integrate client data health directly into the platform.

  • Reduced Human Intervention: A rule-based algorithm prioritized data issues, minimizing the need for manual coordination within the data operations team.

  • Addressed Data Scientist Pain Points: New tests effectively tackled issues encountered by data scientists during ML model training.

  • Adaptive Testing: Smart tests adjusted thresholds based on previous dbt run data, such as comparing actionable user counts day-over-day to detect significant irregularities (a sketch follows this list).
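
As an illustration of the adaptive day-over-day check, a minimal sketch; the metrics table and the 30% tolerance are assumptions.

```python
from google.cloud import bigquery

DAY_OVER_DAY_SQL = """
SELECT
  run_date,
  actionable_users,
  LAG(actionable_users) OVER (ORDER BY run_date) AS previous
FROM `project.monitoring.daily_metrics`   -- assumed metrics table
ORDER BY run_date DESC
LIMIT 1
"""


def check_day_over_day(tolerance: float = 0.30) -> None:
    """Alert when today's actionable-user count swings too far from yesterday's."""
    row = next(iter(bigquery.Client().query(DAY_OVER_DAY_SQL)))
    if row["previous"] is not None and (
            abs(row["actionable_users"] - row["previous"])
            > tolerance * row["previous"]):
        raise AssertionError(
            f"Actionable users moved from {row['previous']} to "
            f"{row['actionable_users']} in one day")
```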

Next

Configuration Reshape