A book that focuses on data quality for data pipeline ETLs —

data observability and data reliability too!

O’Reilly Media, Inc., 2022-09-02, ISBN: 978-1-098-11204-2

Many books and tutorials have been written about “data quality” and what it actually means. This book, however, takes singular aim at projects such as data pipelines, data warehousing, data integration, business intelligence/analytics, data lakes, big data, and other types of data ETL. The authors handle it all smartly.

The book reiterates that many data engineering teams face a “good pipelines, bad data” problem: solid data pipeline infrastructure, but often with bad data flowing through it.

I have searched long and hard for a book like this to guide me in data pipeline quality and testing, and this is the most comprehensive one I have found.

Integrating data from multiple sources and then cleaning up, aggregating, and transforming a lot of that data to meet operational needs is the difference between just having data and getting value out of it.

Attention to data pipeline quality is not new. Increasingly, companies are embracing the best methods of data development, testing, and observation to realize the full value of their investments in data, analytics, and machine learning applications.

Data Quality Fundamentals: A Practitioner’s Guide to Building Trustworthy Data Pipelines is one of the few books that address data pipeline quality efforts and explain them in light of the modern data stack. Readers learn about common considerations and critical decision points when responding to data pipeline quality challenges.

The book’s setup

“In this era when organizations are increasingly dependent on advanced data initiatives such as AI, ML, data science, and business intelligence, there’s a need to ensure that the data behind these technologies can be trusted. The timeliness, completeness, and cleanliness of data can make or break the fortunes of data-driven enterprises.”

This book is for everyone who has suffered from unreliable data pipelines and ETL projects – and those who want to do something about it. The authors address readers such as data engineers, ETL developers, data analysts, and others on data science projects who are actively involved in building, scaling, and managing their company’s data pipelines.

The authors deliver details on confronting data quality at scale by leveraging best practices for tool selection, data testing, monitoring, reliability, and observability used by the world’s most innovative companies:

Build more trustworthy and reliable data pipelines

Run data observability scripts and tools to execute data checks and identify broken or faulty pipelines

Learn to set and maintain data SLAs, SLIs, and SLOs (a brief illustrative sketch follows this list)

Develop and lead data quality initiatives

Learn how to create data services and systems for production support

Automate data lineage graphs across end-to-end data ecosystems

Build anomaly detectors to monitor and observe data assets
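
To make the SLA/SLI/SLO item concrete, here is a minimal sketch of my own, not taken from the book: a table-freshness check in which the six-hour threshold and the load timestamp are hypothetical, and the SLI is simply “how stale is the table right now.”

from datetime import datetime, timedelta, timezone

# Hypothetical SLO: this table must be refreshed at least every 6 hours.
FRESHNESS_SLO = timedelta(hours=6)

def freshness_sli(last_loaded_at: datetime) -> timedelta:
    # The SLI is the table's current staleness.
    return datetime.now(timezone.utc) - last_loaded_at

def meets_freshness_slo(last_loaded_at: datetime) -> bool:
    # Compare the measured SLI against the agreed SLO.
    staleness = freshness_sli(last_loaded_at)
    if staleness > FRESHNESS_SLO:
        print(f"SLO breach: data is {staleness} old (limit {FRESHNESS_SLO})")
        return False
    print(f"SLO met: data is {staleness} old")
    return True

# Example: a load timestamp eight hours ago breaches the 6-hour SLO.
meets_freshness_slo(datetime.now(timezone.utc) - timedelta(hours=8))

In practice the load timestamp would come from warehouse metadata, and a breach would page the on-call data engineer rather than print to stdout.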

A summary and excerpts from 10 chapters in this nearly 300-page book

• Why Data Pipeline Quality Deserves Attention Now. The authors explain why and how architectural and technological trends have contributed (in some ways) to an overall decline in data quality, governance, and reliability.

“The rise of the cloud, distributed data architectures and teams, and the move towards quicker data production have put the onus on data leaders to help their companies drive towards more reliable data (leading to more trustworthy analytics). Achieving reliable data is a marathon, not a sprint, and involves several steps in your data pipeline. Further, committing to data quality advancements is much more than a technical challenge; it’s very much organizational and cultural too.”

• Assembling the Building Blocks of Reliable Data Systems. This chapter describes constructing more resilient and testable data system architectures by examining data quality measures across key data pipeline technologies, including data warehouses and data lakes.

“To achieve truly discoverable data, it’s important that your data is not just ‘cataloged,’ but also accurate, clean, and fully observable from ingestion to consumption—in other words, reliable. Only by understanding your data, the state of your data, and how it’s being used—at all stages of its life cycle, across domains—can we even begin to trust it.”

• Data Pipeline Collection, Cleanup, Transformation, and Testing. Readers are presented with a planning process for data cleaning and transformation with quality and reliability in mind: mapping sources to targets, monitoring pre- and post-production data for issues, observability, data quality logs, and thorough data cataloging to serve as an inventory of metadata.

“Tackling data downtime isn’t just responding to stakeholders when null values are discovered in downstream dashboards. Data downtime can be proactively avoided by integrating data quality checks at every stage of each data pipeline, from the warehouse or lake to the BI reporting layer.”
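
Purely as an illustration of the kind of check described here (this sketch is mine, not the authors’; the table, column, and 1% threshold are invented), a null-rate test on a staging table might look like this:

import sqlite3

# Hypothetical null-rate check on one column of a staging table.
# The table/column names and the 1% threshold are illustrative only.
NULL_RATE_THRESHOLD = 0.01

def null_rate(conn, table, column):
    # Fraction of rows in `table` where `column` is NULL.
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) "
        f"FROM {table}"
    ).fetchone()
    return (nulls or 0) / total if total else 0.0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO stg_orders VALUES (?, ?)",
    [(1, 10), (2, None), (3, 11), (4, 12)],
)
rate = null_rate(conn, "stg_orders", "customer_id")
if rate > NULL_RATE_THRESHOLD:
    print(f"Check failed: {rate:.1%} of customer_id values are NULL")
else:
    print("Check passed")

The same shape of check can run at each pipeline stage, which is the point of the excerpt above: catch the nulls before they reach the dashboard.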

• Building End-to-End Data Lineage. Learn about incident “root cause analysis” as you discover how to build field-level data lineage using popular open-source tools that should be in every data engineer’s arsenal. Data lineage is the path that data takes through your data system, from creation, through all databases and transformation jobs, all the way down to final destinations like analytics dashboards and feature stores.

“We discuss what it takes to build more reliable data workflows, zeroing in on one key technology: lineage. As you work towards more reliable data, it’s hard to understand where you’re going if you don’t know where you’re starting from. Data lineage is the “map” of your data’s path that tells you what stages of the data pipeline are affected by data downtime.” 
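
The chapter builds lineage with specific open-source tooling; as a toy illustration of the underlying idea only (every field name and edge below is invented), field-level lineage is a directed graph that you can traverse to answer “what is downstream of this field?”:

from collections import defaultdict

# Hypothetical field-level lineage edges: source field -> derived field.
EDGES = [
    ("raw.orders.amount", "stg.orders.amount_usd"),
    ("raw.fx_rates.rate", "stg.orders.amount_usd"),
    ("stg.orders.amount_usd", "mart.revenue.daily_revenue"),
    ("mart.revenue.daily_revenue", "dashboard.exec_summary.revenue_chart"),
]

def downstream(field, edges=EDGES):
    # All fields that depend, directly or transitively, on `field`.
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
    seen, stack = set(), [field]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# If raw.orders.amount arrives broken, lineage tells us what is affected:
print(downstream("raw.orders.amount"))

That “map” is what lets you trace an incident from a broken dashboard back to the upstream field that caused it.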

• Monitoring and Anomaly Detection for Your Data Pipelines. Learn how to build and apply data quality monitors and observability tools.

“This chapter takes you on a short safari through monitoring and anomaly detection related to basic data quality checks. How can these concepts help us apply detectors to our production environments in data warehouses and lakes?”
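
The book’s detectors are more sophisticated than this, but as a minimal sketch of the idea (the row counts and the 3-standard-deviation threshold are invented), a volume monitor can flag a day whose row count deviates sharply from recent history:

import statistics

# Hypothetical daily row counts for one table; the last value is "today".
ROW_COUNTS = [10120, 9980, 10250, 10090, 10300, 9950, 10180, 4200]

def is_anomalous(history, latest, z_threshold=3.0):
    # Flag `latest` if it sits more than z_threshold standard deviations
    # from the mean of the historical values -- a very basic detector.
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold

history, latest = ROW_COUNTS[:-1], ROW_COUNTS[-1]
if is_anomalous(history, latest):
    print(f"Volume anomaly: {latest} rows vs recent mean {statistics.mean(history):.0f}")

The same pattern applies to freshness, schema changes, and distribution checks: compare today’s measurement against an expected range learned from history.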

• Democratizing Data Quality. Here, the authors discuss the cultural and organizational obstacles that data teams must overcome when evangelizing and “democratizing data quality” at scale, mapping out the roles and responsibilities involved. Topics include treating your data as a product, understanding your company’s personnel roles for data quality, and structuring the data team for maximum business impact.

“The democratization of data is a technical as well as a cultural process. Regardless of where you fall on the RACI matrix (i.e., a type of responsibility assignment matrix (RAM) in project management) of data personas, chances are that data quality plays an essential role in your ability to succeed as a data practitioner.”

Recommendation

I have no ill words for this book, Data Quality Fundamentals! It’s a practical roadmap for improving the IT lifecycle of data pipeline projects. It’s also the most valuable book on data quality for ETL projects (pipelines and all the rest) that I have seen. Anyone tasked with improving their data testing and quality should read this book.

About Wayne Yaddow

Wayne Yaddow is a freelance tech writer and ETL tester focusing on data quality and data testing. Much of his work can be seen on the DZone, Dataversity, TDWI, and Tricentis websites.