5 Pillars of Data Observability: How Recency, Distribution, Volume, Schema, and Lineage Maintain Healthy Data

At the heart of every effective process are guiding principles that help make it successful. For data observability, those guiding principles can be simplified into five pillars: recency, distribution, volume, schema, and lineage. Evaluated together, they help organizations guarantee the health of their data.

What is data observability?

Data observability, which simply means ensuring the correctness and completeness of data, is becoming increasingly important for today’s enterprises, which rely on enormous amounts of data to inform their business decisions: approximately 350 terabytes, in fact. Considering that a terabyte equates to 500 hours of movies or 132,150 650-page books, you can see how daunting it is for an average company to organize that much information and ensure its accuracy.

The five core pillars of data observability help with that process by alerting you to your data’s health and reliability while also providing the analysis and insights needed to pinpoint and resolve issues before they cause distress throughout your organization. This ensures complete, correct, and useful data and mitigates data downtime.

Let’s take a closer look at each of these five pillars of data observability to help you craft your own effective data observability strategy.

1. Recency 

Also referred to as “freshness,” recency looks at the timeliness of your data. The more current the data, the better: stale data can lead to decisions based on information that is no longer relevant.

Recency also analyzes the rhythm of the data. If there are unusual temporal gaps in your tables, exploring recency will help you unearth this issue. For example, if you expect a given pipeline to ingest several thousand records per hour during an average business day, but you find it’s ingesting hundreds, dozens, or no records at all, recency measures can help alert you to those scenarios.
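As a rough illustration, a recency check can combine both ideas above: how old the newest record is, and whether the last hour's ingest matches the expected rhythm. The sketch below is a minimal example, not a prescribed implementation; the function name `check_recency` and the `max_staleness`, `expected_per_hour`, and `min_fraction` parameters are assumptions made up for this example.

```python
from datetime import datetime, timedelta

def check_recency(timestamps, now, max_staleness, expected_per_hour, min_fraction=0.5):
    """Return a list of recency alerts (empty if the data looks healthy).

    All names and thresholds here are hypothetical, chosen for illustration.
    """
    if not timestamps:
        return ["no records found"]

    alerts = []

    # Freshness: how long ago did the newest record arrive?
    age = now - max(timestamps)
    if age > max_staleness:
        alerts.append(f"stale: newest record is {age} old")

    # Rhythm: count the records that arrived during the last hour.
    window_start = now - timedelta(hours=1)
    last_hour = sum(1 for t in timestamps if window_start < t <= now)
    if last_hour < expected_per_hour * min_fraction:
        alerts.append(f"low ingest: {last_hour} records in the last hour")

    return alerts
```

In practice the timestamps would come from a load-time or event-time column in the table being monitored, and the thresholds would be tuned to each pipeline's normal behavior.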

2. Distribution 

Distribution measures can identify anomalies that may indicate an unexpected change in an upstream data source or that your data is incomplete (in which case your share of null values might suddenly spike). By establishing normal ranges for your data, distribution checks can clue you in to an issue when your information unexpectedly starts falling outside those ranges. If once-consistent distribution patterns begin fluctuating unexpectedly (e.g., a 50/50 split within a particular segment begins reporting 90/10), this should trigger an immediate investigation.
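To make this concrete, here is a minimal sketch of a distribution check that watches both signals mentioned above: the null rate and the share of each segment compared with a baseline. The function names and the `tolerance` of 0.15 are assumptions for illustration only; a real threshold would come from your own historical data.

```python
from collections import Counter

def null_rate(values):
    """Fraction of entries that are None."""
    return sum(v is None for v in values) / len(values) if values else 0.0

def segment_shares(values):
    """Share of each non-null value, e.g. {'a': 0.5, 'b': 0.5}."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()} if total else {}

def distribution_alerts(values, baseline_shares, baseline_null_rate, tolerance=0.15):
    """Compare a column against a baseline; names and tolerance are hypothetical."""
    alerts = []
    if null_rate(values) - baseline_null_rate > tolerance:
        alerts.append("null rate jumped")
    shares = segment_shares(values)
    for segment, expected in baseline_shares.items():
        observed = shares.get(segment, 0.0)
        if abs(observed - expected) > tolerance:
            alerts.append(
                f"segment {segment!r} share shifted from {expected:.0%} to {observed:.0%}"
            )
    return alerts
```

With a 50/50 baseline, a column that arrives as 90/10 would raise one alert per segment, which is the kind of shift that should prompt an investigation.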

3. Volume 

Did your data tables suddenly drop from 2 million rows to 500,000 rows? Chances are you’ve got a volume problem. Volume checks, like distribution checks, confirm that the amount of data you’re receiving is in line with historical expectations, so if the amount of data in your database sees a significant shift, something is amiss with your intake.

Besides a sudden drop in data volume, null values appearing in tables where there weren’t any previously is another sign that your data is askew.
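A simple way to express "in line with historical expectations" is to compare the current row count with the median of recent counts. The sketch below assumes you already record a row count per load; `volume_alert` and its `max_change` threshold of 50% are hypothetical choices for this example, not a standard.

```python
from statistics import median

def volume_alert(history, current, max_change=0.5):
    """Return True if the current row count deviates too far from the baseline.

    history: recent row counts for the same table (assumed to be non-empty).
    max_change: allowed relative change; 0.5 is an arbitrary example value.
    """
    baseline = median(history)
    if baseline == 0:
        # Any rows at all are a change when the baseline is empty.
        return current != 0
    change = abs(current - baseline) / baseline
    return change > max_change

# The drop described above: roughly 2 million rows down to 500,000.
print(volume_alert([2_000_000, 1_950_000, 2_050_000], 500_000))  # True
```

The median is used instead of the mean so that a single unusual load in the history does not distort the baseline.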

4. Schema 

Schemas define how your data is structured and organized among tables, columns, and views. Having a clear process for who updates the schema and how they do it is a key tenet of your data observability strategy, as changes in the source data’s structure are often the cause of downtime. For instance, adding or removing a field upstream can cause a pipeline to fail if the change is not accounted for in the pipeline’s logic. Always check the schema before making any sweeping changes to your data processes.
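One lightweight way to catch the added or removed field described above is to compare the schema a pipeline expects with the schema it actually observes. This is a minimal sketch that assumes each schema is available as a plain mapping of column name to type name; the function `schema_diff` and the example column names are made up for illustration.

```python
def schema_diff(expected, observed):
    """Report columns that were added, removed, or changed type.

    Both arguments are dicts of {column_name: type_name}; this shape is an
    assumption for the example, not a requirement of any particular tool.
    """
    expected_cols, observed_cols = set(expected), set(observed)
    return {
        "added": sorted(observed_cols - expected_cols),
        "removed": sorted(expected_cols - observed_cols),
        "retyped": sorted(
            c for c in expected_cols & observed_cols if expected[c] != observed[c]
        ),
    }

# Hypothetical example: an upstream team renamed a field and changed its type.
expected = {"id": "int", "email": "str", "signup_date": "date"}
observed = {"id": "int", "email": "str", "signup_ts": "timestamp"}
print(schema_diff(expected, observed))
```

Running a comparison like this before each load lets the pipeline stop with a clear message instead of breaking partway through.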

5. Lineage 

Like pulling up previous versions of a Google Doc to see who edited it, when, and how, lineage is the pillar that can help you trace how your data has been impacted by changes. Specifically, lineage gets to the “where” of data problems (i.e., what changes might have been made upstream to impact the distribution of data downstream). By identifying which areas of data collection are impacted and where those changes were made, lineage is key to pinpointing and resolving data issues.
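Lineage is often modeled as a directed graph in which each dataset points to the datasets built from it. Under that assumption, finding everything affected by a change (downstream) or everything that could have caused a problem (upstream) is a graph traversal. The sketch below uses a plain dict for the graph; the table names and the function names are hypothetical.

```python
from collections import deque

def downstream(lineage, start):
    """Return every dataset reachable from `start` via breadth-first search.

    lineage: dict of {dataset: [datasets built directly from it]}.
    """
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def upstream(lineage, start):
    """Return every dataset that `start` depends on, by reversing the edges."""
    reversed_edges = {}
    for parent, children in lineage.items():
        for child in children:
            reversed_edges.setdefault(child, []).append(parent)
    return downstream(reversed_edges, start)

# Hypothetical lineage graph for illustration.
lineage = {
    "raw_orders": ["clean_orders"],
    "clean_orders": ["daily_revenue", "customer_ltv"],
    "daily_revenue": ["exec_dashboard"],
}
print(sorted(upstream(lineage, "exec_dashboard")))
```

When a dashboard looks wrong, the upstream set narrows down where to look; when a source table changes, the downstream set shows who needs to be warned.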

When combined, these five pillars of data observability provide an effective framework for building a strong data observability solution. They go beyond simple monitoring to deliver a robust and holistic approach to safeguarding data and preventing poor-quality data from accumulating.

Want to learn more? Take a deeper dive into data observability and see examples of it in action in our free ebook “Data Observability: The Heartbeat of Healthy Data”.

September 1st, 2022
