Data Quality Lifecycle

Data quality products and initiatives fail primarily because most of them focus on measurement or observability but do not follow the lifecycle of data all the way through to fixing the issues. Furthermore, without knowing the business context of the data, any data quality checks or anomaly and outlier detections will generate more irrelevant alerts than actionable ones. We strongly believe that a true data quality lifecycle starts with understanding the data in its business context and ends with fixing or improving data quality, not simply measuring it.

We define the Data Quality Lifecycle in six simple steps –

  • Connect to Multiple Sources – The ability to connect to a wide variety of data sources with multiple options, e.g., scanning, or pulling data with or without metadata. This can be extended with the ability to interpret semantics or business context by leveraging the glossary of your existing data catalog or governance systems.
  • Discover Semantics – Understand and classify data in its business context. Is the data a phone number, an SSN, or a loan origination number? This identification is critical not only for business validation but also for weeding out false positives during forecasting, benchmarking, and outlier or anomaly detection. It also enables auto-discovered data quality checks, allowing all stakeholders to manage expectations and build consensus within the organization (see the first sketch after this list).
  • Measure Data Quality – Measure, score, and identify bad data using auto-discovered rules across attributes. Our platform measures at the attribute level, which provides the flexibility to roll scores up to the data set, data source, department/function, or organizational level (sketched after this list). The resulting score can be understood by all stakeholders and used for search, relevance ranking, and discovery of assets.
  • Monitor and Alert using Adaptive Thresholds – Set adaptive thresholds from benchmarking and forecasting trends, without the need for manual rules (also sketched after this list). This covers a wide variety of DataOps and Data Observability use cases such as data pipeline monitoring, source-to-target checks, and schema- or data-level deviations and abnormalities.
  • Remediate to Improve Data Quality – Use a set of curation libraries to clean as much as possible automatically, extended with remediation workflows and issue management through third-party productivity and collaboration platforms such as Jira, ServiceNow, and many more.
  • Derive Insights and Recommendations – Let both business and technical stakeholders slice and dice the bad data to make sense of it in their own ways. This is particularly useful for generating next best actions, both strategic and tactical.
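
To make semantic discovery concrete, here is a minimal sketch of how a column's business meaning might be inferred from a sample of its values. The patterns, names, and match threshold are illustrative assumptions, not how the platform actually implements classification, which would also draw on ML models and glossary lookups.

```python
import re

# Illustrative patterns only; a production system would combine pattern
# matching with ML classifiers and catalog/glossary lookups.
SEMANTIC_PATTERNS = {
    "us_phone": re.compile(r"^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def classify_column(values, min_match_ratio=0.9):
    """Guess a column's semantic type from a sample of its values."""
    non_null = [str(v) for v in values if v is not None]
    if not non_null:
        return "unknown"
    for semantic_type, pattern in SEMANTIC_PATTERNS.items():
        matches = sum(1 for v in non_null if pattern.match(v))
        if matches / len(non_null) >= min_match_ratio:
            return semantic_type
    return "unknown"

# A column that is mostly SSNs classifies as "ssn", which in turn drives
# which quality rules get auto-applied to it.
print(classify_column(["123-45-6789", "987-65-4321", None]))  # -> "ssn"
```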
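Steps three and four can be sketched in the same spirit: score each attribute against a rule, roll the scores up, and derive alert bands from a metric's own history. The function names, equal weights, and 3-sigma band below are illustrative assumptions rather than the platform's actual scoring model.

```python
from statistics import mean, stdev

def attribute_score(values, rule):
    """Percentage of non-null values that pass a validation rule (0-100)."""
    checked = [v for v in values if v is not None]
    if not checked:
        return 0.0
    return 100.0 * sum(1 for v in checked if rule(v)) / len(checked)

def dataset_score(attribute_scores, weights=None):
    """Roll attribute-level scores up to a data-set (or department) score."""
    weights = weights or {name: 1.0 for name in attribute_scores}
    total = sum(weights.values())
    return sum(s * weights[n] for n, s in attribute_scores.items()) / total

def adaptive_threshold(history, k=3.0):
    """Derive an alert band from a metric's own history (mean +/- k*stddev),
    so no manual rule has to be written."""
    mu, sigma = mean(history), stdev(history)
    return mu - k * sigma, mu + k * sigma

# Roll-up example: two attribute scores combine into one data-set score.
scores = {
    "phone": attribute_score(["555-123-4567", None, "n/a"], lambda v: "-" in v),
    "ssn": attribute_score(["123-45-6789", "987-65-4321"], lambda v: len(v) == 11),
}
print(dataset_score(scores))  # -> 75.0

# Monitoring example: today's row count of 2,000 falls outside the band.
low, high = adaptive_threshold([10_120, 9_980, 10_050, 10_200, 9_900])
print(low <= 2_000 <= high)  # -> False: raise an alert
```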

Without a focus on the entire data quality lifecycle, siloed or secondary data quality initiatives and outlier detection-based monitoring will never succeed.

Latest events

Webinar

DQLabs in Action: Delivering the Data that Matters

The DQLabs Platform delivers the data that matters. Data minds and business minds see data in different ways, even when working on the same data toward the same business outcomes. Growing data volumes make it more important than ever for organizations to find a platform that enables data democratization: a scalable cloud platform that provides the right data, to the right user, in the right context.

Join us for this webinar to learn how the DQLabs Platform harnesses the combined power of Data Quality and Data Observability to enable data producers, consumers, and leaders to achieve decentralized data ownership and turn data into action faster, easier, and more collaboratively.

Agenda

12:00 pm: Welcome & Introductions

12:05 pm: Industry Insights: Data Observability Decoded

12:15 pm: DQLabs in Action: Platform Showcase 

12:30 pm: Questions & Answers 

12:45 pm: Close

Best Practices

Data Observability is Irrelevant without Context

With today’s rapid growth in data, a focus on data quality is more critical than ever. Data Quality has existed as a practice since the first bit of digital data, and yet after decades, gaps in the marketplace remain because the focus on business impact is often forgotten. We quite often get lost in the mix of what to measure or how to measure, instead of focusing on the real purpose of what Data Quality was meant to be.

In fact…

Data Quality is never about the “Data” but all about “Business”.

As we navigate through this space, it is also very important to separate noise from true value. For folks who have not read the six guiding factors for navigating the noise in data quality, take a moment to read Part 1 — The Noise in Modern Data Quality of this series. For the rest, let’s continue to explore the new kid on the block — Data Observability.

First of all, is Observability new? Not really!

It originated in control theory and spread through industrial processes to software, and it centers on the idea of understanding, observing, or inferring the internal state of a system from its external outputs. But to be successful with Observability, you must be able to observe the behavior of a system and derive its health state from that behavior. It is also important to understand the difference between observability and monitoring: monitoring is best suited for known failure modes, while observability is for debugging using inferred insights. You can also refer to this comparison from Splunk on Monitoring vs. Observability, but the main goal is the ability to notice (observe) something about a system from its behavior.

Data Observability and Data Monitoring

OK, so why is observability relevant?

During the last few years, we have seen many transformations in the “modern data stack” space, meaning the set of tools and technologies that enable analytics, particularly for transactional data. It started with the enormous scalability and elasticity of cloud data warehouses and extends to modern data pipelines that let us carry large amounts of data from multiple sources and transform it directly inside the warehouse (ELT) with ease. This trend has since morphed into ETLT (extract-transform-load-transform, i.e., the best of both ELT and ETL), ELTG (ELT with a governance layer), and the merging of data lakes and warehouses into a Lakehouse or Unified Analytics Warehouse. With such complexity in mind, yes — it is relevant to have a system that observes and surfaces things to debug, so as to eliminate or prevent downtime and provide insights into the health of data sources.

However, when you extend Observability to Data Quality or, as I like to call it, “Business Quality”, you are no longer talking about just “data” but about “business”, a.k.a. “data with context”. Data in its raw form, without the application of a use case (i.e., how an organization uses and governs it, how it collects and handles it, or its functional value), has no “context”. Let’s take the very simple example of measuring total records (“volume” — one of the pillars of observability). If a million rows suddenly drop from your data, yes — you should be aware of it in the context of understanding the health of your data sources. The problem comes when you extend observability to the broader umbrella of data quality, where it poses some unique challenges. In this example, the drop in records, once debugged, could turn out to reflect a strategic leadership move to close a couple of retail stores that weren’t profitable. Consider another example: an email column used for marketing is treated very differently from an email column used for risk and authentication purposes. Same email, same format, but a different use case, so the applicable data quality measurements and their underlying purpose are very different. This poses a big challenge.
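
As a minimal sketch of what “data with context” means in practice, consider validating one email column under two different use cases. The rule sets, function names, and domain lists below are illustrative assumptions, not prescribed checks.

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def marketing_quality(email):
    # Marketing cares about deliverability: valid format, no role accounts.
    return bool(EMAIL_RE.match(email)) and not email.startswith(("info@", "noreply@"))

def risk_quality(email):
    # Risk/authentication cares about identity: valid format, no disposable domains.
    disposable_domains = {"mailinator.com", "tempmail.com"}  # illustrative list
    return bool(EMAIL_RE.match(email)) and email.split("@")[-1] not in disposable_domains

# Same email, same format, different verdicts depending on the use case.
email = "info@example.com"
print(marketing_quality(email))  # -> False: role account, poor for campaigns
print(risk_quality(email))       # -> True: fine as an authentication identifier
```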

Now apply that across petabytes of data sources, various monitors, and pipelines, and the time needed to debug each of these “alerts” becomes daunting. Extended into benchmarking to discover anomalies, it is bound to create even more false positives and negatives. To me, just throwing anomaly detection at your data doesn’t create value. Why? In the world of outlier detection, even with today’s advanced algorithms, model accuracy means little without some understanding or ground truth to compare the model’s predictions against. In other words, the expectation matters even before the measurement is made; without an expectation there is no baseline to draw conclusions from. This is the big challenge when you stretch observability beyond its limits into the wonderful world of Data Quality, where there are not only health checks but further levels of checks across all the objective and subjective dimensions of Data Quality. All of this requires understanding the context.
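
Here is a minimal sketch of that point, assuming a daily row-count metric: a plain z-score flags the drop as an anomaly, while an explicit, business-supplied expectation (the store closures above) turns the same number into a non-incident. The expectation value and tolerance are illustrative assumptions.

```python
from statistics import mean, stdev

daily_rows = [10_050, 9_980, 10_120, 10_200, 4_900]  # today's load dropped

# Blind outlier detection: flags the drop but cannot say whether it matters.
mu, sigma = mean(daily_rows[:-1]), stdev(daily_rows[:-1])
z = (daily_rows[-1] - mu) / sigma
print(f"z-score: {z:.1f}")  # strongly negative -> "anomaly!"

# Expectation-aware check: the business closed stores carrying ~50% of volume,
# so roughly half the usual row count is the *expected* state, not an incident.
expected = mu * 0.5   # business-provided expectation (illustrative)
tolerance = 0.15      # accept +/- 15% around the expectation
incident = abs(daily_rows[-1] - expected) / expected > tolerance
print(incident)  # -> False: no alert; the data matches the expectation
```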

If you speak to any expert or customer who has gotten their hands dirty in Data Quality, the first thing you will hear is —

“Measurement was never the challenge but making measurements purposeful is the one and only challenge”.

So, for anyone who sets out to build an “actionable” data quality assessment, starting without an ultimate purpose will result in more data chaos. This means a true Modern Data Quality platform has to go beyond observability into the world of inferred metadata, use artificial intelligence/machine learning (AI/ML) to understand context and semantics, expand the universe into knowledge graphs, and bring in new practices through automation to simplify data quality processes. This is an exciting time to be in the Modern Data Quality space, as we are just scratching the surface of the innovation we can bring.

To conclude, if you are talking about data observability and applying it to the most basic health checks of your data sources and pipelines, then it is still relevant. But if you are going to stretch observability across all the subjective and objective dimensions of Data Quality / Business Quality without context, then it’s irrelevant, and it’s time to think differently about your data culture.

Author

Raj Joseph,

President & CEO, DQLabs, Inc.
