At the end of the day, your analytics code needs to be tested like any other code. If you don’t validate this code, and the data it generates, it can be costly (like $9.7-million-per-year costly 😬, according to Gartner Research).
To avoid this fate, companies and their engineers can leverage a number of proactive and reactive data validation techniques. We strongly recommend the former, as we’ll explain below. A proactive approach to data validation will help companies ensure that the data they have is clean and ready to work with.
“An ounce of prevention is worth a pound of cure.” It’s an old saying that’s true in almost any situation, including data validation techniques for analytics. Another way to say it is that it’s better to be proactive than it is to be reactive.
The purpose of any data validation is to identify where data might be inaccurate, inconsistent, incomplete, or even missing.
By definition, reactive data validation takes place after the fact and uses anomaly detection to identify any issues your data may have, and to help ease the symptoms of bad data. While these methods are better than nothing, they don’t solve the core problems causing the bad data in the first place.
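To make the reactive approach concrete, here is a minimal sketch of after-the-fact anomaly detection. The daily counts, the `find_anomalies` helper, and the two-standard-deviation threshold are all illustrative assumptions, not taken from any particular tool.

```python
from statistics import mean, stdev


def find_anomalies(daily_counts, threshold=2.0):
    """Return the indexes of days whose count is more than `threshold`
    sample standard deviations away from the mean."""
    if len(daily_counts) < 2:
        return []
    mu = mean(daily_counts)
    sigma = stdev(daily_counts)
    if sigma == 0:
        return []  # every day is identical, so nothing stands out
    return [
        i for i, count in enumerate(daily_counts)
        if abs(count - mu) / sigma > threshold
    ]


# Hypothetical daily counts for one event; on day 5 the tracking call broke.
counts = [1020, 980, 1005, 995, 1010, 12, 990]
anomalies = find_anomalies(counts)
```

Note that a check like this can only fire once the bad data has already landed, which is why it treats the symptom rather than the cause.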
Instead, we believe teams should try to embrace proactive data validation techniques for their analytics, such as type safety and schematization, to ensure the data they get is accurate, complete and in the expected structure (and that future team members don’t have to wrestle with bad analytics code).
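As a rough illustration of type safety and schematization, the sketch below defines one event as a typed Python class that refuses to be built with the wrong shape. The `SongPlayed` event, its properties, and the payload layout are hypothetical and chosen only for this example; they are not any specific tool’s generated code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SongPlayed:
    """Hypothetical event from a tracking plan."""
    song_id: str
    duration_seconds: int

    def __post_init__(self):
        # Validate at construction time so bad data never leaves the client.
        if not isinstance(self.song_id, str) or not self.song_id:
            raise TypeError("song_id must be a non-empty string")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.duration_seconds, bool) or not isinstance(
            self.duration_seconds, int
        ):
            raise TypeError("duration_seconds must be an integer")
        if self.duration_seconds < 0:
            raise ValueError("duration_seconds must not be negative")

    def to_payload(self):
        """Build the dictionary that would be handed to an analytics SDK."""
        return {
            "event": "Song Played",
            "properties": {
                "song_id": self.song_id,
                "duration_seconds": self.duration_seconds,
            },
        }


payload = SongPlayed(song_id="abc123", duration_seconds=215).to_payload()
```

Because the check runs when the event is constructed, a wrong type fails in development or CI rather than surfacing weeks later in a report.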
While it might seem obvious to choose the more comprehensive validation approach, many teams end up using reactive data validation. This can be for a number of reasons. Analytics code is often an afterthought for non-data teams and is therefore left untested.
It’s also common, unfortunately, for data to be processed without any validation. In addition, poor analytics code only gets noticed when it’s really bad, usually weeks later when someone notices a report is egregiously wrong or even missing. Dribbble, one of our customers, faced that exact situation.
"We would launch a new feature and two weeks later we’d realize the tracking wasn’t being triggered as expected. That caused us to lose confidence and trust in our data."
While all these methods may help you solve your data woes (and often with objectively great tooling), they still won’t fix the core cause of your bad data, such as piecemeal data governance or analytics that’s implemented on a project-by-project basis without cross-team communication. That leaves you coming back to them every time.
Reactive data validation alone is not sufficient; you need to employ proactive data validation techniques in order to be truly effective and avoid the costly problems mentioned earlier. Here’s why:
Now that we’ve established why proactive data validation is important, the next question is: How do you do it? What are the tools and methods teams employ to ensure their data is good before problems arise?
Let’s dive in. 🤿
Data validation isn’t just one step that happens at a specific point. It can happen at multiple points in the data lifecycle—at the client, at the server, in the pipeline, or in the warehouse itself.
It’s actually very similar to software testing writ large in a lot of ways. There is, however, one key difference. You aren’t testing the outputs alone; you’re also confirming that the inputs of your data are correct.
Let’s take a look at what data validation looks like at each location, examining which are reactive and which are proactive.
You can use tools like Iteratively to leverage type safety, unit testing, and linting (static code analysis) for client-side data validation.
Some of the tools that you might use at the client layer include:
Now, these tools are a great jumping-off point, but it’s important to understand what kind of testing they’re enabling you to do at this layer. Here’s a breakdown:
As an example, Box began using Iteratively for data validation to improve their data governance, mainly because of our type safety functionality and CI integration. This approach helps them ensure that the events they’re capturing match their data security and compliance needs, as well as generating clean data that their teams can actually use downstream.
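Unit testing at the client layer can be as simple as asserting that a feature fires the event you expect. The sketch below uses a fake tracker that records calls; `RecordingTracker`, `complete_checkout`, and the "Checkout Completed" event are hypothetical names made up for this example.

```python
class RecordingTracker:
    """Test double that records events instead of sending them anywhere."""

    def __init__(self):
        self.events = []

    def track(self, name, properties):
        self.events.append((name, properties))


def complete_checkout(tracker, order_id, total_cents):
    """Hypothetical feature code that should emit exactly one event."""
    # ... the real business logic would run here ...
    tracker.track(
        "Checkout Completed",
        {"order_id": order_id, "total_cents": total_cents},
    )


def test_checkout_emits_expected_event():
    tracker = RecordingTracker()
    complete_checkout(tracker, order_id="order-1", total_cents=2599)
    assert tracker.events == [
        ("Checkout Completed", {"order_id": "order-1", "total_cents": 2599})
    ]
```

A test like this runs with the rest of your suite in CI, so a tracking call that stops firing shows up as a failed build rather than as a gap in a dashboard.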
Data validation in the pipeline is all about making sure that the data being sent by the client matches the data format in your warehouse. If the two aren’t on the same page, your data consumers (product managers, data analysts, etc.) aren’t going to get useful information on the other side.
Segment Protocols helps data teams diagnose data quality issues (reactive), enforce data constraints (proactive), and transform data as needed (reactive). Finally, Iglu supports schema validation to ensure your analytics match the planned definition you set for your code, both in development (proactive) and production (reactive).
Data validation methods in the pipeline may look like:
Going back to the Box example, they use Iteratively to verify the schema in the client before they send it to their product analytics platform. They also validate that the event that fires actually matches the payload, which helps prevent breakdowns in data collection.
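The idea of checking that an event matches its planned payload can be sketched in a few lines. This is a deliberately simplified validator written for illustration; the `SCHEMAS` registry, the `validate_event` function, and the event shape are assumptions, and this is not the API of Iteratively, Segment Protocols, or Iglu.

```python
# Hypothetical schema registry: event name -> required properties and types.
SCHEMAS = {
    "Checkout Completed": {"order_id": str, "total_cents": int},
}


def validate_event(event, schemas=SCHEMAS):
    """Return a list of human-readable problems; an empty list means valid."""
    name = event.get("event")
    schema = schemas.get(name)
    if schema is None:
        return [f"unknown event: {name!r}"]

    errors = []
    properties = event.get("properties", {})
    for key, expected_type in schema.items():
        if key not in properties:
            errors.append(f"missing property: {key}")
        elif not isinstance(properties[key], expected_type):
            actual = type(properties[key]).__name__
            errors.append(
                f"{key}: expected {expected_type.__name__}, got {actual}"
            )
    for key in properties:
        if key not in schema:
            errors.append(f"unexpected property: {key}")
    return errors


good = {"event": "Checkout Completed",
        "properties": {"order_id": "order-1", "total_cents": 2599}}
bad = {"event": "Checkout Completed",
       "properties": {"order_id": 42, "coupon": "SAVE10"}}
good_errors = validate_event(good)
bad_errors = validate_event(bad)
```

Run before the event is sent, a check like this is proactive; run against events already flowing through the pipeline, the same check becomes a reactive diagnostic.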
You can use dbt testing, Dataform testing, and Great Expectations to ensure that data being sent to your warehouse conforms to the conventions you expect and need. You can also do transformations at this layer, including type checking and type safety within those transformations, but we wouldn’t recommend this method as your primary validation technique since it’s reactive.
At this point, the validation methods available to teams include validating that the data conforms to certain conventions, then transforming it to match them. Teams can also use relationship and freshness tests with dbt, as well as value/range testing using Great Expectations.
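To show what relationship, value/range, and freshness tests actually check, here is a small sketch against an in-memory SQLite database. The `users` and `orders` tables, the column names, and the thresholds are hypothetical, and the helpers only imitate the spirit of these tests; they are not dbt or Great Expectations code.

```python
import sqlite3
from datetime import datetime, timedelta


def relationship_failures(conn):
    """Orders whose user_id has no matching row in users."""
    return conn.execute(
        """
        SELECT o.id FROM orders o
        LEFT JOIN users u ON u.id = o.user_id
        WHERE u.id IS NULL
        ORDER BY o.id
        """
    ).fetchall()


def range_failures(conn, low=0, high=1_000_000):
    """Orders whose total falls outside the accepted range."""
    return conn.execute(
        "SELECT id FROM orders WHERE total_cents NOT BETWEEN ? AND ? ORDER BY id",
        (low, high),
    ).fetchall()


def is_fresh(conn, now, max_age=timedelta(hours=24)):
    """True if the newest order was loaded within max_age of `now`."""
    (latest,) = conn.execute("SELECT MAX(loaded_at) FROM orders").fetchone()
    if latest is None:
        return False  # an empty table is never fresh
    return now - datetime.fromisoformat(latest) <= max_age


def build_demo_warehouse():
    """Create demo tables with one orphaned order and one negative total."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            user_id INTEGER,
            total_cents INTEGER,
            loaded_at TEXT
        );
        INSERT INTO users VALUES (1), (2);
        INSERT INTO orders VALUES
            (10, 1, 2599, '2024-01-01T08:00:00'),
            (11, 3, 1200, '2024-01-01T09:00:00'),
            (12, 2, -500, '2024-01-01T10:00:00');
        """
    )
    return conn


conn = build_demo_warehouse()
orphans = relationship_failures(conn)
out_of_range = range_failures(conn)
fresh = is_fresh(conn, now=datetime(2024, 1, 1, 12, 0, 0))
```

Each helper returns the offending rows, so a passing test is simply an empty result. All three run after the data has been loaded, which is why we treat them as a reactive safety net.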
All of this tool functionality comes down to a few key data validation techniques at this layer:
A great example of many of these tests in action can be found by digging into Lyft’s discovery and metadata engine Amundsen. This tool lets data consumers at the company search user metadata to increase both its usability and security. Lyft’s main method of ensuring data quality and usability is a kind of versioning via a graph-cleansing Airflow task that deletes old, duplicate data when new data is added to their warehouse.
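The general idea of clearing out old duplicates whenever fresh data arrives can be sketched as follows. This is not Amundsen’s implementation; the record shape (`key`, `loaded_at`, `owner`) and the `drop_stale_duplicates` helper are assumptions made purely for illustration.

```python
def drop_stale_duplicates(records):
    """Keep only the most recently loaded record for each key."""
    newest = {}
    for record in records:
        key = record["key"]
        current = newest.get(key)
        if current is None or record["loaded_at"] > current["loaded_at"]:
            newest[key] = record
    return list(newest.values())


# Hypothetical metadata records; table_a was reloaded with a new owner.
records = [
    {"key": "table_a", "loaded_at": 1, "owner": "alice"},
    {"key": "table_b", "loaded_at": 1, "owner": "bob"},
    {"key": "table_a", "loaded_at": 2, "owner": "carol"},
]
cleaned = drop_stale_duplicates(records)
```

In a scheduled job, a step like this would run after each load so that consumers only ever see the latest version of each record.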
In the past, data teams struggled with data validation because their organizations didn’t realize the importance of data hygiene and governance. That’s not the world we live in anymore.
Companies have come to realize that data quality is critical. Just cleaning up bad data in a reactive manner isn’t good enough. Hiring teams of data engineers to clean up the data through transformation or writing endless SQL queries is an unnecessary and inefficient use of time and money.
It used to be acceptable to have data that’s 80% accurate (give or take, depending on the use case), leaving a 20% margin of error. That might be fine for simple analysis, but it’s not good enough for powering a product recommendation engine, or detecting anomalies, or making critical business or product decisions.
Companies hire engineers to create products and do great work. If they have to spend time dealing with bad data, they’re not making the most of their time. But data validation gives them that time back to focus on what they do best: creating value for the organization.
The good news is that high-quality data is within reach. To achieve it, companies need to help everyone understand its value by breaking down the silos between data producers and data consumers. Then, companies should throw away the spreadsheets and apply better engineering practices to their analytics, such as versioning and schematization. Finally, they should make sure data best practices are followed throughout the organization with a plan for tracking and data governance.
In today’s world, reactive, implicit data validation tools and methods are just not enough anymore. They cost you time, money, and, perhaps most importantly, trust.
To avoid this fate, embrace a philosophy of proactivity. Identify issues before they become expensive problems by validating your analytics data from the beginning and throughout the software development life cycle. And get in touch if you want to know more about how Iteratively can help with proactive validation and improve your data quality and governance.
Did we miss anything? Let us know by sending us an email.