Data Quality Techniques: Choosing the Right DQ Techniques
A fundamental principle of choosing your way of working (WoW) is that context counts. In other words, you want to choose the right technique(s) for the situation that you face – there is no such thing as a “best practice”. To do this you need to do two things: understand the context that you face and understand the strengths and weaknesses of the techniques available. In this article I address the second of these and put data quality techniques into context by comparing them with one another.
I have found it useful to compare data quality techniques on five comparison factors:
- Timeliness. Is this technique typically applied because you’re reacting to a discovered DQ issue or are you applying it to proactively avoid or reduce DQ issues?
- DataOps automation. How much automation can be applied to support this technique in your DataOps pipeline? A continuous technique would be fully automated and automatically invoked as appropriate.
- Effect on source. How much effect, if any, does this technique typically have on the actual source of the DQ issue? For example, data cleansing at point of usage (see assessment below) has no impact on the data source even though it may have significant impact on the data as it is used.
- Benefit realization. When do you attain the benefit, the quality improvement, resulting from the technique? Some techniques, such as data cleansing, provide immediate quality benefits. Other techniques, such as data stewardship, may require years for the quality benefits to be realized.
- Required skills. What is the level of skills required to successfully perform this DQ technique? Does it require sophisticated skills that may need to be gained through training or experience? Or is the technique straightforward requiring little effort, if any, to learn?
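To make the five factors concrete, here is a minimal sketch of how a DQ toolkit might record a 1–5 integer rating per factor per technique. The technique names are drawn from the article but the scores are illustrative assumptions on my part, not the article's actual ratings.

```python
# Illustrative only: each technique gets an integer 1-5 rating on each of
# the five comparison factors. These particular scores are made up.
FACTORS = ("timeliness", "automation", "effect_on_source",
           "benefit_realization", "required_skills")

ratings = {
    "data cleansing":       {"timeliness": 1, "automation": 4, "effect_on_source": 1,
                             "benefit_realization": 5, "required_skills": 3},
    "database refactoring": {"timeliness": 4, "automation": 3, "effect_on_source": 5,
                             "benefit_realization": 3, "required_skills": 2},
    "data stewardship":     {"timeliness": 5, "automation": 2, "effect_on_source": 4,
                             "benefit_realization": 1, "required_skills": 2},
}

# Sanity check: every technique rates every factor, integers 1-5 only.
for technique, scores in ratings.items():
    assert set(scores) == set(FACTORS)
    assert all(isinstance(v, int) and 1 <= v <= 5 for v in scores.values())
```

With the ratings in a structure like this, any pair of factors can be pulled out and plotted against each other, which is exactly what the charts below do.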
Figure 1. Data quality technique comparison factors.
How Data Quality Techniques Compare
Although it is certainly useful to understand how all data quality techniques fare on each of the five comparison factors, that can be daunting. The challenge is that there are dozens of data quality techniques. If we have twenty-five techniques in our DQ toolkit, for example, then we have 125 ratings (25 × 5). I don’t know about you, but I can’t keep that many numbers in my head at once – but I can look information up. What I’ve found useful is the type of chart that you see in Figure 2, which visualizes how the techniques compare on two factors at a time.
Figure 2. Comparing DQ techniques on benefit realization and required skills.
Good things to know when reading these charts:
- Ratings. The techniques are rated on each comparison factor on a range of 1 to 5, with 1 being the least effective rating and 5 the most effective rating. Only integer values are allowed.
- Technique lists/columns. When you see a column of techniques – as you do with MDM, Review (implementation) and Review (model) in the middle of the diagram – those techniques are all associated with the blue dot in the bottom left corner of the column. In this case the ratings for those techniques are 3 and 3.
- Green box. The techniques in the green box – in this case data cleansing via AI, data contracts, data labeling, and data guidance – are the most desirable when it comes to these two factors. The green box indicates techniques that received a 4 or a 5 on both ratings. If you face a situation where benefit realization and required skills are critical then these are the techniques I would consider first.
- Red box. The techniques in the red box are the least desirable, having scores of 1 or 2 on both comparison factors. They are techniques that would likely prove to be a very poor fit if benefit realization and required skills were important in the situation that you face.
- The middle ground. The techniques in the middle, ones that don’t land in either the green or the red box, would be secondary options to consider if the ones in the green box aren’t right for you.
- A visual bug. The data cleansing technique, although visually landing in the green box, is not included in the most desirable list because it has a rating of 3 for required skills (see point #2). As an aside, these charts are generated via Python’s Matplotlib. Over time I’m likely to fiddle with the code and improve the look of the charts, but right now this is just barely good enough (JBGE) for my purposes.
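The green-box/red-box reading rules above boil down to a tiny decision function. Here is a sketch, assuming the 1–5 integer ratings described earlier; the function name is mine, not from the article.

```python
def classify(rating_x: int, rating_y: int) -> str:
    """Place a technique on a two-factor chart per the reading rules:
    green box = 4 or 5 on BOTH factors, red box = 1 or 2 on both,
    and everything else is the middle ground."""
    if rating_x >= 4 and rating_y >= 4:
        return "green"
    if rating_x <= 2 and rating_y <= 2:
        return "red"
    return "middle"

# The visual-bug example: data cleansing rates 5 on benefit realization
# but only 3 on required skills, so it does not qualify as green.
print(classify(5, 3))  # middle
```

Note that a technique must score well on both factors at once; a single 5 is not enough, which is precisely why data cleansing misses the green list despite where it appears to sit on the chart.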
The other nine combinations of comparison factors follow:
Figure 3. Comparing DQ techniques on effect on source and required skills.
Figure 4. Comparing DQ techniques on DataOps automation and required skills.
Figure 5. Comparing DQ techniques on timeliness and required skills.
Figure 6. Comparing DQ techniques on effect on source and benefit realization.
Figure 7. Comparing DQ techniques on DataOps automation and benefit realization.
Figure 8. Comparing DQ techniques on timeliness and benefit realization.
Figure 9. Comparing DQ techniques on DataOps automation and effect on source.
Figure 10. Comparing DQ techniques on timeliness and effect on source.
Figure 11. Comparing DQ techniques on timeliness and DataOps automation.
Your Situation Drives Your Choices
Let’s put this all together. Figure 12 presents a high-level overview of the machine learning (ML) lifecycle. On the diagram you can see major categories of data sources – internal, external, and learnings from operation of the AI model itself. Indicated as well at key points are the data quality technique comparison factors that are relevant at that point in the lifecycle. For example, when it comes to internal data sources you prefer DQ techniques that have a direct effect on the data source, ideally in a proactive manner (avoid DQ problems before they make it into the database). This assumes that you own, or are responsible for, the data source and that it is where your current focus lies.

Looking at Figure 10, which compares timeliness and effect on source, I’m likely to choose database refactoring. I am also likely to choose to create synthetic training data for data sources that are owned by my ML initiative, but I wouldn’t put such data into an operational production data source as it wouldn’t belong there. I may choose to implement executable business rules in a production data source, if performance allows and if I have the authority to do so, to provide calculated values. However, I’m more likely to do so for data sources completely owned by the ML team itself. And of course I’m likely to consider adopting strategies just outside of the green box in Figure 10 if the ones inside it aren’t sufficient for my needs. In short, I prefer to start by considering the techniques that are most likely best for the situation that I face, then work outwards from there as needed.
Figure 12. The machine learning lifecycle with data sources and data quality considerations indicated (click to enlarge).
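The “start in the green box, then work outwards” search can itself be sketched as a ranked filter. This is my own illustration with invented ratings: for internal data sources I want high timeliness (proactive) paired with a high effect on source, as discussed above.

```python
# Invented (timeliness, effect_on_source) ratings on a 1-5 scale;
# 5 = fully proactive / direct effect on the source.
techniques = {
    "database refactoring":      (4, 5),
    "executable business rules": (4, 4),
    "data cleansing":            (1, 1),
    "data stewardship":          (5, 4),
}

def ranked(techs, min_score=4):
    """Return green-box candidates first (both ratings >= min_score),
    then the remaining techniques by combined score, working outwards."""
    green = {t for t, (x, y) in techs.items() if x >= min_score and y >= min_score}
    order = sorted(techs, key=lambda t: sum(techs[t]), reverse=True)
    return [t for t in order if t in green] + [t for t in order if t not in green]

print(ranked(techniques)[0])  # database refactoring
```

The ordering mirrors the narrative: consider the green-box techniques first, and only fall back to techniques further from the corner when those don’t fit.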
When it comes to extracting data from internal data sources during construction the situation is different. Three factors are in play, not just two. I’m ideally looking for techniques that are reactive, supported by continuous automation, and that offer immediate benefit. To identify candidate techniques I’m going to look at two and perhaps three of Figures 7 (automation vs. benefit realization), 8 (timeliness vs. benefit realization), and 11 (timeliness vs. automation). I’ll leave this as an exercise for you to work through.
Note that the reactive preference is interesting because it’s at the negative end of the timeliness spectrum. In this case I’m in construction so I’m likely dealing with the short-term need to deliver on or before a specific date. Proactive strategies, at the positive end of the spectrum, tend to require long-term investment. Given that there is a DQ issue right now, if any proactive DQ strategies are in place they’re not addressing this issue. The implication is that looking for green box strategies when timeliness is a factor isn’t going to work. For example, when it comes to timeliness vs level of automation I’m looking for timeliness ratings of 1 or 2 (leaning towards reactive) and level of automation ratings of 4 or 5 (leaning towards continuous) to start with. In this case the chart is still useful, but the green-box heuristic isn’t.
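When one factor is wanted at its low end, the selection flips for that axis. Here is a sketch, again with invented ratings: I keep only techniques with a timeliness of 1 or 2 (reactive) and an automation of 4 or 5 (continuous) – the corner described above rather than the green box.

```python
# Invented (timeliness, automation) ratings on a 1-5 scale.
# Timeliness: 1 = reactive, 5 = proactive. Automation: 5 = continuous.
techniques = {
    "data cleansing":       (1, 4),
    "database testing":     (2, 5),
    "data stewardship":     (5, 1),
    "database refactoring": (4, 3),
}

# Reactive AND continuous: LOW timeliness, HIGH automation --
# a different corner of the chart than the green (high/high) box.
candidates = [t for t, (timeliness, automation) in techniques.items()
              if timeliness <= 2 and automation >= 4]

print(candidates)  # ['data cleansing', 'database testing']
```

The chart remains useful as a lookup, but the selection predicate has to be written for the corner the situation actually calls for, which is the point of this section.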
Choosing the Right DQ Techniques in Practice
The fundamental point is that you want to choose the right WoW for the situation that you face. To do so you need to understand the needs of the situation and the trade-offs associated with the potential techniques that may apply. In this article I showed that when it comes to DQ techniques there are five factors to consider when choosing the right one for your context: timeliness, level of automation, effect on source, benefit realization, and required skills.
I would be remiss not to point out that there’s really a sixth factor to consider, your team culture, and arguably a seventh, organizational culture. Your organizational culture is reflected in your team culture, and it’s very often a constraining factor in practice, so my focus is usually on team culture. I mention this factor only now because it’s both important and something I can’t address via this strategy, as it’s unique to you. It pertains to your context but not to the trade-offs of the techniques themselves.
Related Resources
- The Agile Database Techniques Stack
- Clean Database Design
- Configuration management
- Continuous Database Integration (CDI)
- Data Cleansing: Applying The “5 Whys” To Get To The Root Cause
- Data Quality in an Agile World
- Data Quality: How to Assess DQ Techniques
- Data Quality Techniques
- Data Repair
- Data Technical Debt
- Database refactoring
- Database testing
- Test-Driven Development (TDD)
Recommended Reading
This book, Choose Your WoW! A Disciplined Agile Approach to Optimizing Your Way of Working (WoW) – Second Edition, is an indispensable guide for agile coaches and practitioners. It overviews key aspects of the Disciplined Agile® (DA™) tool kit. Hundreds of organizations around the world have already benefited from DA, which is the only comprehensive tool kit available for guidance on building high-performance agile teams and optimizing your WoW. As a hybrid of the leading agile, lean, and traditional approaches, DA provides hundreds of strategies to help you make better decisions within your agile teams, balancing self-organization with the realities and constraints of your unique enterprise context.
I also maintain an agile database books page which overviews many books you will find interesting.