Data Debt: Understanding Enterprise Data Quality Problems
Data debt, also known as data technical debt, refers to quality challenges associated with data sources. Data debt impedes your organization's ability to leverage information effectively for better decision making, increases operational costs, and slows your ability to react to changes in your environment.
This article is organized into several topics:
- Defining technical debt
- Defining data debt
- Why data debt is important
- Types of data debt
- Strategies for avoiding data debt
- Strategies for removing data debt
- Strategies for accepting data debt
- Related Resources
1. Defining Technical Debt
Technical debt refers to the implied cost of future refactoring or rework to improve the quality of an asset to make it easy to understand, work with, maintain, and extend. The concept of technical debt was first proposed by Ward Cunningham to describe the impact of poor quality code on your overall software development efforts. Since then people have extended the metaphor to other types of debt. Examples of technical debt include, but are not limited to:
- Highly-coupled source code.
- The wiring in your kitchen that goes through several walls and your flooring to get to the breaker box in the opposite corner in your basement.
- Poorly formatted values in a table column.
- A software package that runs on an ancient version of Oracle that is no longer supported.
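To make the "poorly formatted values in a table column" example concrete, here is a minimal sketch of how such values might be detected. It assumes a hypothetical convention that dates are stored as ISO 8601 text (YYYY-MM-DD); the function name, the pattern, and the sample values are all invented for illustration.

```python
import re

# Assumed convention for this sketch: dates are stored as YYYY-MM-DD text.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def find_badly_formatted(values, pattern=ISO_DATE):
    """Return (index, value) pairs that are missing or do not match the pattern."""
    return [
        (i, v)
        for i, v in enumerate(values)
        if v is None or not pattern.fullmatch(v)
    ]


# Illustrative column contents, invented for this example.
birth_dates = ["1990-04-12", "12/04/1990", "1985-1-3", None, "2001-11-30"]
problems = find_badly_formatted(birth_dates)
print(problems)
```

A check like this only tests the shape of a value, not whether it is a real date, so it is a starting point for finding debt rather than a full validation.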
2. Defining Data Debt
Let’s define a few important terms:
- Data source. A place where data comes from. Examples include files, databases, data feeds, and services/functions.
- Legacy data source. A data source that is currently deployed and potentially being accessed by one or more systems.
- Data debt. Data debt refers to any technical debt associated with the design of legacy data sources or with the data contained within those data sources.
3. Why Data Debt is Important
When we have significant data debt we face several challenges:
- Longer time to market. It is much more difficult to work with low-quality data sources than high-quality ones. This is due to increased effort to understand the data sources, to evolve them, and then to test them to ensure they still work as expected.
- Increased cost. The extra time needed to work with lower-quality data sources translates directly into higher cost.
- Unpredictability. Because most data debt is hidden, it is difficult to predict how much effort it will take to work with, and to evolve, existing data sources. You often do not know how big the mess really is until you investigate the situation, and even then you typically run into unexpected problems once the actual work starts.
- Poor decision support. Poor-quality data, whether due to inconsistencies, lack of timeliness, inaccuracies, or many other issues (see below), reduces the ability of leadership to make data-informed decisions.
- Decreased collaboration. An indirect problem with technical debt is that it can decrease collaboration between teams, which is unfortunate because data debt often requires cross-functional collaboration to remove. This decreased collaboration is often the result of “finger pointing”: the developers didn’t work with the source of record, the data people were too hard to work with, this team didn’t keep the documentation up to date, and so on.
4. Types of Data Debt
There are several categories of data debt to consider, summarized in Table 1.
Table 1. Types of data technical debt.
Type | Examples
Structural. Quality issue with the design of a table, column, or view. |
Data quality. Quality issue with the consistency or usage of data values. |
Referential integrity. Quality issue with whether a referenced row exists within another table and whether a row that is no longer needed is (soft) deleted appropriately. |
Architectural. Quality issue with how external programs interact with a data source. |
Documentation. Quality issue with any supporting documents, including models. |
Method/functional. Quality issue with execution aspects within a data source, such as stored procedures, stored functions, and triggers. |
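As an illustration of the referential integrity category, the following sketch looks for "orphaned" rows, meaning rows that reference a row that does not exist in another table. It uses an in-memory SQLite database; the `customer` and `customer_order` tables, their columns, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical schema and data, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_order (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO customer_order VALUES (10, 1), (11, 2), (12, 99);
""")


def find_orphan_orders(conn):
    """Return (order id, customer id) for orders whose customer row is missing."""
    return conn.execute("""
        SELECT o.id, o.customer_id
        FROM customer_order AS o
        LEFT JOIN customer AS c ON c.id = o.customer_id
        WHERE c.id IS NULL
        ORDER BY o.id
    """).fetchall()


orphans = find_orphan_orders(conn)
print(orphans)
```

Note that this query also reports orders whose `customer_id` is NULL; whether that counts as debt depends on the rules of your own data source.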
5. Strategies for Avoiding Data Debt
There are several explicit opportunities throughout the agile/lean lifecycle that enable you to avoid technical debt. They are:
- Initial conceptual modeling. Early in your initiative you will identify the scope of what you intend to accomplish. Part of this effort is exploring the domain and modeling, at a high level, the data that is used by a solution. This gives the solution delivery team a better understanding of their data-oriented requirements and thereby helps them avoid data debt. This work occurs during “Sprint 0” on agile teams and during Initiation on Lean teams.
- Initial architectural modeling. Early in your initiative you should invest the time to think through your architectural strategy. Your business architecture exploration should focus on the data architecture aspects of a solution, enabling the delivery team to work through which data sources their solution will work with. This work occurs during “Sprint 0” on agile teams and during Initiation on Lean teams, putting a team in a position to avoid data debt by thinking through their architecture before they begin Construction.
- Continuous modeling. As you model and map data in detail throughout Construction, you can choose to follow clean database design strategies and adopt test-driven development (TDD) approaches, which are applicable to all aspects of your solution design, including data. Thinking through your design, even on a just-in-time (JIT) basis, will reduce the chance of injecting new technical debt into your data sources.
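One way to picture TDD applied to data is to write tests for the rules a table must enforce before, or alongside, the schema itself. The sketch below assumes a hypothetical `customer` table with a required, unique `email` column and uses an in-memory SQLite database; the helper `is_rejected` and the test names are invented for illustration.

```python
import sqlite3


def create_schema(conn):
    """Schema under test. Table and column names are hypothetical."""
    conn.execute(
        """
        CREATE TABLE customer (
            id    INTEGER PRIMARY KEY,
            email TEXT NOT NULL UNIQUE
        )
        """
    )


def is_rejected(conn, sql, params=()):
    """Return True if the statement violates a database constraint."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False


def test_email_is_required():
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    assert is_rejected(conn, "INSERT INTO customer (email) VALUES (?)", (None,))


def test_email_is_unique():
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    insert = "INSERT INTO customer (email) VALUES (?)"
    assert not is_rejected(conn, insert, ("ada@example.com",))
    assert is_rejected(conn, insert, ("ada@example.com",))


# Run the tests directly; a test runner would normally collect them instead.
test_email_is_required()
test_email_is_unique()
```

In a test-first flow the two tests would be written and seen to fail before the `NOT NULL` and `UNIQUE` constraints are added to `create_schema`.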
6. Strategies for Removing Data Debt
There are several aspects of technical debt, and several strategies available to you to remove each one. The opportunities for removing data debt include:
- Improving data source implementation. When existing data sources have quality problems there are several options for fixing them, from safely refactoring them to the usually riskier option of rewriting them.
- Improving deliverable documentation. As you learned earlier, the quality of your documentation, in particular documentation describing data sources and how to work with them, is an important aspect of your overall quality. There are several strategies for improving deliverable documentation.
- Improving data source format. The consistency of your naming conventions, field formats, data values, and other formatting aspects can also be improved, thereby addressing data technical debt.
- Reusing existing data sources. Greater reuse motivates investment in quality, and the production of high-quality assets motivates greater reuse of those assets. Do you want people to use existing sources of record? Then ensure those sources of record are high-quality and easy to work with.
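As a small illustration of improving data source format, the sketch below standardizes inconsistently entered values in a single column. It assumes a hypothetical `address` table whose `country_code` column should hold trimmed, upper-case codes; the table, the convention, and the sample rows are invented, and the example uses an in-memory SQLite database.

```python
import sqlite3

# Hypothetical table and data, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (id INTEGER PRIMARY KEY, country_code TEXT);
    INSERT INTO address VALUES (1, 'US'), (2, ' us'), (3, 'Ca '), (4, 'GB');
""")


def standardize_country_codes(conn):
    """Trim and upper-case country codes; return the number of rows changed."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute("""
            UPDATE address
            SET country_code = UPPER(TRIM(country_code))
            WHERE country_code <> UPPER(TRIM(country_code))
        """)
    return cur.rowcount


changed = standardize_country_codes(conn)
codes = [row[0] for row in conn.execute(
    "SELECT country_code FROM address ORDER BY id")]
print(changed, codes)
```

Running the update inside a transaction and touching only the rows that differ keeps the change small and reversible, which is in the spirit of safely refactoring a data source rather than rewriting it.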
7. Strategies for Accepting Data Debt
One strategy is to accept technical debt. The team makes a conscious decision to not avoid/remove technical debt at the current time which, as you can see in the technical debt quadrant of Figure 1, is a valid option. This is a decision that should be led by the architecture owner and confirmed by the product owner.
Figure 1. Martin Fowler’s technical debt quadrant, modified for data debt.
 | Reckless | Prudent
Deliberate | The data architects are too difficult to work with. We don’t have time to understand existing data sources. | We must ship now and deal with the consequences later.
Inadvertent | What is data normalization? What is a “source of record”? We have data conventions? | Now we know how we should have designed that data source.
8. Related Resources
- Clean Database Design
- Data Cleansing: Applying The “5 Whys” To Get To The Root Cause
- Data Quality: How to Assess DQ Techniques
- Data Repair
- Refactoring Databases: Evolutionary Database Design
- Systems Are Data Driven, People Are Data-Informed
- Technical Debt (Wikipedia)
- Technical Debt Quadrant by Martin Fowler
- Ward Explains Debt Metaphor by Ward Cunningham