The Agile Data (AD) Method

Data Debt: Understanding Enterprise Data Quality Problems

Data debt, also known as data technical debt, refers to quality challenges associated with data sources. Data debt impedes the ability of your organization to leverage information effectively for better decision making, increases operational costs, and limits your ability to react to changes in your environment.

This article is organized into several topics:

  1. Defining technical debt
  2. Defining data debt
  3. Why data debt is important
  4. Types of data debt
  5. Strategies for avoiding data debt
  6. Strategies for removing data debt
  7. Strategies for accepting data debt
  8. Related Resources

1. Defining Technical Debt

Technical debt refers to the implied cost of future refactoring or rework to improve the quality of an asset to make it easy to understand, work with, maintain, and extend. The concept of technical debt was first proposed by Ward Cunningham to describe the impact of poor quality code on your overall software development efforts. Since then people have extended the metaphor to other types of debt. Examples of technical debt include, but are not limited to:

  • Highly coupled source code.
  • The wiring in your kitchen that goes through several walls and your flooring to get to the breaker box in the opposite corner in your basement.
  • Poorly formatted values in a table column.
  • A software package that runs on an ancient version of Oracle that is no longer supported.

2. Defining Data Debt

Let’s define a few important terms:

  • Data source. A place where data comes from. Examples include files, databases, data feeds, and services/functions.
  • Legacy data source. A data source that is currently deployed and potentially being accessed by one or more systems.
  • Data debt. Data debt refers to any technical debt associated with the design of legacy data sources or with the data contained within those data sources.

3. Why Data Debt is Important

When we have significant data debt we face several challenges:

  1. Longer time to market. It is much more difficult to work with low-quality data sources than high-quality ones. This is due to increased effort to understand the data sources, to evolve them, and then to test them to ensure they still work as expected.
  2. Increased cost. The increased time to work with lower-quality data sources results in increased cost to do so.
  3. Unpredictability. Because most data debt is hidden, it is difficult to predict how much effort it will take to work with, and to evolve, existing data sources. You often do not know how big the mess really is until you investigate the situation, and even then you typically run into unexpected problems once you start the actual work.
  4. Poor decision support. Poor quality data, whether due to inconsistencies, lack of timeliness, inaccuracies, or many other issues (see below), reduces the ability of leadership to make data-informed decisions.
  5. Decreased collaboration. An indirect problem with data debt is that it can decrease collaboration between teams, which is unfortunate because data debt often requires cross-functional collaboration to remove. This decreased collaboration is often the result of “finger pointing”: the developers didn’t work with the source of record, the data people were too hard to work with, this team didn’t keep the documentation up to date, and so on.

4. Types of Data Debt

There are several categories of data debt to consider, summarized in Table 1.

Table 1. Types of data technical debt.

Type Examples
Structural. Quality issue with the design of a table, column or view.
  • Extra view, column, …
  • Improperly named column, table, …
  • Improperly split table
  • Insufficiently normalized operational data
  • Overly normalized reporting data
Data quality. Quality issue with the consistency or usage of data values.
  • Null business key value
  • Duplicated business key value
  • Different key values for the same business entity, where traceability between them isn’t maintained
  • Corrupted data
Referential integrity. Quality issue with whether a referenced row exists within another table and whether a row that is no longer needed is (soft) deleted appropriately.
  • Dropped or missing triggers
  • Inconsistent value in calculated column
  • Missing foreign key constraints
  • Null foreign key values
Architectural. Quality issue with how external programs interact with a data source.
  • Data-intensive calculation outside of database
  • Inappropriate encapsulation
  • Inappropriate security access control
  • Missing index
Documentation. Quality issue with any supporting documents, including models.
  • Difficult to navigate or find
  • Inconsistent information
  • Outdated information
  • Overly detailed information
  • Missing information
  • Static, non-executable documentation
Method/functional. Quality issue with execution aspects within a data source, such as stored procedures, stored functions, and triggers.
  • Inconsistent naming conventions
  • Incorrect calculation
  • Overly complex code
  • Poorly named trigger or stored procedure
  • Slow calculation, procedure, …
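
Several of the data quality and referential integrity issues in Table 1 can be detected with simple queries. The following is a minimal sketch, not a definitive implementation, using Python's standard sqlite3 module. The customer and customer_order tables, their columns, and the sample rows are all hypothetical and exist only to illustrate the checks.

```python
import sqlite3

# Hypothetical schema for illustration: customer.customer_number is the
# business key, and customer_order.customer_id should reference customer.id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, customer_number TEXT);
    CREATE TABLE customer_order (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customer (id, customer_number) VALUES
        (1, 'C-100'), (2, 'C-100'), (3, NULL), (4, 'C-200');
    INSERT INTO customer_order (id, customer_id) VALUES
        (10, 1), (11, NULL), (12, 99);
""")


def null_business_keys(conn):
    """Ids of customer rows whose business key is NULL."""
    rows = conn.execute(
        "SELECT id FROM customer WHERE customer_number IS NULL ORDER BY id"
    )
    return [r[0] for r in rows]


def duplicated_business_keys(conn):
    """Business key values that appear on more than one customer row."""
    rows = conn.execute(
        """SELECT customer_number FROM customer
           WHERE customer_number IS NOT NULL
           GROUP BY customer_number HAVING COUNT(*) > 1
           ORDER BY customer_number"""
    )
    return [r[0] for r in rows]


def null_foreign_keys(conn):
    """Ids of orders that do not reference any customer."""
    rows = conn.execute(
        "SELECT id FROM customer_order WHERE customer_id IS NULL ORDER BY id"
    )
    return [r[0] for r in rows]


def orphaned_orders(conn):
    """Ids of orders that reference a customer row that does not exist."""
    rows = conn.execute(
        """SELECT o.id FROM customer_order o
           LEFT JOIN customer c ON c.id = o.customer_id
           WHERE o.customer_id IS NOT NULL AND c.id IS NULL
           ORDER BY o.id"""
    )
    return [r[0] for r in rows]


report = {
    "null business keys": null_business_keys(conn),
    "duplicated business keys": duplicated_business_keys(conn),
    "null foreign keys": null_foreign_keys(conn),
    "orphaned orders": orphaned_orders(conn),
}
print(report)
```

Checks like these only reveal that debt exists. Deciding whether a given null or duplicate is actually wrong still requires knowledge of the domain.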

5. Strategies for Avoiding Data Debt

There are several explicit opportunities throughout the agile/lean lifecycle that enable you to avoid technical debt. They are:

  1. Initial conceptual modeling. Early in your initiative you will identify the scope of what you intend to accomplish. Part of this effort is exploring the domain and modeling, at a high level, the data that is used by a solution. This gives the solution delivery team a better understanding of their data-oriented requirements, which helps them to avoid data debt. This work occurs during “Sprint 0” on agile teams, Initiation on Lean teams.
  2. Initial architectural modeling. Early in your initiative you should invest the time to think through your architectural strategy. Your business architecture exploration should focus on the data architecture aspects of a solution, enabling the delivery team to work through which data sources their solution will work with. This work occurs during “Sprint 0” on agile teams, Initiation on Lean teams, putting a team in a position to avoid data debt by thinking through their architecture before they begin Construction.
  3. Continuous modeling. As you are modeling and mapping data in detail throughout Construction you can choose to follow clean database design strategies and adopt test-driven development (TDD) approaches that are applicable to all aspects of your solution design, including data. Thinking through your design, even on a just-in-time (JIT) basis, will reduce the chance of injecting new technical debt into your data sources.
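
To make the TDD point concrete, here is a minimal sketch of what test-first database design might look like, using Python's standard sqlite3 module. The customer table, its customer_number business key, and the helper names are assumptions made for illustration. The idea is that the two tests are written first, fail against a schema without constraints, and pass once the NOT NULL and UNIQUE constraints are added.

```python
import sqlite3

# Hypothetical design under test: the DDL a team might arrive at only after
# the tests below were written and seen to fail against a weaker schema.
SCHEMA = """
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY,
        customer_number TEXT NOT NULL UNIQUE
    );
"""


def new_database():
    """Create a fresh in-memory database with the schema under test."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def rejects(conn, sql, params=()):
    """Return True if the database refuses the statement on integrity grounds."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False


def test_business_key_is_required():
    conn = new_database()
    assert rejects(conn, "INSERT INTO customer (customer_number) VALUES (NULL)")


def test_business_key_is_unique():
    conn = new_database()
    conn.execute("INSERT INTO customer (customer_number) VALUES (?)", ("C-100",))
    assert rejects(
        conn, "INSERT INTO customer (customer_number) VALUES (?)", ("C-100",)
    )


if __name__ == "__main__":
    test_business_key_is_required()
    test_business_key_is_unique()
    print("all schema tests passed")
```

Tests of this kind prevent two of the data quality issues from Table 1, null and duplicated business key values, from being injected in the first place.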

6. Strategies for Removing Data Debt

There are several aspects of technical debt, and several strategies available to you to remove each one. The opportunities for removing data debt include:

  1. Improving data source implementation. When existing data sources have quality problems there are several options for fixing them, from safely refactoring them to the usually riskier option of rewriting them.
  2. Improving deliverable documentation. As you learned earlier, the quality of your documentation, in particular documentation describing data sources and how to work with them, is an important aspect of your overall quality. There are several strategies for improving deliverable documentation.
  3. Improving data source format. The consistency of your naming conventions, field formats, data values, and other formatting aspects can also be improved, thereby addressing data technical debt.
  4. Reusing existing data sources. Greater reuse motivates investment in quality, and the production of high-quality assets motivates greater reuse of those assets. Do you want people to use existing sources of record? Then ensure those sources of record are high-quality and easy to work with.
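
As an illustration of improving data source format, the following sketch rewrites inconsistently formatted values in a column into one common format. It is an example under stated assumptions rather than a recommended implementation: the customer.phone column, the sample values, and the rule that a valid value is a 10-digit North American number with an optional leading 1 are all hypothetical.

```python
import re
import sqlite3


def normalize_phone(raw):
    """Return a 10-digit string, or None if the value cannot be normalized.

    Assumption for this sketch: North American numbers, optional leading 1.
    """
    if raw is None:
        return None
    digits = re.sub(r"\D", "", raw)  # drop everything that is not a digit
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None


def standardize_phone_column(conn):
    """Rewrite customer.phone into one format; return ids left unchanged."""
    unresolved = []
    rows = conn.execute(
        "SELECT id, phone FROM customer WHERE phone IS NOT NULL ORDER BY id"
    ).fetchall()
    with conn:  # one transaction: commit on success, roll back on error
        for row_id, phone in rows:
            cleaned = normalize_phone(phone)
            if cleaned is None:
                unresolved.append(row_id)  # left as-is for a person to review
            elif cleaned != phone:
                conn.execute(
                    "UPDATE customer SET phone = ? WHERE id = ?", (cleaned, row_id)
                )
    return unresolved


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, phone TEXT)")
conn.executemany(
    "INSERT INTO customer (id, phone) VALUES (?, ?)",
    [
        (1, "(416) 555-0134"),
        (2, "416.555.0199"),
        (3, "+1 416 555 0101"),
        (4, "call front desk"),
    ],
)
unresolved = standardize_phone_column(conn)
print(unresolved)  # [4]
```

Values that cannot be normalized are reported rather than overwritten, because guessing at a correction would replace one kind of data debt with another.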

7. Strategies for Accepting Data Debt

One strategy is to accept technical debt. The team makes a conscious decision to not avoid/remove technical debt at the current time which, as you can see in the technical debt quadrant of Figure 1, is a valid option. This is a decision that should be led by the architecture owner and confirmed by the product owner.

Figure 1. Martin Fowler’s technical debt quadrant, modified for data debt.

  • Deliberate and reckless. “The data architects are too difficult to work with.” “We don’t have time to understand existing data sources.”
  • Deliberate and prudent. “We must ship now and deal with the consequences later.”
  • Inadvertent and reckless. “What is data normalization?” “What is a ‘source of record’?” “We have data conventions?”
  • Inadvertent and prudent. “Now we know how we should have designed that data source.”
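
If you do accept data debt, it can help to record the decision so that it stays a conscious one. The following is a hypothetical sketch of such a record, classified using the two axes of the quadrant. The class names, fields, and the idea of a revisit date are assumptions made for illustration and are not part of the method described here.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Intent(Enum):
    DELIBERATE = "deliberate"
    INADVERTENT = "inadvertent"


class Care(Enum):
    RECKLESS = "reckless"
    PRUDENT = "prudent"


@dataclass(frozen=True)
class AcceptedDebt:
    """One accepted data debt item (hypothetical structure)."""
    description: str
    intent: Intent
    care: Care
    led_by: str        # e.g. the architecture owner
    confirmed_by: str  # e.g. the product owner
    revisit_on: date   # assumption: accepted debt is reviewed again later

    @property
    def quadrant(self):
        """Name of the quadrant cell this item falls into."""
        return f"{self.intent.value}/{self.care.value}"

    def is_due(self, today):
        """True once the revisit date has been reached."""
        return today >= self.revisit_on


item = AcceptedDebt(
    description="customer.phone holds values in several formats",
    intent=Intent.DELIBERATE,
    care=Care.PRUDENT,
    led_by="architecture owner",
    confirmed_by="product owner",
    revisit_on=date(2030, 1, 1),
)
print(item.quadrant)  # deliberate/prudent
```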

8. Related Resources


Recommended Reading

Choose Your WoW! 2nd Edition

I also maintain an agile database books page which overviews many books you will find interesting.