Data Quality: An Overview of DQ Techniques
Work in progress
There are many data quality techniques available to you. More accurately, there are many techniques that may lead to improved data quality when they are followed properly. I say this because as you will soon see, some of the techniques included in this article aren’t primarily focused on quality, yet have secondary side effects of improved data quality. The goal of this article is to overview the data quality techniques that I have assessed for their effectiveness in practice.
Data Quality Techniques That Have Been Assessed
There are many potential data quality techniques for you to adopt:
- Data architecture. TBD
- Data cleansing (at point of use). TBD
- Data cleansing (at source). TBD
- Data cleansing (via AI). TBD
- Data contracts. TBD
- Data governance (lean/non-invasive). It is possible to take a lean/agile approach to data governance.
- Data governance (traditional). A command-and-control approach to data governance.
- Data guidance. Standards and guidelines to be followed when developing data-oriented assets (such as databases, data files, business intelligence (BI) outputs, and so on. Data guidance includes naming conventions, data security/privacy conventions, data retention guidelines, data residency guidelines, and so on. Data guidance is often driven by regulatory compliance requirements.
- Data labeling. An artificial intelligence (AI) strategy where the information contents represented by data is indicated. For example, a picture of a cat would be given the label “cat”.
- Data masking. The hiding (masking) of sensitive data, particularly personally identifiable information (PII).
- Data stewards. An oversight/governance role that is responsible for promoting the quality and fitness for purpose of organizational data. This role is often staffed by mid-level business decision makers.
- Database refactoring. A process by which an existing database schema is evolved in a safe and effective manner through the application of database refactorings. A database refactoring is a simple change to a database schema, such as renaming a table or splitting a column, which improves the quality of the design without changing the semantics (in a practical manner). There are 60+ proven database refactorings.
- Database regression testing (automated).
- Database regression testing (manual). Database testing is the validation of functionality implemented within, and the data values contained within, a database. Database regression testing is database testing done in a regular manner throughout the system development lifecycle, often as part of your continuous integration efforts.
- Defect logging. Regarding data, this is the recording of potential quality problems with the data. Defects are logged so that the owners of the data source have a record of the quality problem so that it may be addressed.
- Executable business rules. Business logic that is invoked to manipulate, and even cleanse, information. Business rules should be highly cohesive, performing one task such as performing a calculation, and loosely coupled to one another.
- Logical data modeling (LDM). LDMs are used to explore the domain concepts, and their relationships, of your problem domain. This could be done for the scope of a single initiative or for your entire enterprise. LDMs depict the logical entity types, typically referred to simply as entity types, the data attributes describing those entities, and the relationships between the entities.
- Master data management (MDM). The primary goals of Master Data Management (MDM) are to promote a shared foundation of common data definitions within your organization, to reduce data inconsistency within your organization, and to improve overall return on your IT investment. MDM, when it is done effectively, is an important supporting activity for enterprise architecture, for enterprise architecture in general, for business intelligence (BI) efforts, and for software development teams in general. Agile master data management is an evolutionary and highly collaborative approach to MDM.
- Physical data modeling (PDM). PDMs are used to design the internal schema of a database, depicting the data tables, the data columns of those tables, and the relationships between the tables. PDMs often prove to be useful on both Agile and traditional teams.
- Review (implementation). A review or walkthrough of your implementation – including source code, data schema, and testing code – by qualified people who were not directly involved with the work being reviewed.
- Review (model). A model review, also called a model walkthrough or a model inspection, is a validation technique in which your modeling efforts are examined critically by a group of your peers. The basic idea is that a group of qualified people, often both technical staff and stakeholders, get together to evaluate a model or document.
- Static schema analysis. An automated evaluation of your schema, including both structural (e.g. tables) and behavioral (e.g. stored procedures), to identify potential quality issues.
- Synthetic training data. An AI strategy where artificially generate data is added to an existing data store to include values that are statistically underrepresented in the original data source. This is an artificial intelligence (AI) strategy to remove bias in machine learning (ML) models.
- Transformation (T in ETL). Data is extracted (read) from a legacy data source, transformed so as to improve it’s quality, and then loaded into another data source.
Other Data Quality Techniques
Other data quality strategies that I may decide to rate at some point in the future include:
- Agile architecture envisioning. Initial, high-level modeling performed at the beginning of an initiative (or programme, or enterprise architecture effort) to identify a viable technical direction for the effort. The goal is to do just enough modeling to drive to the technical vision/strategy, not to create extensive models or detailed documentation. Typically performed in parallel to agile requirements envisioning.
- Agile data modeling. An evolutionary (iterative and incremental) and highly collaborative approach to modeling data. Agile data modeling is the act of exploring data-oriented structures in an iterative, incremental, and highly collaborative manner. Your data assets should be modeled, via an Agile Model Driven Development (AMDD) approach, along with all other aspects of what you are developing.
- Agile enterprise architecture. An evolutionary and highly collaborative approach to modeling, documenting, communicating, and governing architecture for an organization.
- Continuous database integration. Continuous integration is a development practice where developers integrate their work frequently, at least daily, where the integration is verified by an automated build. The build includes regression testing and possibly static analysis of the code. Continuous database integration is the act of performing continuous integration on your database assets. Database builds may include the creation of the database schema from scratch, something that you would only do for development and test databases, as well as database regression testing and potential static analysis of the database contents.
- Enterprise data modeling (EDM). Data modeling performed at a cross-system, enterprise/organization-wide level.
- Traditional architecture modeling.
- Traditional enterprise architecture.
- Non-solo work. An approach to working where two or more people actively collaborate to fulfill a task. Agile examples include the practices of pair programming and modeling with others.
Assessing The Effectiveness of Data Quality Techniques
In How to Assess Data Quality Techniques I describe how to apply the comparison factors captured in Figure 1. In How to Choose the Right DQ techniques I share 10 charts like the one that you see in Figure 2 and how to apply them to choose the best fit DQ techniques for the situation that you face. What I don’t do in either of those articles is describe the techniques that I’ve rated, hence this article.
Figure 1. Factors for comparing data quality techniques.
Figure 2. Comparing data quality techniques on effect on source and benefit realization.
- The Agile Database Techniques Stack
- Clean Database Design
- Configuration management
- Comparing Data Quality Techniques
- Continuous Database Integration (CDI)
- Data Quality in an Agile World
- Data Quality: How to Assess DQ Techniques
- Data Quality: Choosing the Right DQ Techniques
- Data Technical Debt
- Database refactoring
- Database testing
- Introduction to DataOps: Bringing Databases Into DevOps
- Test-Driven Development (TDD)
This book, Choose Your WoW! A Disciplined Agile Approach to Optimizing Your Way of Working (WoW) – Second Edition, is an indispensable guide for agile coaches and practitioners. It overviews key aspects of the Disciplined Agile® (DA™) tool kit. Hundreds of organizations around the world have already benefited from DA, which is the only comprehensive tool kit available for guidance on building high-performance agile teams and optimizing your WoW. As a hybrid of the leading agile, lean, and traditional approaches, DA provides hundreds of strategies to help you make better decisions within your agile teams, balancing self-organization with the realities and constraints of your unique enterprise context.
I also maintain an agile database books page which overviews many books you will find interesting.