The Agile Data (AD) Method

Data Quality (DQ) Metrics: Strategies for Measuring Data Quality

High-quality data meets or exceeds the quality criteria of its consumers. But how do you know whether you have such data? Without a data quality metrics strategy in place, you don’t really know.

This article addresses the following topics pertaining to data quality metrics:

  1. Why measure data quality?
  2. Potential data quality metrics
  3. Potential challenges with metrics
  4. Strategies for effective data quality metrics
  5. Parting thoughts

 

Why Measure Data Quality?

There are two fundamental reasons to measure data quality:

  1. You want to improve data quality. Deciding to measure something focuses attention on it, indicating where your data quality problems lie and hopefully motivating you to address them. The implication is that an essential step in any measurement program is to determine what you want to improve, then identify what you need to measure to achieve it. Many standard metrics strategies – including key performance indicators (KPIs), objectives and key results (OKRs), and balanced scorecards – are based on this premise.
  2. Metrics enable effective data governance. Accurate and timely measures give senior leadership the insight they need to govern your enterprise data effectively.

 

Potential Data Quality Metrics

Table 1 lists data quality factors, the typical targets affected by each factor, and potential metrics for each. This is not meant to be an exhaustive list; the goal is to show that meaningful metrics can be captured for every data quality factor. Because terminology differs across data technologies, I use the following terms:

  • Attribute. This can refer to a column in a relational table, a key in a JSON structure, or an attribute in an XML structure.
  • Instance. This can refer to a row in a table, a JSON object, or an XML object.
  • Source. This can refer to a database, a data file, a service, or a data stream.
  • Structure. This can refer to a table, an XML structure, a JSON structure, …
  • Value. This can refer to the value of a specific column in a table row, the value of a key-value pair in a JSON object, or the value of an attribute-value pair in an XML object.
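To make the mapping concrete, here is a minimal sketch (the field names and values are hypothetical) showing the same instance expressed as a relational row and as a JSON object:

```python
import json

# The same "instance" (one customer record) in two representations.
# Relational: the instance is a row; customer_id, name, and city are
# attributes; "C-1001" is a value.
row = ("C-1001", "Ashok", "Toronto")

# JSON: the instance is an object; the keys are attributes and each
# key-value pair supplies a value.
instance = {"customer_id": "C-1001", "name": "Ashok", "city": "Toronto"}

print(json.dumps(instance, indent=2))
```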

Table 1. Potential metrics for each data quality factor.

Data Quality Factor / Targets / Potential Metric(s)
Accessibility/privacy. Do people have access to the data that they should, and no more? Targets: Attribute, Source, Structure
  • Number of incidents by type. Count the number of valid requests for access (e.g. those logged by your help desk), broken down by type.
Accuracy. Can you confirm that the data value represents the concept that it is meant to? Targets: Attribute, Structure
  • Default value rate. Calculated as Number of Values With Default Value/Total Number of Instances.
  • Error rate. To determine if a value is correct, verify it against a trusted source, apply a business rule, or compare values across multiple data sources. Error rate is calculated as the Number of Incorrect Values/Total Number of Values Tested.
  • Outlier rate. A value is considered an outlier when it falls outside of N standard deviations, where you determine N. The outlier rate is calculated as the Number of Outliers/Total Number of Values Tested.
Completeness. Are all the necessary data attributes present? Targets: Attribute, Structure
  • Missing attribute count. A count of the number of attributes in a data structure compared to what you expect there to be, e.g. there are 17 columns in this table, but we expect 19.
  • Attribute existence count. You have a list of attributes that you expect to exist in a structure; are they all there? Are there extra attributes?  Do you care?  Note that an additional attribute may indicate a security risk.
  • Null count. The number of null/empty fields where that is not allowed.
Conformity. Is the format or type of the values of a data attribute consistent? Targets: Attribute, Structure, Value
  • Conformance rate. This is the Number of Conformant Values/Total Tested.  For an individual value, apply a business rule to determine whether it conforms.
Consistency. How well does the data align with other representations of it? Targets: Attribute, Structure
  • Consistency rate. This is the Number of Consistent Values/Total Tested.  For an individual value, compare it to other representations. These other representations may result from a transformation, a business calculation, or the output of a different system.
Integrity. Are the relationships between this data and other data correct? Targets: Attribute, Structure, Source
  • Null foreign key count. A count of the number of instances of a data structure where a foreign key attribute(s) contains a null value. Can be calculated at the attribute level or the structure level (many structures have multiple attributes that represent foreign keys).
  • Number of orphans. A count of the number of instances within a data structure that are not referenced by any other instances elsewhere.
  • Referential integrity (RI) violation rate. Within a data source, this is calculated as Total Number of Orphans/Total Number of Instances.
Precision. How close is the data value to the real-world value? Has it been rounded, for example? Targets: Value
  • Conformance rate. See above.
  • Significant digits/decimal places. Is a value being stored at the required number of digits? e.g. 3.14159 vs 3.14
  • Standard deviation of measurements. Are the values of an attribute distributed as you’d expect?
  • Time precision rate. Are datetime stamp values captured to the required level? For example, do you require millisecond-level precision but are only getting second-level precision?
Reasonability. Do the data values make sense given the context? Targets: Attribute, Value
  • Outlier rate. See above.
  • Range check. Are the values in the range that you expect? For example, the expected temperature range in Toronto, Canada is -20C to +35C, yet you have a measurement of 47C.
  • Z-score. The z-score for a value is the number of standard deviations it is away from the mean for the attribute.  For example, in July the temperature mean is 22.5C with a standard deviation of 1.9C, so a temperature of 47C in Toronto is (47-22.5)/1.9 = 12.9.
Relevance. Is the data what you need and no more? Targets: Source, Structure
  • Dark data rate. Dark data is data that is never used. The dark data rate is calculated as Number of Dark Instances/Total Instances.
  • Data usage statistics. The percentage of instances accessed within a structure or source, often calculated for a given period of time (e.g. this quarter, this month, …).
  • Timeliness. See the timeliness DQ factor below.
Reliability. Can you trust the data source? Targets: Source, Structure
  • Combined metrics. The other metrics that apply to data sources and data structures are typically combined to provide insight into reliability.
Timeliness. Is the data sufficiently current? Targets: Instance, Source, Structure
  • Data freshness. How long has it been since a structure (or source) was last updated?
  • Time-to-value (TTV). How long is it between the values of an instance being collected and being used?
  • TTV rate. The average (or distribution) of TTV values for the instances within a structure or source.
Understandability. Does the data make sense? Is the format/representation clear? Targets: Source, Structure, Value
  • Data lineage coverage rate. You have data lineage coverage for a value when you know where it comes from and how it was calculated (if appropriate). The data lineage coverage rate is calculated as the Number of Values With Data Lineage Coverage/Total Tested.
  • Metadata availability rate. For a given structure, do you have the taxonomy metadata about its attributes, the ontology metadata describing the relationships it is involved with, and other appropriate structure metadata?
  • Schema consistency rate. Do the attributes and structures within a source follow a common naming convention? Ideally, do they follow your enterprise naming conventions? The schema consistency rate is calculated as Number of Consistent Elements/Total Elements Considered where an element may be an attribute or complete structure.
Uniqueness. Is the data recorded once and once only? Targets: Attribute, Source, Structure
  • Distinct value count. For an attribute, this is the count of distinct values stored within it.  For a structure, this is the count of distinct instances stored within it. For a source, this is the number of distinct structures stored in it (for example, a relational database may have more than one copy of a reference table).
  • Duplicate record rate. For a structure, this is calculated as the Number of Duplicate Instances/Total Instances In The Structure.
  • Primary key violation rate. An instance is said to have a primary key (PK) violation when its identifying attribute(s) have the same value as those of another instance within the structure, or has a null/empty value in one or more identifying attributes. This is calculated as the Number of Instances With a PK Violation/Total Instances In The Structure.
Validity. How close is the data value to what it is expected to be? Targets: Value
  • Cross-field validation rate. The idea is that you apply business rules that use values from other instances, often in other structures, to determine whether a given value is correct. This metric is calculated as the Number of Validated Values/Number of Values Tested.
  • Default value rate. See above.
  • Format check. Apply the conformance rate above using format-oriented business rules.
  • Outlier rate. See above.
  • Range check. See above.
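As a rough sketch of how several of these metrics might be computed, the following works through one hypothetical attribute (temperature readings, with None marking a missing value). The data, the choice of 0.0 as the system default, and the outlier threshold N are all assumptions for illustration:

```python
from statistics import mean, stdev

# Hypothetical readings for one attribute; None marks a missing value.
values = [21.5, 22.0, None, 23.1, 22.4, 47.0, 22.0, 21.8, 0.0, 22.9]
DEFAULT = 0.0   # assumed system default value
N = 2           # outlier threshold, in standard deviations

total = len(values)

# Completeness: null count.
null_count = sum(1 for v in values if v is None)

# Accuracy: default value rate = values equal to the default / total instances.
default_rate = sum(1 for v in values if v == DEFAULT) / total

# Accuracy/reasonability: outlier rate = values beyond N std devs / values tested.
present = [v for v in values if v is not None]
mu, sigma = mean(present), stdev(present)
outlier_rate = sum(1 for v in present if abs(v - mu) > N * sigma) / len(present)

# Uniqueness: distinct value count for the attribute.
distinct_count = len(set(present))

print(f"null count:     {null_count}")
print(f"default rate:   {default_rate:.2f}")
print(f"outlier rate:   {outlier_rate:.2f}")
print(f"distinct count: {distinct_count}")
```

Note that with this tiny, skewed sample the 47.0 reading is flagged as an outlier but the 0.0 default is not; in practice you would tune N and combine metrics (as the table suggests) rather than rely on any single calculation.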

 

Potential Challenges with Data Quality Metrics

There are several issues/challenges to consider regarding metrics, including data quality metrics:

  1. Every metric is contextual. Any given metric applies to some situations but not others, and it reveals different insights depending on the situation.  Context counts.
  2. You get what you measure. When people know how they are being measured, or at least they believe they know how they are being measured, that will motivate them to do whatever they can to make themselves look good based on that metric. For example, you ask a team to calculate the dark data rate for a table in an info mart. This motivates them to focus on supporting question stories that reflect high-usage requests, thereby increasing the chance that a given data instance is accessed, while deprioritizing stories requiring data only for annual reports.
  3. Sometimes you get what you don’t want. The law of unintended consequences states that actions, such as measurement, in a complex system will have unforeseen and often unintended consequences. For example, you ask a team to collect data timeliness metrics. This motivates them to accept updates from external sources, for example social media data about brand sentiment, more often than is actually needed. This improves the timeliness of that data, improving their metrics, at the expense of increasing the cost of obtaining that data.
  4. Don’t manage to the metrics. Metrics should inform your decisions, not determine them.

 

Strategies for Effective Data Quality Metrics

The general advice for how to succeed with metrics is directly applicable to data quality metrics. In my experience, the following strategies prove effective in practice:

  1. Trends are better than scalars. Knowing that the current quality level is 87%, which is a scalar value, is interesting but not particularly useful in practice. Knowing that the current level of quality has been rising steadily over the last three months to 87%, has remained flat over the previous three months at 87%, or has been falling for the last three months to 87% is much more valuable. For example, in Figure 1 we see that the current outlier rate for a data source is 0.19%, which for this organization is very low. We’re clearly doing well. But we also see that the outlier rate has been trending upwards over the last few months, so we’re not doing so well after all.
  2. Distributions are better than scalars. Knowing how data is distributed, perhaps as a statistical distribution of values or as a distribution by category, can provide valuable insight. For example, in Figure 1 the right-most chart shows us how the outlier rate is distributed across the primary tables in this database. We see that the Loan table, followed by the Policy table, seems to be the most problematic. This is valuable information that would allow us to focus our data repair efforts (if appropriate).
  3. Never rely on a single metric. It is the combination of metrics, each providing different insights, that enables you to determine how well you are meeting an objective.
  4. Prefer automated over manual measures. Once the initial investment in tooling has been made, automated measures are less expensive to gather, less likely to have been gamed, and guaranteed to be tracked consistently.
  5. Metrics must be timely. Would you be happy to discover that the Outlier rate in Figure 1 was calculated four months ago, before you had a push to clean up your data?  Probably not. But what if it was calculated an hour ago?
  6. Compete against yourself. Comparing people, teams, or even divisions based on metrics is a recipe for dysfunction. In competitive situations, people will do everything they can to game the metrics in their favor, rather than focusing on the behaviors you were hoping to engender via your metrics strategy.
  7. Metrics must provide positive value.  The benefit derived from the metric must exceed the cost of measuring it.
  8. Prefer leading over trailing metrics. A trailing metric helps you to understand the past, and a leading metric allows you to predict the future.
  9. Whether a metric is trailing or leading depends on context. For example, defect trend is a trailing metric when used to describe the quality of your work. It is a leading indicator when used to determine the likelihood of releasing on schedule (a long-term trend of a growing backlog of defects indicates you’re likely to run out of time to fix the problems).
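The "trends are better than scalars" strategy can be sketched in a few lines. The monthly figures below are hypothetical (they echo, but are not taken from, Figure 1), and the trend signal is deliberately simple; a real dashboard would use a moving average or regression:

```python
# Hypothetical monthly outlier rates (%) for one data source, oldest first.
monthly_outlier_rate = [0.11, 0.13, 0.15, 0.17, 0.19]

current = monthly_outlier_rate[-1]

# Simple trend signal: compare the latest value to the average of the
# preceding months.
earlier = monthly_outlier_rate[:-1]
baseline = sum(earlier) / len(earlier)
if current > baseline:
    direction = "rising"
elif current < baseline:
    direction = "falling"
else:
    direction = "flat"

print(f"current outlier rate: {current:.2f}% ({direction})")
```

The scalar alone (0.19%, very low) looks reassuring; the trend (rising month over month) tells the more useful story.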

Figure 1. Communicating a metric as a scalar, as a trend, and as a distribution.


 

Parting Thoughts

I believe that data, in particular enterprise data, should be treated as an asset. The implication is that your organization should actively strive to ensure that your data is of high quality. This includes having a pragmatic enterprise metadata program, adopting modern data quality practices, and adopting a pragmatic data quality metrics strategy.

 

Source

Some material in the article was adapted from my book Not Just Data: How to Deliver Continuous Enterprise Data.

 
