Clean Data Architecture: Architectural Concerns

by Scott W. Ambler

Data architecture addresses the data aspects of a system or enterprise. A clean data architecture is one that is easy to understand, to implement, and to evolve. To create and maintain a clean data architecture, you must make trade-offs among a collection of architectural concerns such as coupling, cohesion, security, scalability, resiliency, and many others. This article addresses the following topics:

1. Why is Clean Data Architecture Important?

There are several reasons why clean data architecture is critical to your success. First, the environment in which we work is constantly changing, and the rate of change is increasing. Clean data architectures enable us to better address and support these changes than architectures that are difficult to understand and work with. Second, the problems that we are taking on are complex and growing in complexity over time. Clean data architectures are better suited to deal with that complexity than those that are not. Third, we must be able to properly govern our data and data solutions. Clean data architectures better enable us to do so because they are easier to understand, and thus easier to monitor and support.

2. Architectural Concerns for Clean Data Architecture

For our architecture to be clean, what architectural concerns should it potentially address? These concerns, which could be thought of as quality of service (QoS) requirement categories, are overviewed in Figure 1 and explored in greater detail below.

Figure 1. Data architecture concerns.

There are two types of data architecture concerns:

Strategic concerns. Strategic concerns focus on aspects that enable the long-term viability of your data architecture.
Tactical concerns. Tactical concerns focus on aspects that enable effective implementation of your data architecture.

2.1 Strategic Data Architecture Concerns

Strategic concerns focus on aspects that enable the long-term viability of your data architecture. These concerns, and how they enable one another, are depicted in Figure 2.

Figure 2. Strategic concerns for clean data architecture.

These strategic concerns are:

Architectural Concern: Cohesion

Cohesion is a measure of the degree to which the elements of something are functionally related. It is the degree to which all elements within the item are directed towards fulfilling a single purpose. Basically, cohesion is the internal glue that keeps an item together. Table 1 overviews examples of high cohesion, which are good, as well as examples of low cohesion, which are bad. A clean architecture is built from items that are highly cohesive.

Table 1. Examples of cohesive data items.

Item	High Cohesion (Good) Example	Low Cohesion (Bad) Example
Column	Employee.StartDate – Stores the start date of an employee and nothing else.	Person.FirstDate – Stores the start date if they are an employee, the first date they made a purchase with us if they’re a customer, the first time we contacted them if they’re a potential hire, or the first time they wrote about us if they are a media contact.
Table	Employee – A table in 3rd normal form that stores only data about employees.	CompleteEmployee – A table in first normal form that stores employee information, address information, skills information, and more.
Data service	SaveEmployee(data) – A function that saves employee data to its source of record.	Employee(action, data) – A function that reads, writes, or deletes employee records depending on the action indicator.
Data domain	Employee – A logical domain that encompasses the entity types and their relationships for capturing employee information.	Reporting – A logical domain that encompasses the entity types and their relationships that appear in your data warehouse.
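
To make the data service row of Table 1 concrete, here is a minimal Python sketch. It is an illustration under assumptions rather than a definitive implementation: the in-memory `_employees` store and the names `save_employee`, `read_employee`, and `employee` are hypothetical stand-ins for a real source of record and its services.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the employee source of record.
_employees: dict[int, dict] = {}


def save_employee(employee_id: int, data: dict) -> None:
    """High cohesion: saves employee data and does nothing else."""
    _employees[employee_id] = dict(data)


def read_employee(employee_id: int) -> Optional[dict]:
    """High cohesion: reads employee data and does nothing else."""
    return _employees.get(employee_id)


def employee(action: str, employee_id: int, data: Optional[dict] = None) -> Optional[dict]:
    """Low cohesion: one function whose behavior depends on an action flag."""
    if action == "read":
        return _employees.get(employee_id)
    if action == "write":
        _employees[employee_id] = dict(data or {})
        return None
    if action == "delete":
        _employees.pop(employee_id, None)
        return None
    raise ValueError(f"unknown action: {action}")
```

A caller of `save_employee` can tell what it does from its name alone, whereas every caller of `employee` must know the set of valid action strings and what each one returns.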

Architectural Concern: Consistency

Consistency is a measure of the conformity of something, typically the level of conformity which is necessary for the sake of logic, accuracy, or fairness. In data architecture, consistency is an issue with:

Data values. Table 2 summarizes common levels of data consistency.
Naming conventions.
Security access.
Infrastructure. Infrastructure includes items such as server hardware and software configurations.

Table 2. Three levels of data consistency.

Level of Consistency	Implementation
Exact	Data within a data source is completely consistent at all times. Within DBMSs this is typically implemented through ACID (Atomic, Consistent, Isolated, Durable) transactions. ACID transactions are guaranteed to either succeed completely or to fail completely. For example, when $5 is transferred between two accounts as an ACID transaction, only one of two things will happen: $5 is debited from the source account and credited in the target account -OR- neither account is updated. It cannot be the case that the $5 is debited but not credited or vice versa.
Eventually consistent	With this strategy there may be a period of time where the data is inconsistent. For example, when $5 is transferred between two accounts it might first be debited from the source account and then later credited in the target account. Or, the $5 is first credited in the target account and then at some later point debited from the source account.
Good enough	Some discrepancies in the data will be tolerated. For example, with the $5 transfer it may be ok that the target account is credited but the source account doesn’t get debited due to a glitch in our logic. It’s only $5 after all, it’s not worth our effort and nobody’s complaining.
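
The exact level in Table 2 can be sketched with Python's standard `sqlite3` module. This is a minimal illustration, not a production design: the `account` table, its `CHECK` constraint, and the `transfer` function are assumptions made for the example. The credit is deliberately applied before the debit so that a failed debit shows the earlier credit being rolled back.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, source: str, target: str, amount: int) -> bool:
    """Move `amount` between accounts atomically; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft, so the credit was undone too.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("source", 5), ("target", 0)])
conn.commit()
```

With $5 in the source account, a first call to `transfer(conn, "source", "target", 5)` succeeds. A second identical call fails on the debit, and the rollback leaves the balances exactly as they were after the first call.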

Architectural Concern: Coupling

Coupling is the measure of the degree of interdependence between two or more items. There will always be coupling within your architecture; the goal is to keep it low. We say that items are loosely coupled when there are few dependencies and minimal interaction between them. Table 3 summarizes several types of architectural coupling.

Table 3. Types of potential coupling in data architectures.

Coupling Type	Loose Coupling (Good) Example	High Coupling (Bad) Example
Data to data (relational). Data entities have relationships between them.	The EmployeeTeam relationship table captures the many-to-many relationship between Employee and Team entities.	The Employee table has six foreign key columns to the Team table, one for each of the six different types of relationships that an employee may have with various teams within our organization.
Naming. When we apply naming conventions, which is an important quality strategy, we effectively couple the name of something to that convention and indirectly to the names of other things.	The Employee.StartDate column name conforms to our naming standard that all names should be descriptive and use camel case.	The Employee.EmployeeStartDate column name conforms to our naming convention to include the table name. This couples the column name to the table name, causing us a problem if we ever decide to rename the table.
Program to data. When a program works with data, that program and the source of the data are now coupled to one another.	Programs access data via a data API (sometimes called an API gateway or encapsulation layer) that implements CRUD (create, read, update, delete) functionality. Only the API logic is coupled to the data.	Programs access data directly from whatever data source is appropriate. The programs are coupled to the schemas of the data sources.
Program to program. Programs will interact with one another, and when they do they will share data and thereby become coupled.	Programs interact via transactions submitted to a message bus that routes the transactions and results accordingly. The programs are now coupled only to the API of the message bus.	Programs interact directly via point-to-point direct access, and are thereby coupled to one another via whatever individual APIs they offer.
Layer to layer. Behavior in one architectural layer will invoke behavior in another. As a result those architectural layers become coupled.	In a three-layer architecture the user interface only interacts with the business layer, the business layer in turn interacts with the data access layer, which then interacts with data sources. Only adjacent layers may interact with one another.	In an unlayered architecture the user interface directly accesses the database for data and the business layer for business functionality. The business layer will also directly access data sources.
Implementation to vendor. When we take advantage of specific vendor functionality then we couple our implementation to that offered by that vendor.	Program logic accesses vendor functionality via an API wrapper service.	Your systems directly invoke functionality that is unique to a given vendor. Better performance, but now you’re coupled directly to it in multiple places.
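
The program-to-data row of Table 3 can be sketched as follows. This is one possible shape for an encapsulation layer, assuming SQLite as the data source; the `EmployeeDataApi` class, its `employee` table, and the `greeting` program are hypothetical names chosen for the example.

```python
import sqlite3
from typing import Optional


class EmployeeDataApi:
    """Hypothetical data API: the only code that knows the employee schema."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS employee "
            "(id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
        )

    def create(self, name: str) -> int:
        with self._conn:
            cursor = self._conn.execute(
                "INSERT INTO employee (name) VALUES (?)", (name,)
            )
        return cursor.lastrowid

    def read(self, employee_id: int) -> Optional[str]:
        row = self._conn.execute(
            "SELECT name FROM employee WHERE id = ?", (employee_id,)
        ).fetchone()
        return row[0] if row else None

    def update(self, employee_id: int, name: str) -> None:
        with self._conn:
            self._conn.execute(
                "UPDATE employee SET name = ? WHERE id = ?", (name, employee_id)
            )

    def delete(self, employee_id: int) -> None:
        with self._conn:
            self._conn.execute("DELETE FROM employee WHERE id = ?", (employee_id,))


def greeting(api: EmployeeDataApi, employee_id: int) -> str:
    """A program that depends on the API, not on table or column names."""
    name = api.read(employee_id)
    return f"Hello, {name}!" if name else "Unknown employee"
```

Because `greeting` never sees SQL, renaming a column or moving the table to another data source would change `EmployeeDataApi` only.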

Architectural Concern: Domain Driven

Organize your architecture around core business concepts, rather than systems or technologies. This is a business-driven approach that aligns your architecture with organizational strategies. At the heart is a domain model depicting the high-level business entity types that are pertinent to your organization, and the relationships between them. Figure 3 depicts an enterprise domain model for a full-service financial institution, showing the primary entity types and the relationships between them.

Figure 3. An enterprise domain model for a financial institution (UML notation).

Domain is an overloaded term. In this context, I’m using domain to mean one of two things: high-level entity types that are critical to your organization -OR- related collections of such entity types and the relationships between them. At a large scale, domain will sometimes be used to refer to the industry that the organization operates in, such as Financial. At a small scale, domain will sometimes be used to represent the collection of values of an attribute, such as a list of the valid values of the days of the week.

These are important considerations about domain models:

Entity types should be cohesive. Each entity type in Figure 3, such as Account, may have dozens of entities and relationships between them. These details would be captured via a more detailed domain model.
Relationships indicate coupling.
Domains can drive implementation. The Account domain would potentially be implemented via a high-cohesion strategy, such as micro-services, domain APIs, or large-scale domain components.
Stakeholder collaboration is essential. To understand the domain you will need to collaborate closely with your stakeholders, ideally taking an active stakeholder participation approach. When domain modeling, your goal is to understand the data aspects of your environment and how that data is used.

Architectural Concern: Evolvability

Evolvability is the capacity to adapt an existing architecture. It is the capability of solutions to be evolved to continue providing value to your customers in a cost-effective manner.

In other words,

Evolvability = Extensibility + Refactorability

Where:

Extensibility is the capacity to extend or add to an existing architecture.
Refactorability is the capacity to improve the quality of an existing architecture.

There are three key sources of change that motivate the evolution of your architecture:

Environmental. These are changes motivated by customers, competitors, partners, and suppliers that your organization works with. Customers have new needs, competitors do things you need to address, the priorities and focus of your partners may change, and your suppliers have new or updated offerings that you may choose to leverage.
Organizational. The priorities and ways of working (WoW) within your organization will evolve over time.
Experiential. As the customers of your solutions work with them, they will realize what aspects they like and don’t like, and will often identify new functionality that they believe they desire.

Our aim should be to move away from rigid architectures that are difficult to adapt to changing circumstances to ones that are easy to adapt. A symptom of a data architecture that is hard to evolve is a “spaghetti architecture” where components are highly coupled to one another, with many data flows and connections between them.

Architectural Concern: Flow

Flow is the ability to go from one place to another consistently, in a steady stream, with large and potentially varying throughput. Your aim is for a continuous flow of data throughout your environment from the source of that data to where it is needed.

Continuous flow is critical for two reasons:

Data generation is 24/7. New and updated data is coming at you constantly and as a result you need to process it constantly.
Data usage is 24/7. Your stakeholders, both internal and external to your organization, often need to access and work with your data at any time of day.

A significant architectural implication of the desire to achieve continuous flow is that you cannot have batch jobs any more. Continuous flow requires you to accept data when it arrives and then get it to wherever it needs to go at that point in time.
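
The difference between batch and continuous flow can be sketched in a few lines of Python. This is a toy illustration under assumptions: `source` is a hypothetical data source that records each arrival in a log, and `deliver` stands in for whatever moves a record to where it is needed.

```python
from typing import Callable, Iterable, Iterator


def batch_flow(arrivals: Iterable[str], deliver: Callable[[str], None]) -> None:
    """Batch: hold every record until the input ends, then deliver them all."""
    held = list(arrivals)
    for record in held:
        deliver(record)


def continuous_flow(arrivals: Iterable[str], deliver: Callable[[str], None]) -> None:
    """Continuous: deliver each record as soon as it arrives."""
    for record in arrivals:
        deliver(record)


def source(log: list) -> Iterator[str]:
    """Hypothetical data source that logs each arrival before yielding it."""
    for record in ("a", "b"):
        log.append(f"arrived {record}")
        yield record
```

With `continuous_flow`, the log interleaves arrivals and deliveries, so record "a" is delivered before record "b" has even arrived. With `batch_flow`, nothing is delivered until every record has arrived.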

Architectural Concern: Resiliency

Resiliency within your data architecture is the capacity to recover from difficulties quickly, seamlessly, and ideally unbeknownst to end users. In effect, resiliency is the toughness of your system.

Strategies that support resilient data architecture include:

Resilient clouds. Cloud service providers have already addressed resiliency issues in their architecture, very likely far better than you ever will. Consider outsourcing your hardware infrastructure.
Server farms. Multi-blade server farms have the potential for greater resiliency, as compared with single-server strategies, resulting from fail-over features.
Automatic server recovery. Your infrastructure should be built to detect when a server goes down and automatically recover it as quickly as possible.
Automatic network recovery. Similarly, your network should also be built to automatically recover when it runs into difficulty.
Automatic rerouting. Data flow within your network should automatically reroute itself when a pathway goes down or hits capacity.
RAID storage. Build your data storage solution from RAID (redundant array of independent disks) technology.
ACID transactions. To ensure that your data is consistent, critical transactions should be ACID (atomic, consistent, isolated, and durable).
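
As one small illustration of the automatic rerouting strategy above, the following sketch tries a list of pathways in order and falls back when one is down. The `primary` and `secondary` routes are hypothetical, and a real network would handle this in its infrastructure rather than in application code.

```python
from typing import Callable, Sequence


def send_with_rerouting(payload: str, routes: Sequence[Callable[[str], str]]) -> str:
    """Try each route in order and return the first successful result."""
    last_error = None
    for route in routes:
        try:
            return route(payload)
        except ConnectionError as error:
            last_error = error  # this pathway is down, so try the next one
    raise ConnectionError("all routes failed") from last_error


def primary(payload: str) -> str:
    """Hypothetical pathway that simulates an outage."""
    raise ConnectionError("primary pathway is down")


def secondary(payload: str) -> str:
    """Hypothetical healthy pathway."""
    return f"delivered via secondary: {payload}"
```

The caller passes `[primary, secondary]` and receives a result without needing to know that the primary pathway failed, which is the sense in which recovery is invisible to end users.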

Architectural Concern: Scalability

Scalability is the measure of a system’s ability to vary its performance and cost in response to changes in demand. Scalability is often thought about as the need to increase in capacity as demand for it increases, but it is also the ability to shrink capacity as demand for it decreases. Regarding data, scalability enables your architecture to react to changing volumes and rates of incoming data as well as changing volumes of data usage.

Scalability is critical for two reasons:

Support for variable volume. Sometimes data trickles in and sometimes it comes in as a raging torrent. Sometimes your stakeholders rarely access your data and other times they want to work with huge amounts of it. In short, your architecture must be able to support the changing volume of data as it varies.
Support for increasing volume. Over time, the volume of incoming data and of requested data is likely to rise. This applies to minimum volume at any given time, average volume, and maximum volume. The implication is that your data architecture must scale in pace with the increasing volume.

Issues to consider when architecting for scalability:

Do you need to cluster?
Do you need to support multiple physical locations globally?
How will your data storage automatically reconfigure itself (think cloud-based storage or internal server farms)?
How will your network pathways reconfigure themselves as demand varies?

2.2 Tactical Data Architecture Concerns

Tactical concerns focus on aspects that enable effective implementation of your data architecture. These tactical concerns are:

Architectural Concern: Consumability

Consumability is the combination of three required aspects:

Functionality. Does the data required by your stakeholders exist, and is it available to them?
Usability. Is the data accessible by your stakeholders in a manner that is appropriate for them? Do they have tools that make the data easy to work with? Is the data presented to them in a manner that reflects their needs? Are data elements named and described in a manner understandable by them?
Desirability. Do stakeholders have the data, information, and tooling that they want to work with?

Architectural Concern: Deployment

Deployment is the mechanism through which applications, modules, updates, and patches are delivered by a development team into production to be available for use. The practices used by developers to build, test and deploy new functionality and information will impact how fast and how often they can deploy. Agile practices that enable effective deployment include automated regression testing, continuous integration (CI), and continuous deployment (CD).

Deployment practices, in turn, are constrained by the architecture of your system. To support agile ways of working, you may need to architect:

A versioning system for your data sources. Practices such as continuous database integration (CDI) require that data sources know their current version so that data schema changes can be applied appropriately.
Data source switchover. To support data evolution and updates, your data sources may need to be able to run in parallel so that one version can be updated while the other is operational. This enables fundamental hot switchover (blue/green) deployment strategies.
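
One way to let a data source know its current version is to store a version number alongside the schema. The sketch below assumes SQLite, whose `PRAGMA user_version` holds an integer that applications can use for this purpose. The `MIGRATIONS` list and the `migrate` function are hypothetical, and real CDI tooling will differ.

```python
import sqlite3

# Hypothetical ordered migrations: entry i upgrades the schema from version i to i + 1.
MIGRATIONS = [
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE employee ADD COLUMN start_date TEXT",
]


def current_version(conn: sqlite3.Connection) -> int:
    """Read the schema version recorded in the data source itself."""
    return conn.execute("PRAGMA user_version").fetchone()[0]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply only the migrations this data source has not yet seen."""
    for version in range(current_version(conn), len(MIGRATIONS)):
        conn.execute(MIGRATIONS[version])
        # PRAGMA values cannot be bound as parameters; the value is an int we computed.
        conn.execute(f"PRAGMA user_version = {version + 1}")
    conn.commit()
    return current_version(conn)
```

Running `migrate` against a new database applies both changes, and running it a second time applies nothing, which is what allows the same deployment script to run safely against data sources at different versions.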

Architectural Concern: Fit-for-purpose technology

Contrary to popular opinion in the clothing industry, one size does not fit all. There is a variety of data processing needs, including but not limited to high-volume transactions, large-volume data queries, complex data traversal, and data “blob” consumption. The implication is that your data architecture needs to support a wide range of needs, and therefore an adequate range of technologies. As a result you may need to adopt many different types of data technologies (such as relational, hierarchical, and NoSQL databases) offered by several vendors.

But, this doesn’t mean that it should be a technology free-for-all. Remember that you are constrained by your organizational ability to maintain and evolve these technologies over time. How many technologies can your organization reasonably work with over time? For a given technology, will you be able to attract and retain people with sufficient skills to work with it? Will the technology stand the test of time?

Architectural Concern: Integration

An integration is a connection between two components so that they can work together as a whole to share information and data. Integrations are built on APIs (application programming interfaces) and allow for the flow of information between components, connecting the hardware and software components together so everything can be used in unison.

In simple environments, those with a handful of components, direct integrations between components will get the job done. However, as the number of components grows so does the number of potential direct integrations. You very quickly find that you need to consider an indirect integration strategy, such as an information broker or information bus to integrate the various components within your ecosystem.
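
The growth described above can be made concrete with a little arithmetic. The sketch below assumes at most one integration per pair of components and one connection per component to a shared bus; the function names are hypothetical.

```python
def direct_integrations(components: int) -> int:
    """Maximum number of point-to-point links among `components` components."""
    return components * (components - 1) // 2


def bus_integrations(components: int) -> int:
    """Links needed when every component connects only to a shared bus."""
    return components


for n in (5, 20, 50):
    print(n, direct_integrations(n), bus_integrations(n))
```

Under these assumptions, 5 components need at most 10 direct integrations, 20 components need up to 190, and 50 need up to 1,225, whereas the bus strategy needs only 5, 20, and 50 connections respectively.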

Architectural Concern: Latency

Data latency refers to the time from an initial request to the response. There are three components of data latency:

Processing time. The time that it takes to “crunch” the data.
Queuing delay. The amount of time that data waits to be processed or transmitted.
Transmission time. The time it takes for data to travel from one endpoint to another.
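
The three components above can be captured in a small data structure. This sketch assumes the components occur one after another, so that total latency is their sum; the `Latency` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Latency:
    """The three latency components named above, in milliseconds."""

    processing_ms: float
    queuing_ms: float
    transmission_ms: float

    @property
    def total_ms(self) -> float:
        # Assumption: the components are sequential, so they add up.
        return self.processing_ms + self.queuing_ms + self.transmission_ms

    def largest_component(self) -> str:
        """Name the component that contributes the most to the total."""
        parts = {
            "processing": self.processing_ms,
            "queuing": self.queuing_ms,
            "transmission": self.transmission_ms,
        }
        return max(parts, key=parts.get)
```

Breaking a measurement down this way shows where to focus: if queuing dominates, faster queries or a faster network will do little to reduce the total.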

In business intelligence (BI), data latency is how long it takes for a business user to retrieve source data from a data warehouse (DW) or BI dashboard. The fundamental question that architects need to answer is how fast does access need to be?

Architectural Concern: Real-time

A real-time application (RTA) is an application that functions within a time frame that the user senses as immediate or current. For applications to work in a real-time manner their data must be available in real-time, which implies that underlying data architectures must work in a real-time manner.

Real-time has very interesting implications for the architecture of data warehouses (DW) and business intelligence (BI) solutions that work with data from a wide range of data sources. If any of the incoming data is coming in batch, or comes in after a significant latency (say more than a few seconds), then your DW/BI solution isn’t an RTA. Luckily, the architectural advice of the Data Vault 2 method includes strategies to develop real-time DW/BI solutions at scale.

Architectural Concern: Security

Data security is the process of safeguarding digital information to protect it from corruption, theft, or unauthorized access. This potentially includes, but is not limited to, the following concerns:

Access control. Who should access which data? This includes limiting both physical and digital access to critical systems and data.
Authentication. Ensure that you know who is accessing data. Includes alerting authorities when unauthorized access is attempted.
Availability. Ensures that data is readily — and safely — accessible and available for ongoing business needs.
Data auditing. Record/log who is accessing which data, and when they are doing so.
Data erasure. How is data deleted? Is it hard deleted (completely removed) or soft deleted (marked as no longer active)? Is deletion of data logged for historical purposes?
Data masking. The process of modifying sensitive data so that it is of no or little value to unauthorized intruders while still being usable by IT professionals. Also known as data obfuscation.
Encryption. The translation of data from plaintext (unencrypted) to ciphertext (encrypted). Encrypted data can only be processed after it’s been decrypted.
Integrity. Ensure that all data stored is reliable, accurate, and not subject to unwarranted changes.
Privacy/confidentiality. Ensures that data is accessed only by authorized users with the proper credentials, an aspect of access control that is often driven by regulations. Do customers have a “right to be forgotten” requiring deletion of private data?
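
As a small illustration of the data masking concern above, the following sketch hides all but the last few characters of a sensitive value. It is a simplistic assumption about what masking looks like; the `mask` function is hypothetical, and real masking strategies depend on the data type and on applicable regulations.

```python
def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    """Hide all but the last `visible` characters of a sensitive value."""
    if visible <= 0 or len(value) <= visible:
        # Too short to reveal anything safely, so mask the whole value.
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]
```

For example, `mask("4111111111111111")` keeps only the final four digits, so a support engineer can still tell two records apart without seeing the full number.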

Architectural Concern: Throughput

Throughput is a measure of how many units of information a component, such as a system or network connection, can process in a given amount of time. As the number of components within your ecosystem grows, as the amount of data that your stakeholders produce grows, as their usage grows, and as the amount of incoming data from external sources grows, so must the throughput capacity of your data architecture.

3. Clean Data Architecture in Context

The following table summarizes the trade-offs associated with the strategy of having a clean data architecture and provides advice for when (not) to adopt it.

Advantages	Easier to understand; easier to evolve, thereby enabling agility; easier to validate.
Disadvantages	Requires investment to keep clean, including in architectural modeling and architectural refactoring; existing legacy architectures often have significant technical debt that needs to be addressed before your architecture is sufficiently clean.
When to Adopt This Practice	My knee-jerk reaction is to say always, but that wouldn’t be accurate. Sometimes time is of the essence and it makes sense to accept technical debt now and pay it down in the future. Hopefully that is a rare decision and, when it is made, a prudent and deliberate one.