
Clean Data Architecture: Architectural Concerns


This page is a work in progress.

Data architecture addresses the data aspects of a system or enterprise. A clean data architecture is one that is easy to understand, to implement, and to evolve. To create and maintain a clean data architecture, you must make trade-offs between a collection of architectural concerns such as coupling, cohesion, security, scalability, resiliency, and many others.

This article addresses the following topics:

  1. Why is clean data architecture important?
  2. Architectural concerns for clean data architecture
  3. Clean data architecture in context
  4. Related resources

1. Why is Clean Data Architecture Important?

There are several reasons why clean data architecture is critical to your success. First, the environment in which we work is constantly changing, and the rate of change is increasing. Clean data architectures enable us to better address and support these changes than architectures that are difficult to understand and work with. Second, the problems that we are taking on are complex and growing in complexity over time. Clean data architectures are better suited to enable us to deal with that complexity than those that are not. Third, we must be able to properly govern our data and data solutions. Clean data architectures better enable us to do so because they are easier to understand, and thus easier to monitor and support.


2. Architectural Concerns for Clean Data Architecture

For our architecture to be clean, what architectural concerns should it potentially address? These concerns, which could be thought of as quality of service (QoS) requirement categories, are overviewed in Figure 1 and explored in greater detail below.


Figure 1. Data architecture concerns.


There are two types of data architecture concerns:

  1. Strategic concerns. Strategic concerns focus on aspects that enable the long-term viability of your data architecture.
  2. Tactical concerns. Tactical concerns focus on aspects that enable effective implementation of your data architecture.

2.1 Strategic Data Architecture Concerns

Strategic concerns focus on aspects that enable the long-term viability of your data architecture. These concerns, and how they enable one another, are depicted in Figure 2.


Figure 2. Strategic concerns for clean data architecture.


These strategic concerns are:


Architectural Concern: Cohesion

Cohesion is a measure of the degree to which the elements of something are functionally related. It is the degree to which all elements within the item are directed towards fulfilling a single purpose. Basically, cohesion is the internal glue that keeps an item together. Table 1 overviews examples of high cohesion, which are good, as well as examples of low cohesion, which are bad. A clean architecture is built from items that are highly cohesive.

Table 1. Examples of cohesive data items.

| Item | High Cohesion (Good) Example | Low Cohesion (Bad) Example |
| --- | --- | --- |
| Column | Employee.StartDate - Stores the start date of an employee and nothing else. | Person.FirstDate - Stores the start date if they are an employee, the first date they made a purchase with us if they're a customer, the first time we contacted them if they're a potential hire, or the first time they wrote about us if they are a media contact. |
| Table | Employee - A table in 3rd normal form that stores only data about employees. | CompleteEmployee - A table in first normal form that stores employee information, address information, skills information, and more. |
| Data service | SaveEmployee(data) - A function that saves employee data to its source of record. | Employee(action, data) - A function that reads, writes, or deletes employee records depending on the action indicator. |
| Data domain | Employee - A logical domain that encompasses the entity types and their relationships for capturing employee information. | Reporting - A logical domain that encompasses the entity types and their relationships that appear in your data warehouse. |
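
The data service row of Table 1 can be sketched in code. This is only a minimal illustration, assuming a hypothetical in-memory dictionary as the source of record; the names `save_employee`, `read_employee`, and `employee` are invented here to mirror the table and do not come from any real library.

```python
from typing import Any, Dict, Optional

# Hypothetical in-memory "source of record", used only for illustration.
_EMPLOYEES: Dict[int, Dict[str, Any]] = {}


def save_employee(data: Dict[str, Any]) -> None:
    """High cohesion: this function does one thing, it saves an employee."""
    _EMPLOYEES[data["id"]] = dict(data)


def read_employee(employee_id: int) -> Optional[Dict[str, Any]]:
    """High cohesion: this function only reads an employee."""
    record = _EMPLOYEES.get(employee_id)
    return dict(record) if record is not None else None


def employee(action: str, data: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Low cohesion: one function whose behavior depends on an action flag."""
    if action == "read":
        return read_employee(data["id"])
    if action == "write":
        save_employee(data)
        return None
    if action == "delete":
        _EMPLOYEES.pop(data["id"], None)
        return None
    raise ValueError(f"unknown action: {action}")
```

Callers of `employee` must know the valid action strings and which keys each action expects, whereas each cohesive function states its single purpose in its name and signature.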

Architectural Concern: Consistency

Consistency is a measure of the conformity of something, typically the level of conformity which is necessary for the sake of logic, accuracy, or fairness. In data architecture, consistency is an issue with:

  • Data values. Table 2 summarizes common levels of data consistency.
  • Naming conventions.
  • Security access.
  • Infrastructure. Infrastructure includes items such as server hardware and software configurations.

Table 2. Three levels of data consistency.

| Level of Consistency | Implementation |
| --- | --- |
| Exact | Data within a data source is completely consistent at all times. Within DBMSs this is typically implemented through ACID (atomicity, consistency, isolation, durability) transactions. ACID transactions are guaranteed to either succeed completely or to fail completely. For example, when $5 is transferred between two accounts as an ACID transaction, only one of two things will happen: $5 is debited from the source account and credited in the target account -OR- neither account is updated. It cannot be the case that the $5 is debited but not credited or vice versa. |
| Eventually consistent | With this strategy there may be a period of time where the data is inconsistent. For example, when $5 is transferred between two accounts it might first be debited from the source account and then later credited in the target account. Or, the $5 is first credited in the target account and then at some later point debited from the source account. |
| Good enough | Some discrepancies in the data will be tolerated. For example, with the $5 transfer it may be OK that the target account is credited but the source account doesn't get debited due to a glitch in our logic. It's only $5 after all, it's not worth our effort and nobody's complaining. |
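
The "Exact" level in Table 2 can be illustrated with a small sketch of the $5 transfer. It assumes SQLite (through Python's standard `sqlite3` module) as a stand-in DBMS; the `account` table, the `transfer` function, and the rule that balances may not go negative are all hypothetical and exist only to trigger a rollback.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, source: int, target: int, amount: int) -> bool:
    """Move `amount` between accounts as one transaction; return True on success."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if the block raises.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, source),
            )
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, target),
            )
            # Hypothetical business rule: balances may not go negative.
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (source,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 10), (2, 0)])
conn.commit()

ok = transfer(conn, 1, 2, 5)       # both the debit and the credit are applied
failed = transfer(conn, 1, 2, 50)  # the debit and the credit are both rolled back
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

After the failed transfer, neither account has changed: the debit is never kept without the credit, which is the all-or-nothing behavior the table describes.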

Architectural Concern: Coupling

Coupling is the measure of the degree of interdependence between two or more items. There will always be coupling within your architecture; the goal is to keep it low. We say that items are loosely coupled when there are few dependencies and minimal interaction between them. Table 3 summarizes several types of architectural coupling.

Table 3. Types of potential coupling in data architectures.

| Coupling Type | Loose Coupling (Good) Example | High Coupling (Bad) Example |
| --- | --- | --- |
| Data to data (relational). Data entities have relationships between them. | The EmployeeTeam relationship table captures the many-to-many relationship between Employee and Team entities. | The Employee table has six foreign key columns to the Team table, one for each of the six different types of relationships that an employee may have with various teams within our organization. |
| Naming. When we apply naming conventions, which is an important quality strategy, we effectively couple the name of something to that convention and indirectly to the names of other things. | The Employee.StartDate column name conforms to our naming standard that all names should be descriptive and use camel case. | The Employee.EmployeeStartDate column name conforms to our naming convention to include the table name. This couples the column name to the table name, causing us a problem if we ever decide to rename the table. |
| Program to data. When a program works with data, that program and the source of the data are now coupled to one another. | Programs access data via a data API (sometimes called an API gateway or encapsulation layer) that implements CRUD (create, read, update, delete) functionality. Only the API logic is coupled to the data. | Programs access data directly for whatever data source is appropriate. The programs are coupled to the schemas of the data sources. |
| Program to program. Programs will interact with one another, and when they do they will share data and thereby become coupled. | Programs interact via transactions submitted to a message bus that routes the transactions and results accordingly. The programs are now coupled only to the API of the message bus. | Programs interact directly via point-to-point direct access, and are thereby coupled to one another via whatever individual APIs they offer. |
| Layer to layer. Behavior in one architectural layer will invoke behavior in another. As a result those architectural layers become coupled. | In a three-layer architecture the user interface only interacts with the business layer, the business layer in turn interacts with the data access layer, which then interacts with data sources. Only adjacent layers may interact with one another. | In an unlayered architecture the user interface directly accesses the database for data and the business layer for business functionality. The business layer will also directly access data sources. |
| Implementation to vendor. When we take advantage of specific vendor functionality then we couple our implementation to that offered by that vendor. | Program logic accesses vendor functionality via an API wrapper service. | Your systems directly invoke functionality that is unique to a given vendor. This may perform better, but you are now coupled directly to the vendor in multiple places. |
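
The "program to data" row of Table 3 can be sketched as follows. This is an illustrative sketch only: `EmployeeDataApi`, `Employee`, and `describe_employee` are hypothetical names, the table layout is assumed, and SQLite (via Python's standard `sqlite3` module) stands in for whatever data source you actually use.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Employee:
    employee_id: int
    name: str
    start_date: str  # ISO 8601 date, e.g. "2020-01-15"


class EmployeeDataApi:
    """Hypothetical data API; the only code that knows the table schema."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        with self._conn:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS Employee ("
                "EmployeeId INTEGER PRIMARY KEY, Name TEXT, StartDate TEXT)"
            )

    def create(self, emp: Employee) -> None:
        with self._conn:
            self._conn.execute(
                "INSERT INTO Employee (EmployeeId, Name, StartDate) VALUES (?, ?, ?)",
                (emp.employee_id, emp.name, emp.start_date),
            )

    def read(self, employee_id: int) -> Optional[Employee]:
        row = self._conn.execute(
            "SELECT EmployeeId, Name, StartDate FROM Employee WHERE EmployeeId = ?",
            (employee_id,),
        ).fetchone()
        return Employee(*row) if row else None

    def update(self, emp: Employee) -> None:
        with self._conn:
            self._conn.execute(
                "UPDATE Employee SET Name = ?, StartDate = ? WHERE EmployeeId = ?",
                (emp.name, emp.start_date, emp.employee_id),
            )

    def delete(self, employee_id: int) -> None:
        with self._conn:
            self._conn.execute(
                "DELETE FROM Employee WHERE EmployeeId = ?", (employee_id,)
            )


def describe_employee(api: EmployeeDataApi, employee_id: int) -> str:
    """A 'program' that depends on the API, not on SQL or column names."""
    emp = api.read(employee_id)
    if emp is None:
        return "no such employee"
    return f"{emp.name} started on {emp.start_date}"


api = EmployeeDataApi(sqlite3.connect(":memory:"))
api.create(Employee(1, "Ada", "2020-01-15"))
summary = describe_employee(api, 1)
```

If a column in the underlying table were renamed, only the SQL inside `EmployeeDataApi` would need to change; `describe_employee` and any other caller would be unaffected, which is the loose coupling the table describes.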

Architectural Concern: Domain Driven

Organize your architecture around core business concepts, rather than systems or technologies. This is a business-driven approach that aligns your architecture with organizational strategies. At the heart is a domain model depicting the high-level business entity types that are pertinent to your organization, and the relationships between them. Figure 3 depicts an enterprise domain model for a full-service financial institution, showing the primary entity types and the relationships between them.

Figure 3. An enterprise domain model for a financial institution (UML notation).

TBD.

Domain is an overloaded term. In this context, I'm using domain to mean one of two things: high-level entity types that are critical to your organization -OR- related collections of such entity types and the relationships between them. At a large scale, domain will sometimes be used to refer to the industry that the organization operates in, such as Financial. At a small scale, domain will sometimes be used to represent the collection of values of an attribute, such as a list of the valid values of the days of the week.

These are important considerations about domain models:

  1. Entity types should be cohesive. Each entity type in Figure 3, such as Account, may have dozens of entities and relationships between them. These details would be captured via a more detailed domain model.
  2. Relationships indicate coupling.
  3. Domains can drive implementation. The Account domain would potentially be implemented via a high-cohesion strategy, such as micro-services, domain APIs, or large-scale domain components.
  4. Stakeholder collaboration is essential. To understand the domain you will need to collaborate closely with your stakeholders, ideally taking an active stakeholder participation approach. When domain modeling, your goal is to understand the data aspects of your environment and how that data is used.

Architectural Concern: Evolvability

Evolvability is the capacity to adapt an existing architecture. It is the capability of solutions to be evolved to continue providing value to your customers in a cost-effective manner.

In other words,

  • Evolvability = Extensibility + Refactorability

Where:

  • Extensibility is the capacity to extend or add to an existing architecture.
  • Refactorability is the capacity to improve the quality of an existing architecture.
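
The two terms above can be made concrete at the schema level. The following is a rough sketch under stated assumptions: SQLite (via Python's standard `sqlite3` module) stands in for your database, and the `Emp` and `Employee` table names are hypothetical. Adding a column illustrates extensibility; renaming a poorly named table while leaving a view under the old name illustrates refactorability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Emp (EmployeeId INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("INSERT INTO Emp VALUES (1, 'Ada')")

# Extensibility: add to the existing schema without disturbing current readers.
conn.execute("ALTER TABLE Emp ADD COLUMN StartDate TEXT")

# Refactorability: improve quality (a clearer table name) without changing
# what the data means. The view lets code that still reads "Emp" keep working.
conn.execute("ALTER TABLE Emp RENAME TO Employee")
conn.execute("CREATE VIEW Emp AS SELECT EmployeeId, Name, StartDate FROM Employee")
conn.commit()

old_readers = conn.execute("SELECT Name FROM Emp").fetchall()
new_readers = conn.execute("SELECT Name, StartDate FROM Employee").fetchall()
```

Queries written against the old name and queries written against the new name both succeed, so dependent programs can be migrated gradually rather than all at once.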

There are three key sources of change that motivate the evolution of your architecture:

  1. Environmental. These are changes motivated by customers, competitors, partners, and suppliers that your organization works with. Customers have new needs, competitors do things you need to address, the priorities and focus of your partners may change, and your suppliers have new or updated offerings that you may choose to leverage.
  2. Organizational. The priorities and ways of working (WoW) within your organization will evolve over time.
  3. Experiential. As the customers of your solutions work with them, they will realize what aspects they like and don't like, and will often identify new functionality that they believe they desire.

Our aim should be to move away from rigid architectures that are difficult to adapt to changing circumstances to ones that are easy to adapt. A symptom of data architectures that are hard to evolve is the "spaghetti architecture," where components are highly coupled to one another, with many data flows and connections between them.


Architectural Concern: Flow

TBD


Architectural Concern: Resiliency

TBD


Architectural Concern: Scalability

  • Do you need to cluster?
  • Support multiple physical locations globally?

2.2 Tactical Data Architecture Concerns

Tactical concerns focus on aspects that enable effective implementation of your data architecture. These tactical concerns are:


Architectural Concern: Deployment

TBD


Architectural Concern: Fit-for-purpose technology

  • SQL
  • Hierarchical
  • No-SQL
  • Files

Architectural Concern: Integration

TBD


Architectural Concern: Latency

How fast does access need to be?


Architectural Concern: Real-time

TBD


Architectural Concern: Security

  • Who should access which data?
  • Encryption

Architectural Concern: Throughput

TBD


Architectural Concern: Usage

TBD


3. Clean Data Architecture in Context

The following table summarizes the trade-offs associated with the strategy of having a clean data architecture and provides advice for when (not) to adopt it.

Advantages
  • Easier to understand
  • Easier to evolve, thereby enabling agility
  • Easier to validate
Disadvantages
  • Requires investment to keep clean, including in architectural modeling and architectural refactoring
  • Existing legacy architectures often have significant technical debt that needs to be addressed before your architecture is sufficiently clean
When to Adopt This Practice
  • My knee-jerk reaction is to say always, but that wouldn't be accurate. Sometimes time is of the essence and it makes sense to accept technical debt now and decide to pay it down in the future. Hopefully that is a rare decision and, when it is made, a prudent and deliberate one.


4. Related Resources