Coupling is a measure of the degree of interdependence between two or more items. There will always be coupling within your architecture; the goal is to keep it low. We say that items are loosely coupled when there are few dependencies and minimal interaction between them.
Table 3 summarizes several types of architectural coupling.
| Coupling Type | Loose Coupling (Good) Example | High Coupling (Bad) Example |
| --- | --- | --- |
| Data to data (relational). Data entities have relationships between them. | The EmployeeTeam relationship table captures the many-to-many relationship between the Employee and Team entities. | The Employee table has six foreign key columns to the Team table, one for each of the six different types of relationships that an employee may have with various teams within our organization. |
| Naming. When we apply naming conventions, an important quality strategy, we effectively couple the name of something to that convention, and indirectly to the names of other things. | The Employee.StartDate column name conforms to our naming standard that all names should be descriptive and use camel case. | The Employee.EmployeeStartDate column name conforms to our naming convention to include the table name. This couples the column name to the table name, causing us a problem if we ever decide to rename the table. |
| Program to data. When a program works with data, that program and the source of the data are coupled to one another. | Programs access data via a data API (sometimes called an API gateway or encapsulation layer) that implements CRUD (create, read, update, delete) functionality. Only the API logic is coupled to the data (see the sketch following this table). | Programs access data directly from whatever data source is appropriate. The programs are coupled to the schemas of the data sources. |
| Program to program. Programs will interact with one another, and when they do they will share data and thereby become coupled. | Programs interact via transactions submitted to a message bus that routes the transactions and results accordingly. The programs are coupled only to the API of the message bus. | Programs interact directly via point-to-point access, and are thereby coupled to one another via whatever individual APIs they offer. |
| Layer to layer. Behavior in one architectural layer will invoke behavior in another; as a result, those architectural layers become coupled. | In a three-layer architecture the user interface interacts only with the business layer, the business layer in turn interacts with the data access layer, which then interacts with data sources. Only adjacent layers may interact with one another. | In an unlayered architecture the user interface directly accesses the database for data and the business layer for business functionality. The business layer will also directly access data sources. |
| Implementation to vendor. When we take advantage of vendor-specific functionality, we couple our implementation to that vendor's offering. | Program logic accesses vendor functionality via an API wrapper service. | Your systems directly invoke functionality that is unique to a given vendor. You may get better performance, but you are now coupled directly to that vendor in multiple places. |
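The "program to data" row above suggests encapsulating data access behind a CRUD API. Here is a minimal Python sketch of that idea; the EmployeeRepository class and its SQLite-backed Employee table are hypothetical, and a real encapsulation layer would also offer update and delete operations and handle connection management.

```python
import sqlite3

class EmployeeRepository:
    """Hypothetical data API: the only code coupled to the Employee schema."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS Employee ("
            " id INTEGER PRIMARY KEY, name TEXT, startDate TEXT)"
        )

    def create(self, name: str, start_date: str) -> int:
        cur = self.conn.execute(
            "INSERT INTO Employee (name, startDate) VALUES (?, ?)",
            (name, start_date),
        )
        self.conn.commit()
        return cur.lastrowid

    def read(self, employee_id: int) -> dict | None:
        row = self.conn.execute(
            "SELECT id, name, startDate FROM Employee WHERE id = ?",
            (employee_id,),
        ).fetchone()
        if row is None:
            return None
        return {"id": row[0], "name": row[1], "startDate": row[2]}

# Application code is coupled only to the repository's API, not to the schema.
repo = EmployeeRepository(sqlite3.connect(":memory:"))
emp_id = repo.create("Ada", "2024-01-15")
print(repo.read(emp_id))
```

If the Employee table is later renamed or restructured, only the repository changes; the application code that calls it does not.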
Architectural Concern: Domain Driven
Organize your architecture around core business concepts, rather than systems or technologies.
This is a business-driven approach that aligns your architecture with organizational strategies. At
the heart is a domain model depicting the high-level business entity types that are pertinent to
your organization, and the relationships between them.
Figure 3 depicts an enterprise domain model for a full-service
financial institution, showing the primary entity types and the relationships between them.
Figure 3. An enterprise domain model for a financial institution (UML notation).
TBD.
Domain is an overloaded term. In this context, I'm using domain to mean one of two things: high-level entity types that are critical to your organization, or related collections of such entity types and the relationships between them. At a large scale, domain will sometimes be used to
refer to the industry that the organization operates in, such as Financial. At a small scale, domain
will sometimes be used to represent the collection of values of an attribute, such as a list of the
valid values of the days of the week.
These are important considerations about domain models:
- Entity types should be cohesive. Each entity type in Figure 3, such as Account, may
  have dozens of entities and relationships between them. These details would be captured via
  a more detailed domain model.
- Relationships indicate coupling. The relationships between entity types in your domain model show you where the corresponding domains are coupled to one another.
- Domains can drive implementation. The Account domain would potentially be implemented
  via a high-cohesion strategy, such as microservices, domain APIs, or large-scale domain
  components (see the sketch after this list).
- Stakeholder collaboration is essential. To understand the domain you will need to
  collaborate closely with your stakeholders, ideally taking an
  active stakeholder participation approach. When domain modeling,
  your goal is to understand the data aspects of your environment and how that data is used.
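To make these ideas a bit more concrete, here is a hypothetical Python fragment of such a domain model. The Customer and Account entity types, their many-to-many relationship, and the open_account helper are all illustrative assumptions; a real domain model would capture far more entity types and detail.

```python
from dataclasses import dataclass, field

# Hypothetical entity types from a financial domain model.
@dataclass
class Customer:
    customer_id: int
    name: str
    accounts: list["Account"] = field(default_factory=list)  # relationship => coupling

@dataclass
class Account:
    account_number: str
    balance: float
    holders: list[Customer] = field(default_factory=list)  # many-to-many with Customer

def open_account(customer: Customer, number: str) -> Account:
    """Maintain both sides of the Customer-Account relationship."""
    account = Account(account_number=number, balance=0.0, holders=[customer])
    customer.accounts.append(account)
    return account

alex = Customer(customer_id=1, name="Alex")
acct = open_account(alex, "CHK-0001")
print(len(alex.accounts), acct.holders[0].name)
```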
Architectural Concern: Evolvability
Evolvability is the capacity to adapt an existing architecture. It is the capability of solutions
to be evolved to continue providing value to your customers in a cost-effective manner.
In other words,
- Evolvability = Extensibility + Refactorability
Where:
- Extensibility is the capacity to extend or add to an existing architecture.
- Refactorability is the capacity to improve the quality of an existing architecture.
There are three key sources of change that motivate the evolution of your architecture:
- Environmental. These are changes motivated by customers, competitors, partners, and
suppliers that your organization works with. Customers have new needs, competitors do things
you need to address, the priorities and focus of your partners may change, and your
suppliers have new or updated offerings that you may choose to leverage.
- Organizational. The priorities and ways of working (WoW) within your organization
will evolve over time.
- Experiential. As the customers of your solutions work with them, they will realize
  what aspects they like and don't like, and will often identify new functionality that they
  believe they want.
Our aim should be to move away from rigid architectures that are difficult to adapt to changing circumstances toward ones that are easy to adapt. A symptom of a data architecture that is hard to evolve is a "spaghetti architecture", where components are highly coupled to one another, with many data flows and connections between them.
Architectural Concern: Flow
Flow is the ability to go from one place to another consistently, in a steady stream, with large and potentially varying throughput. Your aim is for a continuous flow of data throughout your environment, from the source of that data to wherever it is needed.
I use the term “continuous flow” for two critical reasons:
- Data generation is 24/7. New and updated data is coming at you constantly and as a
result you need to process it constantly.
- Data usage is 24/7. Your stakeholders, both internal and external to your
organization, often need to access and work with your data at any time of day.
A significant architectural implication of the desire to achieve continuous flow is that you can no longer rely on batch jobs. Continuous flow requires you to accept data when it arrives and then get it to wherever it needs to go at that point in time.
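To illustrate the contrast with batch processing, here is a minimal Python sketch of a continuous-flow consumer, under the assumption that an in-memory queue stands in for whatever streaming technology (message broker, event stream) your architecture actually uses. Each record is processed when it arrives rather than waiting for a batch window.

```python
import queue
import threading
import time

events: "queue.Queue[dict]" = queue.Queue()  # stand-in for a streaming source

def consume() -> None:
    """Process each record when it arrives -- no batch window."""
    while True:
        record = events.get()  # blocks until data flows in
        if record.get("stop"):
            break
        print("routed to destination:", record)

worker = threading.Thread(target=consume)
worker.start()

# Data generation is 24/7: records trickle in at arbitrary times.
for i in range(3):
    events.put({"id": i, "payload": f"reading-{i}"})
    time.sleep(0.1)
events.put({"stop": True})
worker.join()
```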
Architectural Concern: Resiliency
Resiliency within your data architecture is the capacity to recover from difficulties quickly, seamlessly, and ideally without end users noticing. In effect, resiliency is the toughness of your system.
Strategies that support resilient data architecture include:
- Resilient clouds. Cloud service providers have already addressed resiliency issues in
their architecture, very likely far better than you ever will. Consider outsourcing your
hardware infrastructure.
- Server farms. Multi-blade server farms have the potential for greater resiliency, as
compared with single-server strategies, resulting from fail-over features.
- Automatic server recovery. Your infrastructure should be built to detect when a
server goes down and automatically recover it as quickly as possible.
- Automatic network recovery. Similarly, your network should also be built to
automatically recover when it runs into difficulty.
- Automatic rerouting. Data flow within your network should automatically reroute
itself when a pathway goes down or hits capacity.
- RAID storage. Build your data storage solution from RAID (redundant array of
independent disks) technology.
- ACID transactions. To ensure that your data is consistent, critical transactions
  should be ACID (atomic, consistent, isolated, and durable), as shown in the sketch after this list.
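Here is a minimal sketch of the ACID bullet above, using SQLite from Python. The account table and transfer amounts are hypothetical; the point is that both updates commit together or roll back together, leaving the data consistent either way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(amount: int, src: int, dst: int) -> None:
    """Move funds atomically: both updates succeed or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
            # Enforce consistency: no negative balances allowed.
            (balance,) = conn.execute("SELECT balance FROM account WHERE id = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the rollback already restored a consistent state

transfer(30, 1, 2)   # succeeds
transfer(999, 1, 2)  # fails and rolls back
print(conn.execute("SELECT id, balance FROM account").fetchall())  # [(1, 70), (2, 80)]
```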
Architectural Concern: Scalability
Scalability is the measure of a system's ability to vary its performance and cost in response to changes in demand. Scalability is often thought of as the need to increase capacity as demand increases, but it is also the ability to shrink capacity as demand decreases.
Regarding data, scalability enables your architecture to react to changing volumes and rates of
incoming data as well as changing volumes of data usage.
Scalability is critical for two reasons:
- Support for variable volume. Sometimes data trickles in and sometimes it comes in as
  a raging torrent. Sometimes your stakeholders rarely access your data and other times they
  want to work with huge amounts of it. In short, your architecture must support data
  volumes as they vary.
- Support for increasing volume. Over time, the volume of incoming data and of
requested data is likely to rise. This applies to minimum volume at any given time, average
volume, and maximum volume. The implication is that your data architecture must scale in
pace with the increasing volume.
Issues to consider when architecting for scalability:
- Do you need to cluster?
- Do you need to support multiple physical locations globally?
- How will your data storage automatically reconfigure itself (think cloud-based storage or
  internal server farms)?
- How will your network pathways reconfigure themselves as demand varies?
2.2 Tactical Data Architecture Concerns
Tactical concerns focus on aspects that enable effective implementation of your data architecture.
These tactical concerns are:
Architectural Concern: Consumability
Consumability is the combination of three required aspects:
- Functionality. Does the data required by your stakeholders exist, and is it available to
  them?
- Usability. Is the data accessible by your stakeholders in a manner that is
appropriate for them? Do they have tools that make the data easy to work with? Is the data
presented to them in a manner that reflects their needs? Are data elements named and
described in a manner understandable by them?
- Desirability. Do stakeholders have the data, information, and tooling that they want
to work with?
Architectural Concern: Deployment
Deployment is the mechanism through which applications, modules, updates, and patches are
delivered by a development team into production to be available for use. The practices used by
developers to build, test and deploy new functionality and information will impact how fast and how
often they can deploy. Agile practices that enable effective deployment include
automated regression testing,
continuous integration (CI), and continuous deployment (CD).
Deployment practices, in turn, are constrained by the architecture of your system. To support
agile ways of working, you may need to architect:
- A versioning system for your data sources. Practices such as continuous database
  integration (CDI) require that data sources know their current version so that data
  schema changes can be applied appropriately (see the sketch after this list).
- Data source switchover. To support data evolution and updates, your data sources may
  need to be able to run in parallel, enabling one version to be updated while the other is
  operational. This enables hot switchover (blue/green) deployment strategies.
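The sketch below illustrates the data source versioning idea in Python with SQLite. The schema_version table and the hard-coded migration list are illustrative assumptions, not the format of any particular CDI tool.

```python
import sqlite3

# Hypothetical ordered migrations; a real CDI tool would load these from files.
MIGRATIONS = [
    "CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT)",
    "ALTER TABLE Employee ADD COLUMN startDate TEXT",
]

def migrate(conn: sqlite3.Connection) -> None:
    """Apply only the schema changes this data source hasn't seen yet."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    current = row[0] or 0
    for version, ddl in enumerate(MIGRATIONS, start=1):
        if version > current:
            conn.execute(ddl)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)  # applies both migrations
migrate(conn)  # idempotent: the data source knows it is at version 2
print(conn.execute("SELECT MAX(version) FROM schema_version").fetchone())  # (2,)
```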
Architectural Concern: Fit-for-purpose technology
Contrary to popular opinion in the clothing industry, one size does not fit all. There is a
variety of data processing needs, including but not limited to high-volume transactions,
large-volume data queries, complex data traversal, and data “blob” consumption. The implication is that your data architecture must support a wide range of needs, and thus an adequate range of technologies. As a result you may need to adopt many different types of data technologies (such as relational, hierarchical, and NoSQL databases) offered by several vendors.
But, this doesn’t mean that it should be a technology free-for-all. Remember that you are
constrained by your organizational ability to maintain and evolve these technologies over time. How
many technologies can your organization reasonably work with over time? For a given technology, will
you be able to attract and retain people with sufficient skills to work with it? Will the technology
stand the test of time?
Architectural Concern: Integration
An integration is a connection between two components that enables them to work together as a whole, sharing information and data. Integrations are built on APIs (application programming interfaces) and allow for the flow of information between components, connecting hardware and software components so that everything can be used in unison.
In simple environments, those with a handful of components, direct integrations between components will get the job done. However, as the number of components grows, the number of potential direct integrations grows quadratically: with n components there are n(n-1)/2 potential point-to-point integrations. You very quickly find that you need to consider an indirect integration strategy, such as an information broker or information bus, to integrate the various components within your ecosystem.
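As a sketch of the indirect strategy, here is a toy in-memory information bus in Python. Real brokers add routing, persistence, and delivery guarantees, but the coupling story is the same: each component is coupled only to the bus's publish/subscribe API, not to the other components.

```python
from collections import defaultdict
from typing import Callable

class InformationBus:
    """Toy broker: components know the bus, not each other."""

    def __init__(self) -> None:
        self.subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        for handler in self.subscribers[topic]:
            handler(message)

bus = InformationBus()
# Two downstream components integrate without knowing the publisher exists.
bus.subscribe("order.created", lambda m: print("billing saw", m["order_id"]))
bus.subscribe("order.created", lambda m: print("shipping saw", m["order_id"]))
bus.publish("order.created", {"order_id": 42})
```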
Architectural Concern: Latency
Data latency refers to the time from an initial request to the response. There are three
components of data latency:
- Processing time. The time that it takes to “crunch” the data.
- Queuing delay. The amount of time that data waits to be processed or transmitted.
- Transmission time. The time it takes for data to travel from one endpoint to another.
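These three components are additive. As a simple illustration with hypothetical numbers, a request that takes 40 ms of processing time, waits 10 ms in a queue, and spends 50 ms in transmission has a total data latency of 40 + 10 + 50 = 100 ms.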
In business intelligence (BI), data latency is how long it takes for a business user to retrieve source data from a data warehouse (DW) or BI dashboard. The fundamental question that architects need to answer is: how fast does access need to be?
Architectural Concern: Real-time
A real-time application (RTA) is an application that functions within a time frame that the user
senses as immediate or current. For applications to work in a real-time manner their data must be
available in real-time, which implies that underlying data architectures must work in a real-time
manner.
Real-time has very interesting implications for the architecture of
data warehouses (DW) and business
intelligence (BI)
solutions that work with data from a wide range of data sources. If any of the incoming data arrives in batch, or arrives after a significant latency (say, more than a few seconds), then your DW/BI solution isn’t an RTA. Luckily, the architectural advice of the Data Vault 2.0 method includes strategies to develop real-time DW/BI solutions at scale.
Architectural Concern: Security
Data security is the process of safeguarding digital information to protect it from corruption,
theft, or unauthorized access. This potentially includes, but is not limited to, the following
concerns:
- Access control. Who should access which data? This includes limiting both physical and
digital access to critical systems and data.
- Authentication. Ensure that you know who is accessing data. Includes alerting authorities
when unauthorized access is attempted.
- Availability. Ensures that data is readily — and safely — accessible and available for ongoing
business needs.
- Data auditing. Record/log who is accessing which data, and when they are doing so.
- Data erasure. How is data deleted? Is it hard deleted (completely removed) or soft deleted
(marked as no longer active)? Is deletion of data logged for historical purposes?
- Data masking. The process of modifying sensitive data so that it is of little or no value to
  unauthorized intruders while still being usable by IT professionals (see the sketch after this
  list). Also known as data obfuscation.
- Encryption. The translation of data from plaintext (unencrypted) to ciphertext (encrypted).
Encrypted data can only be processed after it's been decrypted.
- Integrity. Ensure that all data stored is reliable, accurate, and not subject to unwarranted
changes.
- Privacy/confidentiality. Ensures that data is accessed only by authorized users with the
proper credentials, an aspect of access control that is often driven by regulations. Do
customers have a “right to be forgotten” requiring deletion of private data?
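To illustrate the data masking bullet above, here is a minimal Python sketch. The specific rules (keeping the last four digits of a card number, hiding the local part of an email address) are common illustrative choices, not a prescribed standard.

```python
import re

def mask_card(card_number: str) -> str:
    """Mask all but the last four digits, e.g. for display or test data."""
    digits = re.sub(r"\D", "", card_number)
    return "*" * (len(digits) - 4) + digits[-4:]

def mask_email(email: str) -> str:
    """Hide the local part so the value is useless to an intruder."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

print(mask_card("4111-1111-1111-1111"))  # ************1111
print(mask_email("pat@example.com"))     # p***@example.com
```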
Architectural Concern: Throughput
Throughput is a measure of how many units of information a component, such as a system or network
connection, can process in a given amount of time. As the number of components within your ecosystem
grows, as the amount of data that your stakeholders produce grows, as their usage grows, and as the
amount of incoming data from external sources grows, so must the throughput capacity of your data
architecture.
3. Clean Data Architecture in Context
The following summarizes the trade-offs associated with the strategy of having a clean data architecture and provides advice for when (not) to adopt it.
Advantages:
- Easier to understand
- Easier to evolve, thereby enabling agility
- Easier to validate

Disadvantages:
- Requires investment to keep clean, including in architectural modeling and architectural refactoring
- Existing legacy architectures often carry significant technical debt that needs to be addressed before your architecture is sufficiently clean

When to Adopt This Practice:
My knee-jerk reaction is to say always, but that wouldn't be accurate. Sometimes time is of the essence and it makes sense to accept technical debt now and decide to pay it down in the future. Hopefully that is a rare decision that, when it is made, is a prudent and deliberate one.
4. Related Resources