Sometimes you are in a position to develop your data schema from scratch when you are developing a new system using object-oriented technologies. If so, consider yourself among the lucky few: the vast majority of developers are forced to tolerate one or more existing legacy data designs. Worse yet, it is often presumed that these data sources cannot be improved because of the corresponding changes that would be required to the legacy applications that currently access them. The problems presented by legacy data sources are often too difficult to fix immediately, so you have to learn to work around them.
The goal of this article is to introduce both application
developers and Agile data engineers to the realities of working with legacy data.
For our purposes any computer artifact, including but not limited to data
and software, is considered to be a legacy asset once it is deployed and in
production. For example, the C# application and its XML database that you deployed last week are now considered to be legacy assets, even though they were built with the most modern technologies in your organization. A
legacy data source is any file, database, or software asset (such as a web
service or business application) that supplies or produces data and that has already been deployed.
For the sake of brevity we will only focus on the data aspects of legacy
software assets.
The need to work with legacy data constrains a development team:

- It reduces their flexibility because they cannot easily manipulate the source data schema to reflect the needs of their object schema (see Mapping Objects to RDBs).
- Legacy data often doesn't provide the full range of information required by the team because the data does not reflect their new requirements.
- Legacy data is often constrained itself by the other applications that work with it, constraints that are then put on your team.
- Legacy data is often difficult to work with because of a combination of quality, design, architecture, or political issues.
Table of Contents

- Sources of Legacy Data
- Common Problems With Legacy Data
- Strategies for Working With Legacy Data
- Data Integration Technologies
- How Common is This?
- What You Have Learned
1. Sources of Legacy Data
Where does legacy data come from?
Virtually everywhere. Figure
1 indicates that there are many sources from which you may obtain legacy
data. These include existing databases, which are often relational, although you may also encounter non-relational databases such as hierarchical, network, object, XML, object/relational, and NoSQL databases. Files, such as XML documents or "flat files" like configuration files and comma-delimited text files, are also common sources of legacy data.
Software, including legacy applications that have been wrapped (perhaps
via CORBA) and legacy services such as web services or CICS transactions, can
also provide access to existing information.
The point to be made is that there is often far more to gaining access to
legacy data than simply writing an SQL query against an existing relational
database.
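For example, a comma-delimited flat file is accessed with ordinary file handling rather than SQL. Here is a minimal sketch in Python, assuming a hypothetical extract file and column names:

```python
import csv

# Hypothetical comma-delimited legacy extract with a header row:
# cust_id,cust_nm,phone_no
with open("legacy_customers.csv", newline="") as source:
    for row in csv.DictReader(source):
        print(row["cust_id"], row["cust_nm"], row["phone_no"])
```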
Figure 1. Legacy data
sources.

2. Common Problems With Legacy Data

What type of problems are you likely to experience with legacy data? There are three technical issues, all of which contribute to your organization's technical debt (arguably, the technical debt associated with legacy data sources should be referred to as data debt), and one non-technical issue to be concerned with:
- Data quality challenges
- Database design problems
- Data architecture problems
- Process-related challenges
2.1 Common Data Quality Problems

Table 1 lists the most common data quality problems that you may encounter, indicating potential database refactorings (see the database refactoring catalog) that you could apply to resolve each one. It is
important to understand that any given data source may suffer from several of
these problems, and sometimes a single data column/field may even experience
several problems.
Agile data engineers will work with application programmers to identify their data needs, then identify potential sources for that data, and, in the case of legacy data, help them access it. Part of the job of accessing the data is to help application developers transform and cleanse it to make it usable. Agile data engineers will be aware of the potential problems summarized in Table 1 and will work closely with application programmers to overcome these challenges.
Table 1. Typical Legacy Data Problems.

| Problem | Potential Database Refactorings |
| --- | --- |
| A single column is used for several purposes | Split Column (to Notes) |
| The purpose of a column is determined by the value of one or more other columns | Remove Unused Column (to remove DateType); Split Column (to PersonDate) |
| Inconsistent data values | Introduce Trigger(s) for Calculated Column (between BirthDate and AgeInYears); Remove Redundant Column (to AgeInYears) |
| Inconsistent/incorrect data formatting | Introduce Common Format |
| Missing data | N/A |
| Missing columns | N/A |
| Additional columns | Introduce Default Value to a Column; Remove Redundant Column |
| Multiple sources for the same data | N/A |
| Important entities, attributes, and relationships are hidden and floating in text fields | Replace Blob With Table; Split Column |
| Data values that stray from their field descriptions and business rules | Split Column |
| Various key strategies for the same type of entity | Consolidate Key Strategy For Entity |
| Unrealized relationships between data records | Introduce Explicit Relationship |
| One attribute is stored in several fields | Combine Columns Representing a Single Concept |
| Inconsistent use of special characters | Introduce Common Format |
| Different data types for similar columns | Apply Standard Types to Similar Data |
| Different levels of detail | Introduce Calculated Column; Replace Column |
| Different modes of operation | Separate Read-Only Data |
| Varying timeliness of data | Separate Data Based on Timeliness |
| Varying default values | Introduce Default Value to a Column |
| Various representations | Apply Standard Codes; Apply Standard Types to Similar Data |
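To give a feel for the cleansing work these problems imply, here is a minimal sketch of application-side code addressing the "inconsistent/incorrect data formatting" problem by normalizing phone numbers into a common format. The column and target format are hypothetical:

```python
import re

def normalize_phone(raw: str) -> str:
    """Normalize legacy phone values such as '(416) 555-1234',
    '416.555.1234', or '14165551234' into a common NNN-NNN-NNNN format."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code
    if len(digits) != 10:
        raise ValueError(f"unrecognized phone value: {raw!r}")
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

assert normalize_phone("(416) 555-1234") == "416-555-1234"
assert normalize_phone("1.416.555.1234") == "416-555-1234"
```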
2.2 Common Database Design Problems

The second category of problem with legacy data sources that Agile data engineers need to be aware of is fundamental design problems. Existing data designs, and even new data designs, are rarely perfect and often suffer from significant challenges. Common data design problems that you will likely discover include:
- Database encapsulation scheme exists, but it's difficult to use
- Ineffective (or no) naming conventions
- Inadequate documentation
- Original design goals at odds with current needs
- Inconsistent key strategy
These design problems may be the result of poor database design in the first place; perhaps the designers did not have a good understanding of data modeling. Sometimes the initial
design of a data source was very good but over time the quality degraded as
ill-advised schema changes were made, something referred to as schema entropy. Once again, the Agile data engineer will need to work closely with
application programmers to overcome these problems.
Their past experience dealing with similar design problems,
as well as their personal relationship with the owners of the legacy data
source(s), will prove to be a valuable asset to the team.
2.3 Common Data Architecture Problems
Agile data engineers need to be aware of the problems with the data
architecture within your enterprise, information that they will often gain
through discussions with enterprise architects.
These problems typically result from development teams not conforming to an enterprise architectural vision (such a vision seldom exists) or simply not being aware of data architecture issues. Some of the potential data architecture problems that you may discover include:
- Applications responsible for data cleansing
- Different database paradigms
- Different hardware platforms
- Different storage devices
- Fragmented data sources
- Inaccessible data
- Inconsistent semantics
- Inflexible architecture
- Lack of event notification
- Redundant data sources
- No or inefficient security
- Varying timeliness of data sources
A common implication of these architecture problems is that you need to put an effective data access approach in place, such as introducing a staging database or a robust data encapsulation strategy. Staging databases are discussed below
and encapsulation
strategies are covered in another chapter.
2.4 Common Software Process Problems When Working With Legacy Data
The technical challenges associated with legacy data are
bad enough, although unfortunately non-technical ones often overshadow them.
The most difficult aspect of software development is to get people to
work together effectively, and dealing with legacy data is no exception.
Organizations will often hobble development teams because they are
unable, or unwilling, to define and then work towards an effective vision.
When it comes to working with legacy data there are several common
process-oriented mistakes that I have seen organizations make:
- Working with legacy data when you don't need to.
- Data design drives your object model.
- Legacy data issues overshadow everything else.
- Application developers ignore legacy data issues.
- You choose to not refactor the legacy data source.
- Politics.
- You don't see the software forest for the legacy data trees.
- You don't put contract models in place.
3. Strategies for Working With Legacy Data
My assumption in this section is that your team needs to access one or more sources of legacy data but is not responsible for an organization-wide data conversion effort; e.g., you are not working on an Enterprise Application Integration (EAI) initiative, although you may be working on a data warehouse (DW)/Business Intelligence (BI) team. That isn't to say that the advice presented below couldn't be modified for such a situation. However, because the focus of this method is on ways of thinking (WoT) and ways of working (WoW) that Agile data engineers and application developers can apply when developing business applications, this section will remain consistent with that vision.
The fundamental strategies that you should consider when working with legacy data are:

- Try to Avoid Legacy Data
- Develop a Data Error Handling Strategy
- Work Iteratively and Incrementally
- Prefer Read-Only Legacy Data Access
- Encapsulate Legacy Data Access
- Introduce Data Adapters For Simple Data Access
- Introduce a Staging Database For Complex Data Access
- Adopt Existing Tools
3.1 Try to Avoid Legacy Data

The simplest solution is to not work with legacy data at all. If you can avoid working with legacy data, and therefore avoid the constraints that it places on you, then do so. There are several strategies that your team may apply to avoid working with legacy data, or at least to avoid a complex conversion effort, presented here in order from simplest to most complex:
- Create your own, stand-alone database.
- Reprioritize/drop functionality that requires legacy data access. Your stakeholders may decide to forgo some functionality once they realize the cost of accessing the legacy data it requires.
- Accept legacy data as is. Your team chooses to directly access the data without a conversion effort.
- Refactor the legacy data source. The legacy system owners improve the quality of the legacy data source, allowing your team to work with high-quality legacy data.
An interesting observation is that when you take a big design up front (BDUF) approach to development, where your database schema is created early in the life of your initiative, you are effectively inflicting a legacy schema on yourself. Don't do this.
3.2 Develop a Data Error Handling Strategy

It should be clear by now that you are very likely to discover quality problems with the source data. When this happens you will want to apply one or more of the following strategies for handling the error (a sketch combining them follows the list):
- Convert the faulty data.
- Drop the faulty data.
- Log the error.
- Fix the source data.
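A minimal sketch of how these strategies can combine in conversion code; the record layout and the specific conversion rule are assumptions for illustration:

```python
import logging
from typing import Optional

log = logging.getLogger("legacy_conversion")

def handle_record(record: dict) -> Optional[dict]:
    """Apply the error-handling strategies in order of preference:
    convert the faulty data if possible, otherwise log the error and
    drop the record. Fixing the source data itself is an out-of-band
    task for the owners of the legacy data source."""
    try:
        # Convert the faulty data: coerce a legacy free-text age field.
        record["age_in_years"] = int(str(record["age_in_years"]).strip())
        return record
    except (KeyError, ValueError):
        # Log the error so the data owners can fix the source later ...
        log.warning("dropping faulty record: %r", record)
        # ... and drop the faulty data from this conversion run.
        return None
```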
3.3 Work Iteratively and Incrementally

Agile software developers work in an iterative and incremental (evolutionary) manner; the really good ones work in a disciplined agile manner. It is possible for data professionals to work in this manner too, but they must choose to do so.
Agile developers will not attempt to write the data access/conversion
code in one fell swoop.
Instead they will write only the data-oriented code that they
require for the business requirements that they are currently working on.
Therefore their data-oriented code will grow and evolve in an iterative
and incremental fashion, just as the code for the rest of the application
evolves.
Working with legacy data, and in particular converting it into a cleaner and more usable design, is often viewed by traditional developers as a large and onerous task. They're partially right: it is an onerous task, but it doesn't have to be a large one. You can break the problem up into smaller portions and tackle them one at a time. It's like the old adage: "How do you eat an elephant? One bite at a time." Database refactoring is a technique for improving the design of a database schema in exactly this manner. It is possible to work iteratively and incrementally when it comes to data-oriented efforts, but you have to choose to do so. Yes, many data professionals are more comfortable taking a serial approach to development, but this is simply not an option for modern development efforts. Choose to try new ways to work.
3.4 Prefer Read-Only Legacy Data Access
It can be exceptionally difficult to address many of the data quality problems and database design problems described earlier even when you only have to read the data. My experience is that it is often an order of magnitude harder to support both reading and writing to a legacy data source than just reading from it. For example, say legacy data values X and Y both map to the "fixed" value A.
If your application needs to update the legacy value, what should A be
written back as, X or Y? The
fundamental issue is that to support both read and write data access you need to
define conversion rules for each direction.
Writing data to a legacy data source entails greater risk than simply
reading it because when you write data you must preserve its semantics -
semantics that you may not fully comprehend without extensive analysis of the
other systems that also write to that database.
The implication is that it is clearly to your advantage to avoid updating
legacy data sources whenever possible.
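A tiny sketch of that asymmetry, using hypothetical code values: inverting the read-direction mapping yields two candidate legacy values, so an unambiguous write-direction rule must be defined separately.

```python
# Read direction: two distinct legacy codes collapse to one clean value.
READ_MAP = {"X": "A", "Y": "A"}

# Write direction: inverting READ_MAP loses information. Should a
# clean "A" be written back to the legacy source as "X" or as "Y"?
WRITE_CANDIDATES: dict[str, list[str]] = {}
for legacy, clean in READ_MAP.items():
    WRITE_CANDIDATES.setdefault(clean, []).append(legacy)

print(WRITE_CANDIDATES["A"])  # ['X', 'Y']: no single correct answer
```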
3.5 Encapsulate Legacy Data Access
By
encapsulating database access you reduce coupling with a
database and thus increase its maintainability and flexibility: this is true for the database(s)
you are responsible for, and it is equally true of legacy data sources. You also reduce the burden on your application developers: they only need to know how to work with the encapsulation strategy, not with each of the individual data sources. Encapsulating
access to a legacy data source is highly desirable because you do not want to
couple your application code to data-oriented code that will need to evolve as
the legacy data sources evolve. This
can be particularly true when you need to support both read and write access to
legacy data sources and/or when multiple data sources exist.
3.6 Introduce Data Adapters For Simple Data Access

In simple situations - you have to work with one legacy data source, you only need a subset of the data, and the data is relatively clean - your best option is to introduce a class that accesses the legacy data. For example, assume you need
access to customer data stored in a legacy database.
The data that you currently require is stored in two different tables; there are several minor problems with the quality of the data and one relatively complicated data quality issue.
You decide to create a class called CustomerDataAdapter that
encapsulates all of the functionality to work with this legacy data.
This class would include the code necessary to read the data, and write
it as well if required.
It would also implement the functionality required to convert
the legacy data into something usable by your business classes, and back again
if need be. When a customer object
requires data it requests it via CustomerDataAdapter, obtaining the data
it needs at the time. If another type of business class requires legacy data, for example the Order class, then you would implement an OrderDataAdapter to do so - one data adapter class per business class.
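As a minimal sketch of the idea, here is what such an adapter might look like in Python against a SQLite connection; the legacy table names, column names, and cleansing rules are assumptions, not a definitive design:

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Customer:
    customer_id: int
    name: str
    phone: str

class CustomerDataAdapter:
    """Encapsulates all access to the legacy customer data: reads the two
    source tables, applies the cleansing rules, and would also convert
    clean values back to the legacy representation if writes were needed."""

    def __init__(self, connection: sqlite3.Connection):
        self._db = connection

    def find(self, customer_id: int) -> Customer:
        row = self._db.execute(
            "SELECT c.cust_id, c.cust_nm, p.phone_no "
            "FROM cust c JOIN cust_phone p ON p.cust_id = c.cust_id "
            "WHERE c.cust_id = ?",
            (customer_id,),
        ).fetchone()
        if row is None:
            raise LookupError(f"no customer with id {customer_id}")
        # Cleansing: legacy names are space-padded and phone formats vary.
        return Customer(
            customer_id=row[0],
            name=row[1].strip(),
            phone=row[2].strip(),  # a fuller adapter would normalize formats
        )
```

The business object asks the adapter for its data rather than touching the legacy tables directly, which keeps the conversion rules in one place.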
3.7 Introduce a Staging Database For Complex Data Access

As your initiative progresses you may discover that the data adapter approach isn't sufficient. Perhaps
your application requires better performance, which can only be achieved through a batch approach to converting the legacy data.
Perhaps there is another data conversion effort in progress within your
organization that you want to take advantage of, one that is based on
introducing a new database schema. Perhaps
your legacy data needs are so complex it has become clear to you that a new
approach is needed.
A
staging database can be introduced for the sole purpose of providing easy access
to legacy data.
The idea is that data converters are written, perhaps by refactoring your data adapters, which access the data of a single legacy data source, then cleanse it, and finally write it into the staging database.
If the legacy data needs to be updated then similar code needs to be
written to support conversion in the opposite direction.
The main advantage of this approach is that legacy data problems can be
addressed without your application even being aware of them - from the point of
view of your application it's working with nice, clean legacy data.
The main disadvantage is the additional complexity inherent in the
approach.
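A minimal sketch of a one-way converter feeding a staging database, again with hypothetical table and column names; a production converter would also apply the error-handling strategies described earlier:

```python
import sqlite3

def convert_customers(legacy: sqlite3.Connection,
                      staging: sqlite3.Connection) -> None:
    """Read customer rows from the legacy source, cleanse them, and write
    the clean rows into the staging database that the application uses."""
    rows = legacy.execute("SELECT cust_id, cust_nm, phone_no FROM cust")
    for cust_id, name, phone in rows:
        staging.execute(
            "INSERT OR REPLACE INTO customer (customer_id, name, phone) "
            "VALUES (?, ?, ?)",
            (cust_id, name.strip(), phone.strip()),  # cleansing happens here
        )
    staging.commit()
```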
3.8 Adopt Existing Tools

Your organization may have existing tools and facilities in place that you can use to access legacy data.
For example you may have a corporate license for one or more
Extract-Transform-Load (ETL) tools that are typically used for large-scale data
conversion initiatives. Perhaps other application teams have already written
data adapters or data converters that your team can reuse. In short,
reuse existing resources whenever
possible.
4. Data Integration Technologies

There are several important technologies available to you for integrating legacy data sources. My goal here is simply to make you aware that each one exists and that you have choices available to you. These technologies include:
- Service-based technology.
- Consolidated database(s).
- Messaging-based approaches.
- Common Warehouse Metamodel (CWM).
- Extensible Markup Language (XML).
When choosing data integration technologies for your team, the most important thing that an Agile data engineer can do is work with your enterprise architects and administrators to ensure that your team's choices reflect the long-term architectural vision for your organization. Ideally this vision is already well known, although when you are working with new technologies, or when your organization is still in the process of defining the vision, you may discover that you need to work closely with enterprise personnel to get this right.
5. How Common is This?

The need to perform legacy analysis is very common on agile teams; the majority of agile teams work with legacy assets in some way. In fact, the 2013 Agile Project Initiation Survey found that 72% of agile teams worked with legacy assets in some manner. The 2012 Agility at Scale Survey (see Figure 2) found that agile teams were both succeeding and failing when faced with technical complexities such as dealing with legacy data. Perhaps the teams that ran into trouble didn't heed the advice in this article. ;-)
Figure 2. Agile experiences with technical complexity.

6. What You Have Learned
Working with legacy data is a common, and often very frustrating, reality of software development. There is often a wide variety of problems with legacy data, including data quality, data design, data architecture, and political/process-related issues.
This article
explored these problems in detail, giving you the background
that you require to begin dealing with them effectively.
You were also introduced to a collection of strategies and
technologies for working with legacy data.
The first is simply to avoid working with it if possible: why needlessly suffer these problems? You saw that working iteratively and incrementally is a viable approach for dealing with legacy data; the hardest part is choosing to work this way.
Technical solutions were also identified, including the development of
data adapters and staging databases.
Working with legacy data is a difficult task, one that I don't
wish on anyone. Unfortunately we
all have to do it, so it's better to accept this fact, gain the skills that we
need to succeed, and then get on with the work. This article has laid the
foundation from which to gain the skills that you require.