Sometimes you are in a position to develop your data schema from scratch when you are developing a new system using object-oriented technologies. If so, consider yourself amongst the lucky few, because the vast majority of developers are forced to tolerate one or more existing legacy data designs. Worse yet, it is often presumed that these data sources cannot be improved because of the corresponding changes that would be required to the legacy applications that currently access them. The problems presented by legacy data sources are often too difficult to fix immediately, so you have to learn to work around them.
The goal of this article is to introduce both application developers and Agile data engineers to the realities of working with legacy data. For our purposes any computer artifact, including but not limited to data and software, is considered to be a legacy asset once it is deployed and in production. For example, the C# application and its XML database that you deployed last week are now legacy assets even though they are built from the most modern technologies within your organization. A legacy data source is any file, database, or software asset (such as a web service or business application) that supplies or produces data and that has already been deployed. For the sake of brevity we will focus only on the data aspects of legacy software assets.
The need to work with legacy data constrains a development team. It reduces their flexibility because they cannot easily manipulate the source data schema to reflect the needs of their object schema (see Mapping Objects to RDBs). Legacy data often doesn’t provide the full range of information required by the team because the data does not reflect their new requirements. Legacy data is often constrained itself by the other applications that work with it, constraints that are then put on your team. Legacy data is often difficult to work with because of a combination of quality, design, architecture, or political issues.
Table of Contents
- Sources of Legacy Data
- Common Problems With Legacy Data
- Strategies for Working With Legacy Data
- Data Integration Technologies
- What You Have Learned
Where does legacy data come from? Virtually everywhere. Figure 1 indicates that there are many sources from which you may obtain legacy data. These include existing databases, often relational, although you may also encounter non-relational databases such as hierarchical, network, object, XML, object/relational, and NoSQL databases. Files, such as XML documents or "flat files" such as configuration files and comma-delimited text files, are also common sources of legacy data. Software, including legacy applications that have been wrapped (perhaps via CORBA) and legacy services such as web services or CICS transactions, can also provide access to existing information. The point to be made is that there is often far more to gaining access to legacy data than simply writing an SQL query against an existing relational database.
What type of problems are you likely to experience with legacy data? There are three technical issues, all of which contribute to your organization’s technical debt (arguably technical debt associated with legacy data sources should be referred to as data debt), and one non-technical issue to be concerned with:
- Data quality challenges
- Database design problems
- Data architecture problems
- Process-related challenges
Table 1 lists the most common problems that you may encounter, indicating potential database refactorings (see the database refactoring catalog) that you could apply to resolve the problem. It is important to understand that any given data source may suffer from several of these problems, and sometimes a single data column/field may even experience several problems.
Agile data engineers will work with application programmers to identify their data needs, to then identify potential sources for that data, and in the case of legacy data to help them to access that data. Part of the job of accessing the data is to help application developers to transform and cleanse the data to make it usable. Agile data engineers will be aware of the potential problems summarized in Table 1 and will work closely with the application programmers to overcome the challenges.
| Problem | Potential Database Refactoring(s) |
|---|---|
| A single column is used for several purposes | Split Column (to Notes) |
| The purpose of a column is determined by the value of one or more other columns | Remove Unused Column (to remove DateType), Split Column (to PersonDate) |
| Inconsistent data values | Introduce Trigger(s) for Calculated Column (between BirthDate and AgeInYears), Remove Redundant Column (to AgeInYears) |
| Inconsistent/incorrect data formatting | Introduce Common Format |
| Additional columns | Introduce Default Value to a Column, Remove Redundant Column |
| Multiple sources for the same data | N/A |
| Important entities, attributes, and relationships are hidden and floating in text fields | Replace Blob With Table |
| Data values that stray from their field descriptions and business rules | Split Column |
| Various key strategies for the same type of entity | Consolidate Key Strategy For Entity |
| Unrealized relationships between data records | Introduce Explicit Relationship |
| One attribute is stored in several fields | Combine Columns Representing a Single Concept |
| Inconsistent use of special characters | Introduce Common Format |
| Different data types for similar columns | Apply Standard Types to Similar Data |
| Different levels of detail | Introduce Calculated Column, Replace Column |
| Different modes of operation | Separate Read-Only Data |
| Varying timeliness of data | Separate Data Based on Timeliness |
| Varying default values | Introduce Default Value to a Column |
| Various representations | Apply Standard Codes, Apply Standard Types to Similar Data |
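To make one of these refactorings concrete, here is a minimal sketch of the Split Column refactoring from the table above, using an in-memory SQLite database. The `person` table, its columns, and the transition trigger are hypothetical examples, not from any specific system; a real refactoring would also schedule a transition period in which legacy applications are migrated off the old column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Legacy design: one column used for two purposes (first and last name).
cur.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, full_name TEXT)")
cur.execute("INSERT INTO person VALUES (1, 'Sally Jones')")

# Step 1: introduce the new columns alongside the old one.
cur.execute("ALTER TABLE person ADD COLUMN first_name TEXT")
cur.execute("ALTER TABLE person ADD COLUMN last_name TEXT")

# Step 2: migrate the existing data into the new columns.
for pk, full in cur.execute("SELECT id, full_name FROM person").fetchall():
    first, _, last = full.partition(" ")
    cur.execute(
        "UPDATE person SET first_name = ?, last_name = ? WHERE id = ?",
        (first, last, pk))

# Step 3 (transition period): keep the old column synchronized via a trigger
# so legacy applications continue to work until they are migrated.
cur.execute("""
    CREATE TRIGGER sync_full_name
    AFTER UPDATE OF first_name, last_name ON person
    BEGIN
        UPDATE person SET full_name = NEW.first_name || ' ' || NEW.last_name
        WHERE id = NEW.id;
    END
""")

print(cur.execute("SELECT first_name, last_name FROM person").fetchone())
# ('Sally', 'Jones')
```

Once every application has been migrated to the new columns, the final step of the refactoring is to drop the trigger and the original `full_name` column.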
The second category of problems with legacy data sources that Agile data engineers need to be aware of is fundamental design problems. Existing data designs, and even new data designs, are rarely perfect and often suffer from significant challenges.
Common data design problems you will likely discover:
- Database encapsulation scheme exists, but it’s difficult to use
- Ineffective (or no) naming conventions
- Inadequate documentation
- Original design goals at odds with current needs
- Inconsistent key strategy
These design problems may be the result of poor database design in the first place; perhaps the designers did not have a good understanding of data modeling. Sometimes the initial design of a data source was very good, but over time its quality degraded as ill-advised schema changes were made, something referred to as schema entropy. Once again, the Agile data engineer will need to work closely with application programmers to overcome these problems. Their past experience dealing with similar design problems, as well as their personal relationships with the owners of the legacy data source(s), will prove to be a valuable asset to the team.
Agile data engineers need to be aware of the problems with the data architecture within your enterprise, information that they will often gain through discussions with enterprise architects. These problems typically result from development teams not conforming to an enterprise architectural vision (such a vision seldom exists) or because the team simply wasn’t aware of data architectural issues. Some of the potential data architecture problems that you may discover include:
- Applications responsible for data cleansing
- Different database paradigms
- Different hardware platforms
- Different storage devices
- Fragmented data sources
- Inaccessible data
- Inconsistent semantics
- Inflexible architecture
- Lack of event notification
- Redundant data sources
- Ineffective (or no) security
- Varying timeliness of data sources
A common implication of these architecture problems is that you need to put an effective data access approach in place such as introducing a staging database or a robust data encapsulation strategy. Staging databases are discussed below and encapsulation strategies are covered in another chapter.
The technical challenges associated with legacy data are bad enough, although unfortunately non-technical ones often overshadow them. The most difficult aspect of software development is to get people to work together effectively, and dealing with legacy data is no exception. Organizations will often hobble development teams because they are unable, or unwilling, to define and then work towards an effective vision. When it comes to working with legacy data there are several common process-oriented mistakes that I have seen organizations make:
- Working with legacy data when you don’t need to.
- Data design drives your object model.
- Legacy data issues overshadow everything else.
- Application developers ignore legacy data issues.
- You choose to not refactor the legacy data source.
- You don’t see the software forest for the legacy data trees.
- You don’t put contract models in place.
My assumption in this section is that your team needs to access one or more sources of legacy data but that it is not responsible for an organization-wide data conversion effort, e.g. you are not working on an Enterprise Application Integration (EAI) initiative, although you may be working on a data warehouse (DW)/Business Intelligence (BI) team. That isn’t to say that the advice presented below couldn’t be modified for such a situation. However, because the focus of this method is on ways of thinking (WoT) and ways of working (WoW) that Agile data engineers and application developers can apply when developing business applications this section will remain consistent with that vision.
The fundamental strategies that you should consider for working with legacy data for use with your application are:
- Try to Avoid Legacy Data
- Develop a Data Error Handling Strategy
- Work Iteratively and Incrementally
- Prefer Read-Only Legacy Data Access
- Encapsulate Legacy Data Access
- Introduce Data Adapters For Simple Data Access
- Introduce a Staging Database For Complex Data Access
- Adopt Existing Tools
The simplest solution is to not work with legacy data at all. If you can avoid working with legacy data, and therefore avoid the constraints that it places on you, then do so. There are several strategies that your team may try to apply in order to avoid working with legacy data, or to at least avoid a complex conversion effort. The strategies are presented in the order of simplest to most complex:
- Create your own, stand-alone database.
- Reprioritize/drop functionality that requires legacy data access. Your stakeholders may decide to forgo some functionality that requires legacy data access when they realize the cost of doing so.
- Accept legacy data as is. Your team chooses to directly access the data without a conversion effort.
- Refactor the legacy data source. The legacy system owners improve the quality of the legacy data source, allowing your team to work with high-quality legacy data.
An interesting observation is that when you take a big design up front (BDUF) approach to development where your database schema is created early in the life of your initiative you are effectively inflicting a legacy schema on yourself. Don’t do this.
It should be clear by now that you are very likely to discover quality problems with the source data. When this happens you will want to apply one or more of the following strategies for handling the error:
- Convert the faulty data.
- Drop the faulty data.
- Log the error.
- Fix the source data.
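The first three strategies above can be combined in a single conversion routine. Here is a hedged sketch: the field name (`age`), the record identifiers, and the conversion rule are hypothetical, and the fourth strategy (fixing the source data) is noted as a process decision rather than something conversion code can do by itself.

```python
import logging

log = logging.getLogger("conversion")

def convert_age(raw_value, record_id):
    """Convert a legacy age field, applying the error-handling strategies."""
    try:
        return int(raw_value)  # 1. Convert the faulty data when you can.
    except (TypeError, ValueError):
        # 3. Log the error so someone can investigate later.
        log.warning("bad age value %r in record %s", raw_value, record_id)
        return None            # 2. Drop the faulty value from the conversion.
    # 4. Fixing the source data itself is a decision made with the legacy
    #    system's owners; it cannot be done from conversion code alone.

print(convert_age("42", 1))   # 42
print(convert_age("n/a", 2))  # None
```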
Agile software developers work in an iterative and incremental (evolutionary) manner. The really good ones work in a disciplined agile manner. It is possible for data professionals to also work in this manner, but they must choose to do so. Agile developers will not attempt to write the data access/conversion code in one fell swoop. Instead they will write only the data-oriented code that they require for the business requirements that they are currently working on. Therefore their data-oriented code will grow and evolve in an iterative and incremental fashion, just as the code for the rest of the application evolves.
Working with legacy data, and in particular converting it into a cleaner and more usable design, is often viewed by traditional developers as a large and onerous task. They're partially right: it is an onerous task, but it doesn't have to be a large one, because you can break the problem into smaller portions and tackle them one at a time. It's like the old adage "How do you eat an elephant? One bite at a time". Database refactoring is a technique for improving the design of a database schema in exactly this manner. It is possible to work iteratively and incrementally when it comes to data-oriented efforts, but you have to choose to do so. Yes, many data professionals are more comfortable taking a serial approach to development, but this is simply not an option for modern development efforts. Choose to try new ways to work.
It can be exceptionally difficult to address many of the data quality problems and the database design problems described earlier when you simply have to read the data. My experience is that it is often an order of magnitude harder to support both reading from and writing to a legacy data source than just reading from it. For example, say legacy data values X and Y both map to "fixed" value A. If your application needs to update the legacy value, what should A be written back as, X or Y? The fundamental issue is that to support both read and write data access you need to define conversion rules for each direction. Writing data to a legacy data source entails greater risk than simply reading it, because when you write data you must preserve its semantics, semantics that you may not fully comprehend without extensive analysis of the other systems that also write to that database. The implication is that it is clearly to your advantage to avoid updating legacy data sources whenever possible.
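The X/Y/A example above can be made concrete in a few lines. This is an illustrative sketch only; the codes and the cleansed value are the hypothetical ones from the paragraph. Reading is a straightforward many-to-one mapping, but the inverse mapping needed for writing is not well defined:

```python
# Reading is easy: a many-to-one mapping from legacy codes to cleansed values.
READ_MAP = {"X": "A", "Y": "A"}

def read_value(legacy_code):
    return READ_MAP[legacy_code]

def write_value(cleansed):
    # Writing is ambiguous: which legacy code should 'A' be written back as?
    candidates = [code for code, value in READ_MAP.items() if value == cleansed]
    if len(candidates) != 1:
        raise ValueError(
            f"no unambiguous legacy code for {cleansed!r}: {candidates}")
    return candidates[0]

print(read_value("X"))  # A
try:
    write_value("A")
except ValueError as exc:
    print(exc)  # the inverse mapping is not defined without an extra rule
```

Defining that extra write-direction rule (for example, "always write X", or "preserve whichever code the record already had") requires exactly the kind of semantic analysis of the other writing systems that the paragraph above describes.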
By encapsulating database access you reduce coupling with a database and thus increase its maintainability and flexibility: this is true for the database(s) you are responsible for, and it is true of legacy data sources. You also reduce the burden on your application developers: they only need to know how to work with the encapsulation strategy, not with all of the individual data sources. Encapsulating access to a legacy data source is highly desirable because you do not want to couple your application code to data-oriented code that will need to evolve as the legacy data sources evolve. This is particularly true when you need to support both read and write access to legacy data sources and/or when multiple data sources exist.
In simple situations – you have to work with one legacy data source, you only need a subset of the data, and the data is relatively clean – your best option is to introduce a class that accesses the legacy data. For example, assume you need access to customer data stored in a legacy database. The data that you currently require is stored in two different tables, there are several minor problems with the quality of the data, and one relatively complicated data quality issue. You decide to create a class called CustomerDataAdapter that encapsulates all of the functionality to work with this legacy data. This class would include the code necessary to read the data, and write it as well if required. It would also implement the functionality required to convert the legacy data into something usable by your business classes, and back again if need be. When a customer object requires data it requests it via CustomerDataAdapter, obtaining the data it needs at the time. If another type of business class required legacy data, for example the Order class, then you would implement an OrderDataAdapter to do this – one data adapter class per business class.
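A minimal sketch of such a CustomerDataAdapter follows, using an in-memory SQLite database as a stand-in for the legacy source. The legacy schema (`cust_master`, `cust_contact`) and its column names are invented for illustration; the cleansing rules (trimming padded names, normalizing phone numbers) stand in for whatever quality problems your actual data exhibits.

```python
import sqlite3

class Customer:
    """Business class that works with cleansed data only."""
    def __init__(self, customer_id, name, phone):
        self.customer_id, self.name, self.phone = customer_id, name, phone

class CustomerDataAdapter:
    """Encapsulates all access to the legacy customer data."""

    def __init__(self, connection):
        self.conn = connection

    def find(self, customer_id):
        # The data we need is spread across two legacy tables.
        row = self.conn.execute(
            """SELECT m.cust_id, m.cust_nm, c.phone_no
               FROM cust_master m
               JOIN cust_contact c ON c.cust_id = m.cust_id
               WHERE m.cust_id = ?""", (customer_id,)).fetchone()
        if row is None:
            return None
        # Cleanse on the way in: legacy names are fixed-width and padded,
        # phone numbers are inconsistently formatted.
        return Customer(row[0], row[1].strip(), self._clean_phone(row[2]))

    @staticmethod
    def _clean_phone(raw):
        digits = "".join(ch for ch in raw if ch.isdigit())
        return digits or None

# Stand-in for the legacy database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust_master (cust_id INTEGER, cust_nm TEXT)")
conn.execute("CREATE TABLE cust_contact (cust_id INTEGER, phone_no TEXT)")
conn.execute("INSERT INTO cust_master VALUES (1, 'Sally Jones   ')")
conn.execute("INSERT INTO cust_contact VALUES (1, '(416) 555-1234')")

adapter = CustomerDataAdapter(conn)
customer = adapter.find(1)
print(customer.name, customer.phone)  # Sally Jones 4165551234
```

Note that the `Customer` business class never sees the legacy table names, column names, or formatting quirks; if the legacy schema changes, only the adapter needs to change.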
As your initiative progresses you may discover that the data adapter approach isn’t sufficient. Perhaps your application requires better performance that can only be achieved through a batch approach to converting the legacy data. Perhaps there is another data conversion effort in progress within your organization that you want to take advantage of, one that is based on introducing a new database schema. Perhaps your legacy data needs are so complex it has become clear to you that a new approach is needed.
A staging database can be introduced for the sole purpose of providing easy access to legacy data. The idea is that data converters are written, perhaps by refactoring your data adapters, to access the data of a single legacy data source, then cleanse the data, and finally write it into the staging database. If the legacy data needs to be updated then similar code needs to be written to support conversion in the opposite direction. The main advantage of this approach is that legacy data problems can be addressed without your application even being aware of them – from the point of view of your application it's working with nice, clean legacy data. The main disadvantage is the additional complexity inherent in the approach.
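The converter side of this approach can be sketched as a batch job that reads from the legacy source, cleanses, and writes to the staging database. Both schemas below (`cust` on the legacy side, `customer` on the staging side) and the cleansing rules are hypothetical examples; a real converter would be scheduled and would handle far more columns and error cases.

```python
import sqlite3

# Stand-ins for the legacy source and the staging database.
legacy = sqlite3.connect(":memory:")
staging = sqlite3.connect(":memory:")

legacy.execute("CREATE TABLE cust (id INTEGER, name TEXT, status TEXT)")
legacy.executemany(
    "INSERT INTO cust VALUES (?, ?, ?)",
    [(1, "SALLY JONES ", "A"),      # padded, shouting-case name
     (2, "JOHN DOE", "act")])       # inconsistent status code

staging.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, active INTEGER)")

# Cleansing rules: trim and title-case names; map inconsistent status codes.
ACTIVE_CODES = {"A", "ACT", "ACTIVE"}

def convert_customers():
    for cid, name, status in legacy.execute("SELECT id, name, status FROM cust"):
        staging.execute(
            "INSERT INTO customer VALUES (?, ?, ?)",
            (cid,
             name.strip().title(),
             1 if status.strip().upper() in ACTIVE_CODES else 0))
    staging.commit()

convert_customers()
print(staging.execute("SELECT name, active FROM customer ORDER BY id").fetchall())
# [('Sally Jones', 1), ('John Doe', 1)]
```

The application then works solely against the `customer` table in the staging database and never sees the legacy quirks, which is exactly the advantage described above.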
Your organization may have existing tools and facilities in place that you can use to access existing legacy data. For example you may have a corporate license for one or more Extract-Transform-Load (ETL) tools that are typically used for large-scale data conversion initiatives. Perhaps other application teams have already written data adapters or data converters that your team can reuse. In short, reuse existing resources whenever possible.
There are several important technologies available to you for integrating legacy data sources. My goal here is to make you aware that each one exists, that you have choices available to you. These technologies include:
- Service-based technology.
- Consolidated database(s).
- Messaging-based approaches.
- Common Warehouse Metamodel (CWM).
- Extensible Markup Language (XML).
When choosing data integration technologies for your team the most important thing that an Agile data engineer can do is to work with your enterprise architects and administrators to ensure that your team’s choices reflect the long term architectural vision for your organization. Ideally this vision is well known already, although when you are working with new technologies or when your organization is in the process of defining the vision you may discover that you need to work with enterprise personnel closely to get this right.
Working with legacy data is a common, and often very frustrating, reality of software development. There are often a wide variety of problems with the legacy data, including data quality, data design, data architecture, and political/process related issues. This article explored these problems in detail, giving you the background that you require to begin dealing with them effectively.
You were also introduced to a collection of strategies and technologies for working with legacy data. The first is to avoid working with it if possible: why needlessly suffer these problems? You saw that working iteratively and incrementally is a viable approach for dealing with legacy data; the hardest part is choosing to work this way. Technical solutions were also identified, including the development of data adapters and staging databases.
Working with legacy data is a difficult task, one that I don’t wish on anyone. Unfortunately we all have to do it, so it’s better to accept this fact, gain the skills that we need to succeed, and then get on with the work. This article has laid the foundation from which to gain the skills that you require.