The Process of Database Refactoring: Improve Production Data Quality
Table of Contents
- Database Refactoring
- Why Database Refactoring is Hard
- How to Refactor Your Database
- Adopting Database Refactoring Within Your Organization
- Database Refactoring Best Practices
- Database Refactoring in Context
- Parting Thoughts
Martin Fowler (1999) describes a programming technique called refactoring, a disciplined way to restructure code. The basic idea is that you make small changes to your code to improve your design, making it easier to understand and to modify. Refactoring enables you to evolve your code slowly over time, to take an iterative and incremental approach to programming. Martin’s refactoring site, www.refactoring.com, is a good online resource.
A critical aspect of a refactoring is that it retains the behavioral semantics of your code, at least from a black box point of view. For example there is a very simple refactoring called Rename Method, perhaps from getPersons() to getPeople(). Although this change looks easy on the surface you need to do more than just make this single change, you must also change every single invocation of this operation throughout all of your application code to invoke the new name. Once you’ve made these changes then you can say you’ve truly refactored your code because it still works again as before.
It is important to understand that you do not add functionality when you are refactoring. When you refactor you improve existing code, when you add functionality you are adding new code. Yes, you may need to refactor your existing code before you can add new functionality. Yes, you may discover later on that you need to refactor the new code that you just added. The point to be made is that refactoring and adding new functionality are two different but complementary tasks.
2. Database Refactoring
In the February 2002 issue of Software Development I described a technique that I called data refactoring. This article described my preliminary experiences at something that should more appropriately have been called database refactoring in hindsight. Hence the new name. From this point forward I’ll use the term code refactoring to refer to traditional refactoring as described by Fowler to distinguish it from database refactoring.
Let’s start with some definitions. A database refactoring is a simple change to a database schema that improves its design while retaining both its behavioral and informational semantics in a practical manner. For the sake of this discussion a database schema includes both structural aspects such as table and view definitions as well as functional aspects such as stored procedures and triggers. An interesting thing to note is that a database refactoring is conceptually more difficult than a code refactoring; code refactorings only need to maintain behavioral semantics while database refactorings also must maintain informational semantics.
There is a database refactoring named Split Column, one of many described in A Catalog of Database Refactorings, where you replace a single table column with two or more other columns. For example you are working on the Person table in your database and discover that the FirstDate column is being used for two distinct purposes – when the person is a customer this column stores their birth date and when the person is an employee it stores their hire date. Your application now needs to support people who can be both a customer and an employee so you’ve got a problem. Before you can implement this new requirement you need to fix your database schema by replacing the FirstDate column with BirthDate and HireDate columns. To maintain the behavioral semantics of your database schema you need to update all source code that accesses the FirstDate column to now work with the two new columns. To maintain the informational semantics you will need to write a migration script that loops through the table, determines the type, then copies the existing date into the appropriate column. Although this sounds easy, and sometimes it is, my experience is that database refactoring is incredibly difficult in practice when cultural issues are taken into account (more on this later).
2.1 Why Database Refactoring?
Agile methodologies such as Scrum and Disciplined Agile Delivery (DAD) take an iterative and incremental approach to software development. Application developers on agile teams typically forsake big design up front (BDUF) approaches in favor of emergent approaches where the design of a system evolves throughout the life of the initiative. On an agile development initiative the final design often isn’t known until the application is ready to be released. This is a very different way to work for many experienced IT professionals.
The implication is that the traditional approach of creating a (nearly) complete set of logical and physical data models up front isn’t going to work. The main advantage of the traditional approach is that it makes the job of the data engineer much easier – the data schema is put into place early and that’s what people use. However there are several disadvantages. First, it requires the designers to get it right early, forcing you to identify most requirements even earlier in the initiative, and therefore forcing your team into taking a serial approach to development. Second, it doesn’t support change easily. As things progress your stakeholders understanding of what they need will evolve, motivating them to evolve their requirements. The business environment will also change during your effort, once again motivating your stakeholders to evolve their requirements. In short the traditional way of working simply doesn’t work well in an agile environment. If agile data engineers are going to work on and support teams that are following agile methodologies they need to find techniques that support working iteratively and incrementally. My experience is that one critical technique is database refactoring.
There are three fundamental reasons why you should be interested in database refactoring:
- To safely fix existing legacy databases. The bottom line is that legacy databases aren’t going to fix themselves, and that from a technical point of view database refactoring is a safe, simple way to improve data, and database, quality over time. My various surveys into data quality have consistently shown over the years that organizations suffer from data quality issues but in many cases do not have a realistic strategy in place to address them.
- To support evolutionary development. Modern software development processes, such as DAD, XP, and Scrum, all work in an evolutionary manner. Data professionals need to adopt techniques, including this one, which enable them to work in such a manner.
- To tune your database.. Part of your tuning efforts may include the (de)normalization of the schema.
2.2 Preserving Semantics
Informational semantics refers to the meaning of the information within the database from the point of view of the users of that information. To preserve the informational semantics implies that when you change the values of the data stored in a column the clients of that information shouldn’t be affected by the improvement. Similarly, with respect to behavioral semantics the goal is to keep the black box functionality the same – any source code that works with the changed aspects of your database schema must be reworked to accomplish the same functionality as before.
2.3 What Database Refactorings Aren’t
A small transformation to your schema to extend it, such as the addition of a new column or table, is not a database refactoring because the change extends your design. A large number of small changes simultaneously applied to your database schema, such as the renaming of ten columns, would not be considered a database refactoring because this isn’t a single, small change. Database refactorings are small changes to your database schema that improve its design while preserving the behavioral and informational semantics. That’s it. I have no doubt that you can make those changes to your schema, and you may even follow a similar process, but they’re not database refactorings.
3. Why Database Refactoring is Hard
Coupling. As you learned in Relational Databases 101 coupling is a measure of the degree of dependence between two items – the more highly coupled two things are the greater the chance that a change in one will require a change in another. Coupling is the “root of all evil” when in comes to database refactoring, the more things that your database schema is coupled to the harder it is to refactor. Unfortunately you learned in Relational Databases 101 that relational database schemas are potentially coupled to a wide variety of things:
- Your application source code
- Other application source code
- Data load source code
- Data extract source code
- Persistence frameworks/layers
- Your database schema (captured via models or scripts)
- Data migration scripts
- Test code
- Models and/or documentation
Figure 1 depicts the best-case scenario for database refactoring – when it is only your application code that is coupled to your database schema. Then Figure 2 depicts the worst-case scenario for database refactoring efforts where a wide variety of software systems are coupled to your database schema, a situation that is quite common with existing production databases.
Figure 1. The best-case scenario.
Figure 2. The worst-case scenario.
4. The Process of Database Refactoring
Before I describe the steps for refactoring a database I need to address a critical issue – Does the simple situation depicted in Figure 1 imply you’ll do different things than the highly coupled one of Figure 2? Yes and no. The fundamental process itself remains the same although the difficulty of implementing individual database refactorings increases dramatically as the coupling your database is involved with increases. If you find yourself in the simple situation then you will not need to do the transition period work described below, you can simply refactor your database schema and application code in parallel and deploy them simultaneously. People who find themselves in the more complex situation do not have this luxury.
This section is written under the assumption that your technical and cultural environments are organized to support database refactoring. Although this sounds like a big assumption, and it is, I will describe what you need to do to get to the point where these environments are in fact in place. Anything less would be inappropriate.
I like to think of database refactoring as a three-step process:
4.1 Step 1: Start In Your Development Sandbox
Your development sandbox is the technical environment where your software, including both your application code and database schema, is developed and unit tested. The need to refactor your database schema is typically identified by an application developer who is trying to implement a new requirement or who is fixing a defect. For example, a developer may need to extend their application to accept Canadian mailing addresses in addition to American addresses. The main difference is that Canadian addresses have postal codes such as R2D 2C3 instead of zip codes such as 90210-1234. Unfortunately the ZipCode column of the SurfaceAddress table is numeric and therefore will not currently support Canadian postal codes. The application developer describes the needed change to one of the Agile data engineer(s) on their team and the database refactoring effort begins.
As depicted in Figure 3, the Agile data engineer and application developer will typically work through some or all of the following steps to implement the refactoring:
- Verify that a database refactoring is required
- Choose the most appropriate database refactoring
- Deprecate the original schema
- Write unit tests
- Modify the database schema
- Migrate the source data
- Update external access programs
- Update your data migration script(s)
- Run your regression tests
- Announce the refactoring
- Version control your work
Figure 3. The process of database refactoring.
4.1.1 Verify that a Database Refactoring is Required
The first thing that the Agile data engineer does is they’ll try to determine if the database refactoring even needs to occur and if it is the right one to perform. The second thing that the Agile data engineer does is internally assess the likeliness that the change is actually needed. This is usually a “gut call” based on the Agile data engineer’s previous experience with the application developer. The next thing the Agile data engineer does is to assess the overall impact of the refactoring. In the stovepipe situation of Figure 1 this is fairly straightforward because the Agile data engineer should have an understanding of how the application is coupled to this part of the database. When this isn’t the case they’ll need to work with the application developer to do so.
In the complex case of Figure 2 the Agile data engineer will need to have an understanding of the overall technical infrastructure within your organization and how the other applications are coupled to your database. This is knowledge that they’ll need to build up over time by working with the enterprise architects, agile developers, and even other Agile data engineers. When the Agile data engineer isn’t sure of the impact they will either need to decide to make a decision at the time and go with their gut feel or decide to advise the application developer to wait while they talk to the right people. The goal of this effort is to make sure that you attempt a database refactoring that you aren’t likely going to be able to do – if you are going to need to update, test, and redeploy twenty other applications to make this refactoring then it likely isn’t viable for you to continue.
4.1.2 Choose The Most Appropriate Database Refactoring
An important skill that Agile data engineers require is the understanding that you typically have several choices for implementing new data structures and new logic within a database.
4.1.3 Write Your Unit Tests
Like code refactoring, database refactoring is enabled by the existence of a comprehensive test suite – you know you can safely change your database schema if you can easily validate that the database still works after the change. Here’s my advice:
- Your primary goal is to ensure that the tests exist.
- You should try to have each test implemented once, either at the application level or at the database level but not both.
- Some unit tests will be at the application level and some at the database level, and that’s ok.
- Go for the lowest common denominator – if the database is accessed by several applications thed any data-related tests should appear in your database test suite, helping to ensure they’re tested once.
- When you have a choice, implement the test at the level where you have the best testing tools (often at the application level). Testing tools are discussed in the Tools article.
The article Database Regression Testing describes database testing in detail.
4.1.4 Deprecate The Original Schema
An effective technique that Pramod Sadalage and Peter Schuh (2002) promote is a deprecation period, although transition period is a better term, for the original portion of the schema that you’re changing. They observe that you can’t simply make the change to your database schema instantly, that instead you need to work with both the old and the new schema in parallel for awhile to provide time for the other application teams to refactor and redeploy their systems.
Figure 4 shows how this idea would work when we apply the Replace Column database refactoring to ZipCode (this example was created in 2003, hence a removal date in 2007 – more on this later). Notice the changes between the original schema and the schema during the transition period. PostCode has been added as a column, exactly what you would expect. The ZipCode column has been marked as deprecated – you know this because a removal date has been assigned to it using a UML named variable. A trigger was also introduced to keep the values contained in the two columns synchronized, the assumption being that new application code will work with PostCode but should not be expect to keep ZipCode up to date, and that older application code that has not been refactored to use the new schema won’t know to keep PostCode up to date. This trigger is an example of database scaffolding code, simple and common code that is required to keep your database “glued together”. This code has been assigned the same removal date as ZipCode.
Figure 4. Refactoring the Address table.
An interesting thing to notice about Figure 4 is the addition of the Country column to Address. Wait a minute, there isn’t an Add Column database refactoring in the catalog. Have we found a new type of database refactoring? No. Database refactorings are small changes to database schemas that IMPROVE their design, not simply CHANGE the design. Adding a new column is a transformation to the schema but not a design improvement to it. Although this is clearly a very small nuance I believe that it’s an important one.
To see how to implement the code for a database refactoring, read The Rename Column Database Refactoring.
Figure 5 depicts the lifecycle of a database refactoring. You first implement it within the scope of your initiative, and if successful you eventually deploy it into production. During the transition period both the original schema and the new schema exist, with sufficient scaffolding code to ensure that any updates are correctly supported. During the transition period some applications will work with PostCode and others with ZipCode, but not both at once. Regardless of which column they work with, the applications should all run properly. Once the deprecation period has expired the original schema plus any scaffolding code is removed and the database retested. At this point in time the assumption is that all applications work with PostCode.
Figure 5. The lifecycle of a database refactoring.
- Coordinate. Implement the refactoring, coordinating with the appropriate people at the organizational level (likely your data management group) to ensure it gets into your overall change process.
- Collaborate. Teams across your organization work together to change anyhing coupled to the database (see Figure 2).
- Conclude. The original schema and scaffolding code are removed, effectively concluding the refactoring.
4.1.5 Modify the Database Schema
The application developer and Agile data engineer work together to make the changes within the development sandbox. The strategy is to start each refactoring simply, by performing the refactoring within the development sandbox first you are effectively putting yourself in the situation described in Figure 1.
To do this, you need to update two logs (assuming you don’t have a database refactoring tool which does this automatically):
- Database change log. This is the source code implementing all database schema changes in the order that they were applied throughout the course of an initiative. When you are implementing a database refactoring, you include only the immediate changes in this log. When applying the Replace Column database refactoring we would include the DDL for adding the PostCode column and the data definition language (DDL) to implement the trigger(s) to maintain the values between the PostCode and ZipCode columns during the transition period.
- Update log. This log contains the source code for future changes to the database schema that are to be run after the transition period for database refactorings. In our example this would be the source code required to remove the ZipCode column and the triggers we introduced.
4.1.6 Migrate the Data
Many refactorings require you to migrate, or copy data, from the old version of the schema to the new. Your data migration log contains the data manipulation language (DML) to reformat or cleanse the source data throughout the course of your effort. In our example this would include any code to improve the quality of the values in the ZipCode column.
4.1.7 Update External Programs
The programs which access the portion of the database schema which you’re refactoring must be updated to work with the new version of the database schema. All of these programs must be refactored and then deployed in production before the transition period expires, as implied in Figure 5.
4.1.7 Run Your Regression Tests
Once the changes to your application code and database schema have been put in place you then need to run your regression test suite. Because successful tests discover problems you will need to rework things until you get it right. A significant advantage of database refactorings being small changes is that if your tests do in fact break you’ve got a pretty good idea where the problem lies – in the application code and database schema that you just changed. The larger your changes are the more difficult it becomes to track down problems, and therefore the slower and less effective your development efforts are. You’ll discover that developing in small, incremental steps works incredibly well in practice.
4.1.8 Announce the Database Refactoring
Because your database is a shared resource, minimally it is shared within your application development team if not by several application teams, the Agile data engineer needs to communicate the changes that have been made. If you haven’t already done so you should update the physical data model (PDM) for your database. I personally have a tendency to model the new schema in a PDM tool such as ERWin and then generate the initial DDL that I’ll then modify and include in my database change scripts.
4.1.10 Version Control Your Work
A critical skill for agile developers is the habit of putting all of their work under configuration management (CM) control by checking it into a version control tool. In the case of database refactoring this includes any DDL that you’ve created, change scripts, data migration scripts, test data, test cases, test data generation code, documentation, and models. This is in addition to the application-oriented artifacts that you would normally version – treat your database-oriented artifact the exact same way that you’d treat other development artifacts and you should be ok.
4.2 Step 2: Implement The Database Refactoring In Your Integration Sandbox
After several days have passed you will be ready to implement your database refactoring within your team integration sandbox. The reason why you need to wait to do so is to give your teammates time to refactor their own code to use the new schema.
Teams that have chosen to encapsulate access to their database via the use of a persistence framework will find it easier to react to database schema changes and therefore may discover they can tighten up the period between implementing a database refactoring within a development sandbox and in their team integration sandbox. This is due to the fact that the database schema is represented in meta data, therefore many database schema changes will only require updates to the meta data and not to the actual source code.
To deploy into each sandbox you will need to both build your application and run your database management scripts (tools such as Autopatch can help). The next step is to rerun your regression tests to ensure that your system still works – if not you will need to fix it in your development environment, redeploy, and retest. The goal in your team integration sandbox is to validate that the work of everyone on your team functions when put together, whereas your goal in the Test/QA sandbox is to validate that your system works well with the other systems within your organization.
Communication is a critical part of deploying database refactorings into your Test/QA sandbox, I’m using the plural now because you typically introduce several database changes (refactorings) into this environment at once. Long before you change your database schema you need to communicate and negotiate the changes with the owners of all of the other applications that access your database. Your operations engineers (if any) will be involved in this negotiation, they may even facilitate the effort, to ensure that the overall needs of your organization are met. Luckily the process that you followed in your development sandbox has made this aspect of database refactoring easier:
- The Agile data engineer only allowed database refactorings that can realistically be implemented – if another application team isn’t going to be able to rework their code to access the new schema then you can’t make the change.
- The documentation, even if it’s only a brief description of each change, that the Agile data engineer wrote is important because it provides an overview of the changes that are about to be deployed.
- The new version physical data model (PDM), which was updated as database refactorings were implemented, serves as a focal point for the negotiations with other teams. Agile Modeling (AM) would consider the PDM to be a “contract model” that your team has with the other application teams, a model that they can count on to be accurate and that they can count on being actively involved in negotiating changes to it.
4.3 Step 3: Install The Database Refactoring Into Production
Installing into production is the hardest part of database refactoring, particularly in the complex situation of Figure 2. You generally won’t deploy database refactorings on their own, instead you will deploy them as part of the overall deployment of one or more systems. Deployment is easiest when you have one application and one database to update, and this situation does occur in practice, but realistically we need to consider the situation where you are deploying several systems and several data sources at once. Figure 6 overviews the steps of deploying your refactorings into production.
Figure 6. The steps of deploying your database refactorings.
5. Adopting Database Refactoring Within Your Organization
Although the adoption of effective tools is an important part of enabling database refactoring it is only the tip of the iceberg – database refactoring requires a significant cultural change within your organization. Because database refactoring is an enabling technique of the Agile Data method many of the cultural issues for adopting database refactoring are the same ones that you face adopting the Agile Data method in general. These cultural issues include a serial mindset within many data professionals, resistance to change, and political inertia. The following approach should help you to overcome these challenges:
- Start simple. Database refactoring is easiest in greenfield environments where a new application accesses a new database, and the next easiest situation is when a single application accesses a legacy database. Both of these scenarios are typified by Figure 1. By starting simple you provide yourself with an environment in which you can learn the basics, once you understand the basics you are in a much better position to tackle the situations typified by Figure 2.
- Accept that iterative and incremental development is the norm. Modern software development methodologies take an iterative and incremental approach to software development. Although serial development is often the preferred approach by many data professionals unfortunately it doesn’t reflect the current way that application developers work. Time to change.
- Accept that there is no magic solution to get you out of your existing mess. Your data quality problems didn’t create themselves and they are not going to fix themselves. People created the problem and people need to fix the problem. Database refactoring is the safest and most straightforward strategy available to you to dig your way out of your data technical debt.
- Adopt a 100% database regression testing policy. For database refactoring to work, and in general for iteratively and incremental development to work, you need to be effective at regression testing. To be successful at database refactoring you need to not only be able to regression test the database itself but any application that is coupled to your database. The implication is that you require regression test suites for every single application, something you very likely do not have. So start writing them.
- Explore the technique. Experiment with database refactoring in simple situations first to learn the technique and gain some experience. Then start refactoring more complicated things. So, start simple.
Database refactoring works in practice, it isn’t simply just another academic theory. For the vast majority of organizations this is a new, “bleeding edge” technique.
6. Database Refactoring Strategies
Fowler (1999) suggests a collection of practices for code refactoring, practices that I recast below for database refactoring:
- Refactor to ease additions to your schema.
- Ensure the test suite is in place.
- Take small steps.
- Program for people.
- Don’t publish data models prematurely.
- The need to document reflects a need to refactor.
- Test frequently.
7. Database Refactoring in Context
The following table summarizes the trade-offs associated with database refactoring and provides advice for when (not) to adopt it.
|When to Adopt This Practice||You need to have a reasonably complete automated regression test suite in place, one that at least validates the critical and high-risk aspects of your database, for database refactoring to be a viable option.|
8. Parting Thoughts
Database refactoring supports an incremental approach to the evolution of your database schema, one of the three fundamental strategies (you can give up, take a “big bang release” approach, take an incremental approach). Each strategy has its unique strengths and weaknesses. I suspect that many organizations, perhaps because of a serial mindset, have either tried the big-bang release approach or have been too scared to do so and have now given up. It doesn’t have to be this way. Yes, it will likely take a significant effort for your organization to put the culture and technologies in place to support database refactoring across your enterprise, but in the long run this is likely far more palatable than your other alternatives.
See the Catalog of Database Refactorings.
Regardless of your strategy database evolution is hard, something that is particularly true when your database is highly coupled to other things. Database refactoring is not a silver bullet that’s going to magically solve all of your database problems. This article described how to successfully approach database refactoring within a simple, stovepipe environment.
9. Related Resources
This book describes, in detail, how to refactor a database schema to improve its design. The first section of the book overviews the fundamentals evolutionary database techniques in general and of database refactoring in detail. More importantly it presents strategies for implementing and deploying database refactorings, in the context of both “simple” single application databases and in “complex” multi-application databases. The second section, the majority of the book, is a database refactoring reference catalog. It describes over 60 database refactorings, presenting data models overviewing each refactoring and the code to implement it.