Agile Data Warehousing: A Disciplined Approach
This article overviews a disciplined approach to agile data warehousing (DW). The focus of this article is on the process itself, as opposed to specific architecture and design techniques (for those I highly suggest Data Vault 2). Furthermore, this topic is clearly worthy of a book containing detailed descriptions of the techniques and artifacts described below (however, I have included numerous links to such details if you’re willing to explore further on your own).
This article addresses:
- Introduction to Agile DW/BI
- Agile Data Warehousing
- Artifact Creation by Agile DW Teams
- Related Resources
1. Introduction to Agile Data Warehousing/Business Intelligence
Many organizations start their agile journey by adopting Scrum. Scrum describes a good strategy for leading agile software teams but is only part of what is required to deliver sophisticated solutions to your stakeholders. Invariably, teams need to look to other methods to fill in the process gaps that Scrum purposely ignores. When looking at other methods there is considerable overlap and conflicting terminology that can be confusing to practitioners as well as outside stakeholders. Worse yet, people don’t always know where to look for advice or even know what issues they need to consider.
To address these challenges, this article describes how to take an agile approach to data warehouse (DW)/business intelligence (BI) development.
1.1 Agile Data Roles
Figure 1 summarizes the roles that are pertinent to the Agile Data method. They are organized into two categories:
- Primary roles. These roles are the focus of the Agile Data method.
- Supporting roles. People in these roles collaborate with, and support the work of, people in the primary roles. These roles are adopted from other methods such as Scrum or Agile Modeling.
Figure 1. The Roles of the Agile Data Method.
There are several important observations about these agile data warehousing roles:
- These roles are different from traditional roles. Moving to an agile way of working requires a paradigm shift. Part of that paradigm shift is an improved set of roles and responsibilities held by people on agile teams. As a professional you need to be prepared to work in an agile manner to fit into an agile team.
- The era of the specialist is over. Key tenets of agile development are that people work together collaboratively to produce a working solution in an evolutionary (iterative and incremental) manner. An implication of that is that we’re no longer able to work in a manner where people with narrow specialties – such as logical data modeler, physical data modeler, data analyst, and data architect – each do their small part of the work and then hand it off to the next person. That traditional approach is too slow, expensive, and error prone. Instead we build teams of “T-skilled” generalizing specialists who have one or more specialties (such as the ones listed above) PLUS a broader set of skills and knowledge that allow them to take on a wider range of tasks and work more effectively with their team mates.
- There is room for you if you’re willing to learn. Although these roles are different from what you may be used to, it is still possible, and highly desirable, for you to transition into one of these roles.
- The need for specialists still exists. There is a very small range of situations where specialists are still needed. Having said that, it is highly unlikely that you’re in one of those situations. If you are new to agile, your best strategy is to assume that you are not in one of those situations and that you need to start working on becoming a generalizing specialist (not a generalist!) just like the vast majority of people.
1.2 Be Enterprise Aware
“Be enterprise aware” is one of the ways of thinking (WoT) of the Agile Data (AD) method. The observation is that DW teams work within your organization’s enterprise ecosystem, as do all other teams. There are often existing systems currently in production and, minimally, your solution shouldn’t adversely impact them. Better yet, your solution will hopefully leverage existing functionality and data available in production. You will often have other teams working in parallel to your team, and you may wish to take advantage of a portion of what they’re doing and vice versa. Your organization may be working towards business or technical visions which your team should contribute to. A governance strategy exists which hopefully enhances what your team is doing.
It is important that data warehousing teams work in an enterprise aware manner for several reasons. You should:
- Adopt your organization’s development and data conventions. The implication is that your delivery team will need to work closely with your organization’s enterprise architecture and data management teams who are typically responsible for such conventions.
- Leverage existing infrastructure wherever possible. You will want to work with your organization’s enterprise architecture team to understand the technical direction of your company and with your reuse engineering team (if you have one) to identify and leverage existing assets.
- Fix existing legacy systems and data sources whenever possible.
- Share learnings whenever possible.
- Be governed appropriately. Like it or not, your team is being governed. Effective IT organizations recognize that agile teams need to be governed in an agile, not a traditional, manner.
Your agile data warehousing team will be affected, hopefully in a positive manner, by other teams running in parallel to you. Your team will work closely with the Enterprise Architects (your Architecture Owner may even be an EA) to understand their long-term vision. You will work with the Data Management group to understand and access existing legacy data sources. You’ll work with your Release Management and Operations teams to release your solution into production. Your Support/Help Desk team is likely providing enhancement requests and bug reports to your team. You may have dependencies on other delivery teams.
Not only must our team work in an enterprise aware manner, these other teams need to be prepared to work in a more agile manner when interacting with our DW team. It will be very difficult for our DW team to work in an agile manner if it relies on other teams who aren’t prepared, or worse yet aren’t sufficiently skilled, to work in an agile manner too. Having said that, the entire EA or DM team doesn’t need to be agile, but enough people on those teams must be so that they can support the agile delivery teams appropriately.
1.3 Choose the Right Lifecycle
As I describe below, my advice for DW/BI teams new to agile is to:
- Adopt an agile project lifecycle for their first release.
- Adopt a Lean continuous delivery “product” lifecycle for subsequent releases (if sufficiently skilled to do so).
1.3.1 First Release: Agile Project Lifecycle for DW/BI Teams
Figure 2 depicts a Scrum-based agile project lifecycle. This lifecycle deviates from the common Scrum lifecycle in two important ways. First, it depicts a three-phase delivery lifecycle, not just a single-phase construction lifecycle. Do agile lifecycles have phases? Yes! Second, it shows external inputs coming into the team from other areas of your organization.
Figure 2. The Agile Project Lifecycle.
Although phase tends to be a swear word within the agile community, the reality is that the vast majority of teams do some up-front work at the beginning of an initiative. The Scrum-based lifecycle above explicitly calls out three phases:
- Ideation/Sprint 0. During this phase team initiation activities occur. This includes initial scoping, initial architectural modeling, high-level release planning, putting the team together, starting into your risk management approach, setting up your work environment, and securing funding for the rest of the release. On average this effort takes longer than a single sprint, even though this is called “Sprint 0”. During this phase we do some very streamlined envisioning activities to properly frame the initiative. In greenfield environments this may take longer than environments where you have an established infrastructure. For example, as part of your initial architecture activities you are likely to have to invest more time thinking through your architectural options and may even need to run short experiments such as spikes or proofs of concept. It takes discipline to keep Ideation short.
- Construction. During this phase a disciplined team will produce a potentially consumable solution on an incremental basis. They may do so via a set of sprints/iterations or do so via a lean, continuous flow approach (see Figure 3). The team applies a hybrid of practices from Scrum, XP, Agile Modeling, Agile Data, and other methods to deliver the solution. More on this later.
- Deploy. For many teams, deploying the solution to their stakeholders is often a complex exercise. Agile DW/BI teams, as well as the enterprise overall, will streamline their deployment processes so that over time this phase becomes shorter and ideally disappears as the result of adopting continuous deployment strategies. It takes discipline to evolve deployment from a phase to an activity.
Agile data warehousing teams that are working on their first release likely follow the Scrum-based lifecycle. Because they are working on the first release they will need to invest time in basic initiation efforts discussed earlier. Furthermore, because they are likely new to agile they will find the Scrum lifecycle to be easier to adopt as it prescribes the timing of common practices (such as planning, demos, and retrospectives) and forces the team into delivering on a regular basis (there should be more working stuff that could potentially be deployed to stakeholders at the end of each iteration/sprint). A detailed description of the type of activities that occur during each phase appears later in this article.
1.3.2 Subsequent Releases: Lean Product Lifecycle for Agile Data Warehousing Teams
Figure 3 depicts a lean continuous delivery lifecycle. This lifecycle is often followed by DW/BI teams who are evolving an existing data warehouse that is already running in production. The requirements for such solutions often come in continuously from stakeholders instead of in a large batch. These requirements are often small in nature, typically adding a new data field, updating an existing report or data download, or adding a new report/download. Furthermore, these requirements are self-contained and can be added easily to a well-architected DW/BI solution. Many of these requirements can be implemented in a few hours or days, so it often makes sense to do the work right away and release the new functionality as soon as you can. The temporal overhead of iterations/sprints, let alone regularly scheduled releases, of the Scrum-based lifecycle often doesn’t make sense for evolving an existing production DW, hence the need for a different approach.
Figure 3. A Lean Continuous Delivery Lifecycle.
The Lean Continuous Delivery lifecycle of Figure 3 varies from the Scrum-based lifecycle of Figure 2 in several important ways:
- It supports a continuous flow of delivery. In this lifecycle the solution is deployed as often, and whenever, it makes sense to do so. Work is pulled into the team when there is capacity to do it, not on the regular heartbeat of an iteration. This enables your agile data warehousing team to be more responsive to stakeholders.
- Practices are on their own cadences. With iterations/sprints many practices (detailed planning, retrospectives, demos, detailed modeling, and so on) are effectively put on the same cadence, that of the iteration. With a lean approach the observation is that you should do something when it makes sense to do it, not when the calendar indicates that you’re scheduled to do it. This enables you to streamline your activities more, BUT requires greater discipline.
- It has a work item pool. All work items are not created equal. Although you may choose to prioritize some work in the “standard” manner, such as a value-driven approach as Scrum suggests, not all work fits this strategy. Some work, particularly that resulting from legislation, is date driven. Some work must be expedited, such as fixing a severity one production problem. So, a JIT prioritization strategy, rather than a prioritized stack (as in the Scrum-based lifecycle), makes a bit more sense when you recognize these realities.
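The work item pool idea in the last point above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `WorkItem` fields, the `pull_next` function, and the 14-day window for date-driven work are hypothetical names and values chosen for the example, not part of any lifecycle definition. The sketch pulls expedited work first, then date-driven work whose deadline is close, and otherwise the highest-value item.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class WorkItem:
    """A hypothetical work item; the field names are assumptions for this sketch."""
    name: str
    business_value: int = 0          # used for "standard" value-driven work
    expedite: bool = False           # e.g. a severity one production problem
    due_date: Optional[date] = None  # e.g. work driven by legislation


def pull_next(pool: List[WorkItem], today: date,
              date_driven_window: timedelta = timedelta(days=14)) -> Optional[WorkItem]:
    """Pull one item when the team has capacity, deciding priority just in time."""
    if not pool:
        return None
    expedited = [item for item in pool if item.expedite]
    if expedited:
        chosen = expedited[0]
    else:
        # Date-driven items only jump the queue once their deadline is near.
        urgent = [item for item in pool
                  if item.due_date is not None
                  and item.due_date - today <= date_driven_window]
        if urgent:
            chosen = min(urgent, key=lambda item: item.due_date)
        else:
            chosen = max(pool, key=lambda item: item.business_value)
    pool.remove(chosen)
    return chosen
```

Because the decision is made at the moment work is pulled, a newly arrived production problem or an approaching regulatory deadline is picked up without having to re-sort a backlog.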
2. Agile Data Warehousing
Let’s explore how an agile data warehousing team works in practice. Table 1 summarizes the primary and secondary development activities that are potentially performed by the team. Primary activities are ones that add direct value to the development effort. Secondary activities tend to focus on long-term documentation that may add value in the future, but the value proposition tends to be dubious in practice so you want to be very careful in how much effort you invest in them. I tend to think of these as sideline activities that I would only do if the team has time to spare from primary activities.
Table 1. Activities on an Agile Data Warehousing Team.
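Phase | Focus | Primary Activities | Secondary Activities |
Sprint 0 (Ideation) | Perform just enough work to get going in the right direction | Initial usage modeling; initial conceptual modeling; identification of potential data sources; high-level data source analysis; initial architectural modeling; initial release planning; adopt common guidelines; initiate a data testing strategy | Detailed data modeling (partial); source-to-target data mapping (partial); detailed data source analysis (partial) |
Construction | Produce a potentially consumable solution on an incremental basis | Development of vertical, fully functional slices; prove the architecture with working code; detailed data source analysis; implementation of source-to-target data mapping; database refactoring; physical data modeling; regression testing; continuous integration (CI); continuous deployment (CD); continuous documentation; detailed planning; coordination meetings; demos; retrospectives | Logical data modeling; documentation of source-to-target data mapping; meta-data documentation |
Deploy | Ensure that the solution is consumable, then deploy it | End-of-lifecycle testing; last-minute fixes; finalize deliverable documentation; deploy database schema changes; migrate production data to new schema | Finalize secondary documentation |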
For now we’ll assume that your team is working on the first release of a DW solution. As a result you will need to take a three-phase approach. For teams working with an existing DW solution that is already running in production, you may find that you do not need to work through Ideation (or at least you only need an abbreviated version of it).
2.1 The Ideation/Sprint 0 Phase for Agile Data Warehousing Teams
During the Ideation phase, also known as Sprint 0, the team strives to perform just enough work to get going in the right direction. Disciplined teams will spend a few days or perhaps a week or so to do so, not several weeks or months. The key is that they work in a very streamlined manner.
Potential primary activities on an agile data warehousing team include:
- Initial usage modeling. Agile DW teams take a usage-driven, not a data-driven, approach to modeling. Understanding the data is still important, don’t get me wrong, it is just that it isn’t anywhere near as important as understanding how the data will be used. The most common strategy used by agile teams to explore usage is to write user stories and epics (which are large user stories). However, I’ve found that for data-oriented requirements you are better off writing question stories, which are an extension of user stories. Examples of question stories that would support the development of a DW/BI solution for a retail bank include “As a Branch Manager I need to analyze the portfolio of a customer so that we can target services to them”, “As a Branch Manager I need to explore the transactions occurring in my branch so that I better understand my customer needs”, or “As a Mortgage Officer I need to explore the risk profile of a potential mortgage holder so that I can decide how much we can loan that person.” Notice how all of these requirements focus on usage, not data details. The details can come later.
- Initial conceptual modeling. An important, supporting model to the usage model is a high-level conceptual model, sometimes called a domain model. This diagram should indicate the main entity types within the domain and the relationships between them. It does not need to indicate potential data elements, nor does it need to be perfect, it just needs to be reasonably close at this time. The goal is to gain a reasonable understanding of domain terms at this time, the details will emerge later during construction. For the banking application the domain model may indicate entity types such as Customer, Account, Mortgage, Loan, Branch, Portfolio, and perhaps another twenty or so entity types.
- Identification of potential data sources. Early in a DW/BI initiative the team needs to identify the main (potential) sources of data. This information is often captured on a network diagram, or something similar. This diagram will overview the flow of information within your technical architecture, indicating the potential data sources, how information is obtained from those sources (see next activity), and how the data flows through your firewalls, staging areas (if any), your data warehouse(s), and any data mart(s).
- High-level data source analysis. During Ideation you will want to obtain basic information about potential data sources, such as who the primary contacts are, what type of data each source contains, how that data is accessed (e.g. via file transfers, via SQL queries, via web services, and so on), and sizing information (e.g. volume of data and rate of change). The goal right now is to gain sufficient understanding of the data sources so you can make intelligent architecture decisions about them.
- Initial architectural modeling. Early in an agile DW/BI initiative you want to identify a viable architectural strategy. Part of that strategy will be identifying potential data sources; part will be identifying how data will flow from the data sources to the target data warehouse(s) or data marts; and part will be making that flow work through combinations of data extraction, data transformation, and data loading capabilities. The layout of the technical architecture is often captured using a network diagram, discussed earlier. Architectural notes, in particular important technical decisions as well as good things to know (such as the data source information described earlier), are often captured in an Architectural Handbook. This handbook is often implemented as a collection of wiki pages so that anyone who is interested may have access to the information.
- Initial release planning. Contrary to what you may have heard, agile data warehousing teams perform some high-level release planning. Teams are often required to guesstimate the potential cost of the release they are working on as well as the potential delivery date. These guesstimates, or estimates if you prefer, are best presented as ranges so as to reflect the uncertainty of the information the guesstimate is based upon.
- Adopt common guidelines. Effective agile teams are enterprise aware, an aspect of which means that they strive to adopt and then follow common guidelines. These guidelines include data guidelines, security standards, coding guidelines, user interface guidelines, and many others.
- Initiate a data testing strategy. Testing is so important on agile teams that we do it all the way through the lifecycle, not just during some phase at the end of the lifecycle. This includes the testing of all functionality, including any functionality pertaining to data-oriented issues, and the testing of the data itself. There are many things that can be tested pertaining to databases; see Database Testing: How to Regression Test a Relational Database for some thoughts on the subject. Your data testing strategy should address issues such as how to test extract-transform-load (ETL) logic, how to validate data sources, how to ensure the quality of the data in the data warehouse(s) and data mart(s), what tools will be used, and identification of who has the skills to do the work. This may be the greatest challenge faced by traditional data professionals as they adopt agile ways of working (WoW): not only do few traditional data professionals have data testing skills, the vast majority don’t even realize how critical those skills actually are.
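To make the idea of testing the data itself more concrete, here is a minimal sketch of automated data quality checks, written in Python against an in-memory SQLite database. The table names (`dim_customer`, `fact_transaction`), their columns, and the three checks chosen are assumptions made purely for illustration; a real strategy would target your own schema and tools.

```python
import sqlite3


def create_demo_warehouse(conn, transactions):
    """Build a tiny, hypothetical warehouse schema and load sample rows."""
    conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE fact_transaction "
                 "(transaction_id INTEGER, customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", [(1, "Ana"), (2, "Bo")])
    conn.executemany("INSERT INTO fact_transaction VALUES (?, ?, ?)", transactions)
    conn.commit()


def count_data_quality_issues(conn):
    """Return a count per check; all zeros means the data passed these checks."""
    checks = {
        # Every transaction should reference a customer.
        "null_customer_id":
            "SELECT COUNT(*) FROM fact_transaction WHERE customer_id IS NULL",
        # Referenced customers should exist in the dimension table.
        "orphan_customer_id":
            "SELECT COUNT(*) FROM fact_transaction t "
            "LEFT JOIN dim_customer c ON c.customer_id = t.customer_id "
            "WHERE t.customer_id IS NOT NULL AND c.customer_id IS NULL",
        # A transaction id should be loaded only once.
        "duplicate_transaction_id":
            "SELECT COUNT(*) FROM (SELECT transaction_id FROM fact_transaction "
            "GROUP BY transaction_id HAVING COUNT(*) > 1)",
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}
```

Checks like these are cheap to run after every load, which is what allows them to become part of a regression test suite rather than an end-of-lifecycle activity.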
Potential secondary activities include:
- Detailed data modeling (partial). You may need to start doing some data modeling, both logical data model (LDM) and physical data model (PDM) development, during Ideation. Your LDM, if you create it at all, is typically used for detailed data analysis, an activity which occurs during Construction. Similarly your PDM(s) are used to design the database schema of your DW and data mart(s), also work that is typically done during Construction. Having said that, during Ideation you may choose to do detailed look-ahead modeling of high-priority requirements, and the design to support them, that you intend to implement in the first iteration or two of Construction. As a result you MIGHT do some data modeling work, see below for a more detailed discussion of what that might entail.
- Source-to-target data mapping (partial). Part of your look-ahead modeling effort, sometimes called “backlog refinement” by Scrum practitioners, will be to do just enough data mapping to implement the first few stories in your backlog. You need to know where the data is coming from to implement just these stories. Yes, any given data source may have hundreds of data elements that your team may potentially be interested in at some point, but for now you just need to map the handful of data elements required to implement the first few stories. That’s it. Future data mapping, if at all, will be performed in an evolutionary manner throughout construction.
- Detailed data source analysis (partial). Similarly, you will do just enough analysis of your data sources to understand just the data elements required for the first few requirements.
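The “just enough” mapping described above can be sketched as follows. This is an illustrative example only: the story names, the data element names, the source systems, and the source field names (such as `CUST_NM`) are all hypothetical. The point is that the set of elements to map is derived from the first few stories, not from everything the sources contain.

```python
# Hypothetical catalog: target data element -> (source system, source field).
# A real source may expose hundreds of elements; only a few are shown here.
SOURCE_CATALOG = {
    "customer_name": ("crm", "CUST_NM"),
    "account_balance": ("core_banking", "ACCT_BAL"),
    "branch_code": ("core_banking", "BR_CD"),
    "mortgage_rate": ("loans", "MTG_RT"),
    "loan_term_months": ("loans", "LN_TERM"),
}

# Hypothetical data needs of the first few stories in the backlog.
FIRST_STORIES = {
    "Analyze customer portfolio": ["customer_name", "account_balance"],
    "Explore branch transactions": ["branch_code", "account_balance"],
}


def elements_to_map_now(stories, catalog):
    """Return the mappings needed for these stories, plus any elements with no known source."""
    needed = sorted({element for elements in stories.values() for element in elements})
    mapping = {element: catalog[element] for element in needed if element in catalog}
    missing = [element for element in needed if element not in catalog]
    return mapping, missing
```

Anything reported as missing is a prompt for a small piece of data source analysis, and elements such as the hypothetical `mortgage_rate` stay unmapped until a story actually needs them.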
2.2 The Construction Phase for Agile Data Warehousing Teams
You are likely to adopt a collection of activities as we saw in Table 1 above. Potential primary activities include:
- Development of vertical, fully functional slices. Each iteration the DW/BI team will produce a solution that is consumable, something that could potentially be shipped into production and that people actually want to use. This means that you will analyse, design, implement, and test that functionality during the iteration (and most agile teams work in iterations that are two weeks in length or less). This is why it is so important to take a usage-driven approach and not a data-driven approach: your team needs to be always working on some new functionality that adds real value to your stakeholders. In a given iteration you do the work to completely implement one or more reports, or perhaps a portion of a report or an enhancement to an existing report. You will do the work to extract the data from the data source(s), transform/clean it, and load it into the DW. This can be tough initially because you will not have the infrastructure in place yet during the first few Construction iterations. For example, the first time you extract data from a data source you’ll need to do a lot of the work to access that data source. You can read more about implementing a data warehouse via vertical slicing.
- Prove the architecture with working code. There are always technical risks on DW initiatives. Maybe the technologies that you’re working with are new to your organization. Maybe several data sources are difficult to work with, either because the owners of the data sources are difficult, because there are quality issues with the data, because there are architectural differences between the data sources (e.g. some are real time and some are batch systems), or perhaps because there are volume challenges (i.e. “big data”). Agile teams remove these sorts of risks by implementing functional requirements that touch on the risks right away. Worried about accessing data from a batch system? Start by writing a report that needs data from it. Worried about whether your ETL tool is going to work well? Implement one or more requirements that require the key features of that tool. Worried about whether you’ll be able to handle the big data load? Implement a requirement that needs that data. Many times traditional teams, and even undisciplined agile teams, will put off hard aspects of their architecture to the end of the lifecycle, thereby increasing the potential costs of fixing any problems that they do run into. Agile teams prefer to address their risks as early as they can, when they have the most time and resources to respond to the problems.
- Detailed data source analysis. Data source analysis occurs on a just-in-time (JIT), or near-JIT, basis. Your team will do the analysis required to implement the current requirements (let’s assume they’re captured as user stories). So, if a story requires data from three data sources then you will do the analysis of those data elements from those data sources. Of course, you’re likely doing the analysis for several stories at a time, perhaps five or six stories. Furthermore, disciplined teams will be doing look-ahead modeling, described earlier, where you are doing the analysis for stories that are coming up in the next iteration or two. The basic strategy is to have just enough data source analysis done before you go to implement the actual functionality. You are likely to do more data source analysis in earlier Construction iterations as opposed to later iterations: as you populate your DW, the data required for new reports or queries is more likely to be there over time. Traditional teams have a tendency to do comprehensive data source analysis one data source at a time, followed by the implementation work needed to obtain and then load the data into the DW. This appears efficient from the point of view of the person(s) doing the work, but proves to be rather inefficient in practice from the point of view of your stakeholders for two reasons. First, it takes much longer to get to the point where you have sufficient data in place to implement the reports that they actually want, or to support answering their questions. In other words, you have a very high cost of delay (also referred to as opportunity cost). Second, you end up analyzing (and then implementing) data elements that aren’t actually needed.
- Implementation of source-to-target data mapping. The implementation work to extract the data from source, transform the data as required, and then load it into your target database(s) will be done in a JIT, evolutionary manner. Each iteration your team will do the work to implement one or more stories from end-to-end (e.g. vertical slices through your solution) and part of this work is the implementation of the source-to-target data mappings.
- Database refactoring. A refactoring is a simple change to your design that improves its quality without changing its semantics in a practical manner. A database refactoring is a simple change to a database schema that improves the quality of its design OR improves the quality of the data that it contains. Database refactoring enables you to safely and easily evolve database schemas, including production database schemas, over time. This technique, in combination with database regression testing and continuous integration, allows us to develop data warehouses, or any solution involving a database for that matter, in an agile manner.
- Physical data modeling. The physical data models (PDMs) describing your databases, including both source and target databases, will evolve throughout Construction. Please see the article Agile/Evolutionary Data Modeling for a detailed description of how to go about agile data modeling.
- Regression testing. Quality is paramount for agile teams. Disciplined agile teams will develop, in an evolutionary manner of course, a regression test suite that validates their work. They will run this test suite many times a day so as to detect any problems as early as possible so that they can address them as cheaply as possible (remember, the average cost of fixing a defect rises exponentially the longer it takes you to find it). In fact, very disciplined teams will take a test-driven development (TDD) approach where they write tests before they do the work to implement the functionality that the tests validate. As a result the tests do double duty – they validate and they specify (which is one of many reasons why agile teams require far less documentation than traditional teams, their specifications are in effect executable as opposed to static). Please see Database Testing: How to Regression Test a Relational Database for a more detailed description of this strategy.
- Continuous integration (CI). CI is a technique where you automatically build and test your system every time someone checks in a code change. Agile developers will typically update a few lines of code, or make a small change to a configuration file, or make a small change to a PDM and then check their work into their configuration management tool. The CI tool monitors this, and when it detects a check in it automatically kicks off the build and regression test suite in the background. This provides very quick feedback to team members, enabling them to detect issues early.
- Continuous deployment (CD). When an integration build is successful (it compiles and passes all tests) your CD tool will automatically deploy to the next appropriate environment(s). For example, if the build runs successfully on a developer’s work station their changes are propagated automatically into the team integration environment (which automatically invokes CI in that space). When the build is successful in your team integration environment perhaps it’s promoted into an integration testing environment, and so on.
- Continuous documentation. Your team should be solution focused, not just software focused. Because documentation is part of the overall solution that you deliver, you should develop key documentation (system overview documentation, help guidelines, and so on) as you develop the software. For more information, see the Agile Modeling practice Document Continuously.
- Detailed planning. Planning is so important on disciplined agile teams that we do it all the way through Construction. Detailed planning occurs at the beginning of each iteration (for teams following the Scrum-based lifecycle) or in an as-needed, just in time (JIT) manner (for teams following the lean/continuous delivery lifecycle). Team members may also choose to engage in look-ahead planning to begin thinking through the next iteration or two in complex situations.
- Coordination meetings. The team needs to coordinate their work both internally within the team and externally with other teams.
- Demos. Agile teams demonstrate their work on a regular basis, typically at the end of each iteration. This demo, typically run by your product owner, shows off the work that your team has accomplished since the last demo. Because you are taking a usage-driven approach, each iteration you should have added more functionality that provides real value to your stakeholders. For example, your demo might walk through a new report that your team built that iteration, show how a calculation was updated on an existing report, and show how there are seven new data columns available to people doing ad-hoc reporting.
- Retrospectives. One of the principles behind the Agile Manifesto is that teams should regularly reflect on what they’re doing and strive to learn and improve their approach. Retrospectives are a simple technique for doing exactly that.
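As an illustration of how database refactoring and regression testing work together, the sketch below applies one small schema change to an in-memory SQLite database and checks that existing behaviour is unchanged. The `customer` table, the column names, and the specific refactoring (introducing a better-named `first_name` column alongside the old `fname` column) are assumptions made for this example, not a prescribed procedure.

```python
import sqlite3


def create_original_schema(conn):
    """Hypothetical starting point: a customer table with a poorly named column."""
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, fname TEXT)")
    conn.executemany("INSERT INTO customer (id, fname) VALUES (?, ?)",
                     [(1, "Ana"), (2, "Bo")])
    conn.commit()


def introduce_first_name_column(conn):
    """The refactoring: add the better-named column and backfill it.

    The old column is kept for now so that existing readers keep working.
    """
    conn.execute("ALTER TABLE customer ADD COLUMN first_name TEXT")
    conn.execute("UPDATE customer SET first_name = fname")
    conn.commit()


def customer_names(conn, column):
    """Read names through either column; the column is checked against a fixed list."""
    if column not in ("fname", "first_name"):
        raise ValueError(f"unexpected column: {column}")
    return [row[0] for row in conn.execute(f"SELECT {column} FROM customer ORDER BY id")]


def refactor_with_regression_check():
    """Capture behaviour before the change, apply it, and capture behaviour after."""
    conn = sqlite3.connect(":memory:")
    create_original_schema(conn)
    before = customer_names(conn, "fname")
    introduce_first_name_column(conn)
    after_old_column = customer_names(conn, "fname")
    after_new_column = customer_names(conn, "first_name")
    conn.close()
    return before, after_old_column, after_new_column
```

If the three results differ, the change altered semantics and should not be promoted. Running a check like this on every check-in is where continuous integration comes in.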
Potential secondary activities include:
- Logical data modeling. In practice, logical data modeling tends to add very little value to the overall development effort (other than the “value” of keeping logical data modelers employed of course). If you do decide to invest time in an LDM, keep it as streamlined as possible and DO NOT allow this effort to slow down development. If you think that your LDM can offer actual value to your organization, and in the traditional world that’s possible, then ask yourself how you can add the same value using tests. I’ll write up a more detailed article on this at some point in the future. For now my best advice is to be very leery about LDMing.
- Documentation of source-to-target data mapping. You will likely find that you want to document your mappings. Once again, you will find it more effective to capture these mappings in the form of tests (assuming you have the skills to do so) rather than static documentation. If you find that you need to resort to documentation, remain as agile as possible and keep the documentation concise. Please see Agile/Lean Documentation Strategies for more ideas about how to keep your documentation concise and sufficient.
- Meta-data documentation. Once again, follow agile documentation strategies for capturing any meta-data information.
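As a minimal sketch of capturing source-to-target mappings as tests rather than as static documentation, the following Python example uses the standard sqlite3 module. The table names (src_customer, dw_customer), their columns, and the MAPPINGS dictionary are hypothetical, invented purely for illustration; a real team would apply the same idea to its own sources and targets.

```python
import sqlite3

# Hypothetical mapping: target column -> SQL expression over the source table.
MAPPINGS = {
    "customer_name": "TRIM(first_name || ' ' || last_name)",
    "country_code": "UPPER(country)",
}


def build_sample_db():
    """Create an in-memory database with a tiny source and target table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE src_customer (id INTEGER PRIMARY KEY, first_name TEXT,
                                   last_name TEXT, country TEXT);
        CREATE TABLE dw_customer (id INTEGER PRIMARY KEY, customer_name TEXT,
                                  country_code TEXT);
        INSERT INTO src_customer VALUES (1, 'Ada', 'Lovelace', 'gb'),
                                        (2, 'Grace', 'Hopper', 'us');
        INSERT INTO dw_customer VALUES (1, 'Ada Lovelace', 'GB'),
                                       (2, 'Grace Hopper', 'US');
    """)
    return conn


def mapping_violations(conn, target_column, source_expression):
    """Return ids of target rows whose value differs from the mapped source value."""
    query = f"""
        SELECT t.id
        FROM dw_customer AS t
        JOIN src_customer AS s ON s.id = t.id
        WHERE t.{target_column} IS NOT ({source_expression})
        ORDER BY t.id
    """
    return [row[0] for row in conn.execute(query)]


def check_all_mappings(conn):
    """Map each target column to the list of row ids that violate its mapping."""
    return {col: mapping_violations(conn, col, expr) for col, expr in MAPPINGS.items()}
```

Run as part of the regression suite, a check like this both documents the mapping and fails as soon as the load logic drifts away from it, which static documentation cannot do.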
2.3 The Deploy Phase for Agile Data Warehousing Teams
During the Deploy phase the team strives to ensure that the solution is consumable and, once it is, deploys it. To address these process goals, you are likely to adopt a collection of activities as we saw in Figure 5 above. Potential primary activities include:
- End-of-lifecycle testing. Some testing may slip into the Deploy phase. Ideally all testing should occur during Construction, other than one last run of your regression test suite to ensure you’re ready to ship. But it isn’t always an ideal world. See end-of-lifecycle testing for more details.
- Last-minute fixes. If you perform end-of-lifecycle testing there is always the risk that the testing effort finds some bugs. Your Product Owner may decide that these bugs need to be fixed before you’re ready to ship.
- Finalize deliverable documentation. Some teams will let some or all of their documentation slip to the end of the lifecycle. This practice is called Document Late, although I prefer the continuous documentation approach described earlier.
- Deploy database schema changes. Part of your overall deployment efforts will be to deploy database schema changes. If you’ve been taking a database refactoring approach then this will be very straightforward, as your change scripts will already be running and fully tested. Before making the schema changes you should consider creating a backup of the database. You may find The Process of Database Refactoring: Install Into Production to be an interesting read.
- Migrate production data to new schema. TBD.
The only potential secondary activity is to finalize any secondary documentation, such as your logical data model (LDM) or your meta-data documentation, that you believe will add real value in the future. The risks of investing too much effort on these sorts of activities have been discussed earlier.
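The schema deployment step described above (back up first, then apply change scripts that have already been tested) might be sketched as follows. This is an illustration only, using Python's standard sqlite3 module: the CHANGE_SCRIPTS list, the schema_version tracking table, and the deploy function are assumptions made for the example, not a description of any particular tool.

```python
import sqlite3

# Hypothetical, ordered change scripts; in practice these would live in version control.
CHANGE_SCRIPTS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE customer ADD COLUMN country_code TEXT"),
]


def applied_versions(conn):
    """Return the set of change-script versions already applied to this database."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)")
    return {row[0] for row in conn.execute("SELECT version FROM schema_version")}


def deploy(conn, backup_conn, scripts=CHANGE_SCRIPTS):
    """Back up the database, then apply any change scripts not yet applied."""
    conn.commit()
    conn.backup(backup_conn)  # snapshot taken before any schema change is made
    done = applied_versions(conn)
    newly_applied = []
    for version, ddl in sorted(scripts):
        if version in done:
            continue  # already deployed in an earlier release
        conn.execute(ddl)
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        conn.commit()
        newly_applied.append(version)
    return newly_applied
```

Because each script is recorded once it has been applied, running deploy a second time is a no-op, so the same routine can be exercised repeatedly in test environments before it is run against production.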
3. Artifact Creation by DW Teams: Traditional vs. Agile
Figure 4 compares the typical level of effort expended on creating artifacts by traditional and agile teams. There are several interesting differences between the approaches. Agile data warehousing teams will:
- Create a high-level conceptual model early. A high-level conceptual model, or more accurately diagram at this point, identifies the critical business entity types and the relationships between them. This provides vital insight into the domain while helping to capture key domain terminology, thus helping to drive consistency of wording in other artifacts (such as user stories and epics). Traditional teams will often make the mistake of over-documenting the conceptual model early in the lifecycle. Doing so injects delay into the team (with the corresponding opportunity cost) and adds the risk of making important decisions when the team and its stakeholders have the least knowledge of the actual end goal.
- Evolve a minimal logical data model (LDM) over time. If your agile data warehousing team does this at all they will keep their LDM very concise and easy to evolve. Traditional teams will often invest heavily in their LDMs because they believe that quality and consistency can be ensured by specifying them in up front. This often proves to be wishful thinking in practice. Agile teams instead invest their efforts in creating an executable specification in the form of regression tests (more on this below).
- Evolve a detailed physical data model (PDM) over time. Agile teams realize that a PDM, when created via a data modeling tool with full round-trip engineering (it generates schemas as well as imports existing schemas), effectively becomes the source code for the database. As the requirements evolve the team will evolve the PDM to reflect these new needs, generating schema changes as needed. They can work this way because they are able to easily refactor and regression test their database. Traditional teams, in contrast, often perform detailed modeling up front, motivated by the mistaken belief that production database schemas are difficult to evolve, something that agilists know not to be true.
- Develop a comprehensive regression test suite over time. These tests address several important issues. First, they validate the work of the team, showing that their work to date fulfills the requirements as they’ve been described to the team. Second, a regression test suite enables the team to safely evolve their work. Agile developers can make a small change, rerun their regression tests, and see whether they broke something (if so, then they either roll back their change or fix what they broke). Third, when a test is written before the corresponding database schema or database code is developed, the test effectively becomes a detailed executable specification. Sophisticated agile data warehousing teams will capture the kinds of information that were previously captured in LDMs in executable tests, and are thus much more likely to have consistent schemas than teams that still rely on static LDMs.
- Capture critical meta-data over time. Because the rest of your organization may not be completely agile there is often a need to continue to capture key meta-data about data sources. This meta-data should be kept as light as possible. If there isn’t a definite need for it then don’t capture it. If someone says “but we might need it someday” then wait until someday and invest in capturing the information at that point. Furthermore, instead of capturing meta-data in a static manner (i.e. as documentation), try to identify ways to capture it as tests, or to generate it automatically from other information sources. Any documentation that you write today needs to be maintained over time, slowing you down.
Figure 4. Comparing Artifact Creation by DW Teams (click to enlarge).
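To illustrate how the kinds of rules traditionally recorded in an LDM can instead be captured as executable regression tests, here is a small hedged sketch in Python using the standard sqlite3 module. The tables (customer, customer_order) and the three rules in RULES are hypothetical examples, not taken from any real model; each rule is a query that returns the rows violating it.

```python
import sqlite3

# Hypothetical rules that a logical data model might otherwise document statically.
# Each query returns the rows that violate the rule; an empty result means it holds.
RULES = {
    "every order references an existing customer": """
        SELECT o.id FROM customer_order AS o
        LEFT JOIN customer AS c ON c.id = o.customer_id
        WHERE c.id IS NULL
    """,
    "customer email is unique": """
        SELECT email FROM customer GROUP BY email HAVING COUNT(*) > 1
    """,
    "order total is never negative": """
        SELECT id FROM customer_order WHERE total < 0
    """,
}


def build_schema(conn):
    """Create the illustrative tables that the rules are checked against."""
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT);
        CREATE TABLE customer_order (id INTEGER PRIMARY KEY,
                                     customer_id INTEGER, total REAL);
    """)


def failing_rules(conn, rules=RULES):
    """Return the names of the rules that currently have at least one violation."""
    return sorted(name for name, query in rules.items()
                  if conn.execute(query).fetchone() is not None)
```

Written before the schema or load code exists, the rule list acts as the specification; rerun after every small change, it acts as the regression suite.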
For a better understanding of why traditional DW teams are likely to write too much documentation far too early in the lifecycle, you should read the article The Cultural Impedance Mismatch.
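One way to avoid hand-maintained meta-data documentation, as suggested in the list above, is to generate it from the live schema. The sketch below assumes a SQLite database and uses its PRAGMA table_info statement via Python's standard sqlite3 module; the function names and the shape of the output records are illustrative choices, not a standard.

```python
import sqlite3


def describe_table(conn, table):
    """Return one meta-data record per column, read from the live schema."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # PRAGMA table_info yields: cid, name, type, notnull, dflt_value, pk
    return [
        {"table": table, "column": name, "type": col_type,
         "nullable": not notnull, "primary_key": bool(pk)}
        for _cid, name, col_type, notnull, _default, pk in rows
    ]


def describe_database(conn):
    """Return meta-data records for every table, ordered by table name."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    return [record for table in tables for record in describe_table(conn, table)]
```

Because the records are derived from the schema on demand, they cannot fall out of date the way a separately maintained document can.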
4. Related Resources
- Clean Database Design
- Data Vault Alliance
- Implementing a Data Warehouse via Vertical Slicing
- Agile Core Practices for DW/BI Teams
- The Disciplined Agile Web Site
- The Agile Modeling Web Site
- Look-Ahead Data Analysis
- Question Stories
- Workshop: Continuous Data Warehousing (DW): A Disciplined Hybrid Method for Practitioners – Two days
- Workshop: Continuous Data Warehousing for Leaders – One day
Trademark Notice
“Disciplined Agile” is a registered trademark of Project Management Institute.