The Agile Data (AD) Method

The Agile Database Techniques Stack: The Dev Side of DataOps

Modern software development processes, including SAFe, Extreme Programming (XP), and Scrum are all evolutionary if not agile in nature. The implication is that if data professionals are going to be effective members of such teams, then they need to adopt tools and techniques which enable them to do so. There is nothing special about the data aspects of your IT ecosystem: they can be developed in an evolutionary manner just like non-data aspects. Even data warehouse and business intelligence solutions can be developed in an evolutionary manner (and quite frankly this is the preferred approach). This article overviews what we call the agile database techniques stack, a collection of strategies that modern data development professionals require. It is organized into the following topics:

  1. Agile database techniques stack
  2. Why is this a stack?
  3. Why adopt the agile database techniques stack?
  4. What is the best way to adopt the agile database techniques stack?
  5. Can you adopt individual techniques?
  6. Agile database techniques and DataOps

1. The Agile Database Techniques Stack

The techniques stack is overviewed in Figure 1 below.

Figure 1. The Agile Database Techniques Stack (click to enlarge).

Agile Database Techniques stack

These strategies of the agile database techniques stack are:

  1. Thin slicing. A fundamental agile development technique is to thinly slice functionality into small, consumable pieces that may be potentially deployed into production quickly. These vertical slices are completely implemented – the analysis, design, programming, and testing are complete – and offer real business value to stakeholders. Thin slicing is completely applicable, and highly desirable, in data development.
  2. Data contracts. A data contract is a formalized agreement or specification that outlines the structure, format, and data semantics of data being exchanged. Data contracts establish clear expectations and responsibilities for data quality, indicating how data should be produced, transmitted, and used. Data contracts are applied programmatically to application programming interfaces (APIs) or other encapsulation strategies to ensure the quality of data being passed through them.
  3. Clean data architecture and clean data design. A clean data architecture strategy enables you to develop and evolve your data assets at a pace which safely and effectively supports your organization – in short, to be agile. Similarly, a clean database design enables you to evolve specific data assets in an agile manner.
  4. Agile data modeling. With an evolutionary approach to data modeling you model the data aspects of a system iteratively and incrementally. With an agile data modeling approach you do so in a highly collaborative and streamlined manner.
  5. Database engineering automation. This refers to two kinds of automation. First, automated support for data engineering tasks through better tools used by data engineers. For example, improved data profiling tools or data modeling tools.  Second, outright automation of significant data engineering activities.  For example, a tool that profiles an existing data source, maps it to a Data Vault 2 (DV2) raw vault schema, and generates the extract-load-transform (ELT) functionality required.
  6. Database refactoring and data repair. A database refactoring is a small change to your database schema which improves its design without changing its semantics (e.g. you don’t add anything nor do you break anything). The process of database refactoring is the evolutionary improvement of your database schema so as to better support the new needs of your customers. A data repair is a small fix to data that addresses a data quality problem. The process of data repair is the evolutionary improvement of your data quality.
  7. Automated database regression testing. You should ensure that your database schema actually meets the requirements for it, and the best way to do that is via testing. With a test-driven development (TDD) approach, you write a unit test before you write production database schema code, the end result being that you have an automated regression test for your database schema. to ensure data quality.
  8. Continuous database integration (CDI). Continuous integration (CI) is the automatic invocation of the build process of a system. As the name implies, continuous database integration (CDI) is the database version of CI.
  9. Continuous database deployment (CDD). Continuous database deployment (CDD) is the database version of continuous deployment (CD). In CD, when your continuous integration (CI) run succeeds, your changes are automatically promoted to the next environment. CDD is the automated deployment of successfully built database assets from one environment/sandbox to the next, accounting for the implications of persistent data.
  10. Configuration management. Your data models, database tests, test data, and so on are important artifacts that should be under configuration management, just like any other artifact.

 

2. Why is This a Stack?

We call it a stack because each technique relies on you being able to perform the ones below it. For example, Figure 2 shows that continuous database integration (CDI) requires you to have a configuration management strategy and will automatically invoke your CDD infrastructure if it exists.

Figure 2. How the techniques rely on each other (click to enlarge).

Agile Data Techniques Stack - Relationships

 

3. Why Adopt The Agile Database Techniques Stack?

Any given technique has advantages and disadvantages, including the ones overviewed in this article. Every given practice works well in some situations, and may even be the “best” that you can do in those situations, but doesn’t work well in others. Practices are contextual and should be presented as such. Presenting something as a “best practice” is deceptive, in my opinion. To prove my point, the advantages and disadvantages of each agile database technique are summarized in the following table.

Technique Advantages Disadvantages
Thin slicing
  • Deliver high-value functionality sooner.
  • Reduces risk.
  • Enables opportunity for feedback.
  • Requires clean architecture, design, and implementation.
  • Requires “full stack” data capability within a team to develop and evolve the solution.
Data Contracts
  • Provides a software-engineering interface to data sources.
  • Enforces consistent semantics of data coming out of, and potentially into, a data source.
  • Requires a relatively clean design, and better yet a clean architecture, of the system being encapsulated.
  • Requires engineers to have both software engineering and data engineering knowledge and experience to appreciate (which is a good thing to need).
Clean Data Architecture
  • Easier to understand.
  • Easier to evolve, thereby enabling agility.
  • Easier to validate.
  • Requires investment to keep clean, including in architectural modeling and architectural refactoring.
  • Existing legacy architectures often have significant technical debt that needs to be addressed before your architecture is sufficiently clean.
Clean Database Design
  • Easier to understand and evolve.
  • Easier to test.
  • Increases ability to evolve data hosting strategy.
  • Requires investment to keep clean, including in agile design modeling and database refactoring.
  • Existing legacy designs often have significant technical debt that needs to be addressed before your design is sufficiently clean.
Agile Data Modeling
  • Enables evolutionary exploration of both your problem and solution spaces.
  • Enables decisions at the most responsible moment.
  • Integrates into other agile ways of working.
  • Traditional data modelers struggle with agile data modeling strategies at first, particularly when they don’t yet have full-stack agile database skills.
  • Requires the ability to safely refactor whatever is being modeled.
Automated database engineering
  • Increases the productivity of data engineers.
  • Shortens the data engineering portion of the lifecycle.
  • You still need the skills to use the tools and the knowledge to understand what the tools are doing for you (and to you).
  • You become reliant on the tools, putting you at risk if they are no longer supported (and yes, this is still a problem with open-source tools, albeit less so than with commercial tools).
Database Refactoring
  • Enables safe evolution of a data source.
  • Supports removal of data debt.
  • Enables evolutionary database development.
  • Although it is possible to implement by hand, you really want to invest in a database refactoring tool.
  • Not possible for data sources that do not support executable functionality, such as stored procedures, for example.
  • Requires the capability to schedule and then execute upon the removal of refactoring scaffolding once systems that access the data source are updated.
Data Repair
  • Enables you to reduce overall technical debt as you will no longer need to address the data quality problem wherever you use the data.
  • Often straightforward to implement
  • Increases the understandability of your data through greater consistency
  • Changes to data sources accessed by multiple clients may affect one or more of those clients, requiring corresponding updates in those clients. Without automated regression test suites in place for those clients, this can be risky.
  • Some data repairs require a deprecation period where the original and new versions are supported in parallel, increasing overall complexity.
Automated Database Regression Testing
  • Enables inclusion of data sources in automated testing strategy.
  • Enables automatic enforcement of data standards and conventions.
  • Support executable specification strategy.
  • Requires people with test-thinking and testing skills.
  • Requires an understanding that data sources are enterprise assets AND must be treated as such.
  • Traditional data groups are often unwilling at first to include testing as one of their responsibilities.
  • Lack of automated tests for existing legacy data sources requires investment to develop them.
Continuous Database Integration (CDI)
  • Automates much of the drudgery around the overall build process.
  • Increases consistency and predictability of database evolution work.
  • Brings database evolution up to common software engineering practice.
  • Enables greater visibility into the work and work products of database evolution via automated logging of results.
  • Requires investment in automation of the database development infrastructure, potentially including new tooling.
  • Requires investment in the creation of automated tests.
  • May require training and coaching of data engineers in agile development techniques.
Continuous Database Deployment (CDD)
  • Automates the complexity and drudgery of database deployment
  • Enables evolutionary approaches to database development
  • Increases regulatory compliance of database updates
  • Requires investment in automation of the database development infrastructure, potentially including new tooling
  • May require training and coaching of data engineers in agile development techniques
Database Configuration Management
  • Enables you to manage versions of databases and the assets they comprise across environments.
  • Enables you to rollback portions of a database to previous versions.
  • Enables you to modify work products in parallel and then merge the work.
  • The change history maintained for assets makes it easier to identify the source of injected defects.
  • Requires database developers to be trained in CM-related skills.
  • May require new tooling, in particular the CM system itself.

4. What is the Best Way to Adopt the Agile Database Techniques Stack?

There isn’t a single, “best” way. It depends on the context of the situation that you face. Here are some potential strategies to consider:

  1. Small, incremental improvements. Treat the techniques stack like an improvement target, making small changes to work towards it. You’re very likely doing data architecture, database design, data modeling, some database testing, configuration management of some assets, and may even be improving the implementation of some of your legacy data sources. So start evolving your current way of working (WoW) to leverage ideas captured in the various techniques of the stack. I highly suggest that anyone interested in how to effectively improve via small steps to take a look at PMI’s Guided Continuous Improvement (GGI).
  2. Take a top-down approach. When you adopt a technique, you’ll find that you need to adopt at least some aspects of the technique that it immediately relies on. This continues recursively until you reach the bottom of the stack. In some ways you’ll be adopting “vertical slices” of the overall techniques stack. What I mean by this is that you’ll adopt just enough of each technique to get some value, then adopt some more of it, and so on.
  3. Take a bottom-up approach. This makes sense from a technical point of view, and it’s certainly easier, it tends to be difficult from a management point of view. The techniques towards the top of the stack tend to have the best short-term payback whereas the techniques towards the bottom have a longer-term payback. Starting bottom up, you’re effectively starting with the hardest strategies to justify (at least in organizations struggling to operate with a value-driven mindset rather than a cost-driven one).
  4. Take a middle-out approach. Some people choose to start with the more technically interesting techniques, particularly database refactoring and automated database regression testing, often because those are most likely to be new to them and to their organization. Other people will focus on improving what they’re already doing, data modeling and database design, but adopting more effective agile approaches. Either way, you will still need to swiftly adopt techniques lower on the stack.
  5. Adopt everything at once. This can be chaotic because in effect it’s a large change for your data group. As pointed out above, the top-down strategy tends to quickly evolve into this one anyway, although perhaps via “vertical slices” of the overall technique stack.

The list above is ordered by my personal preference, which is driven by what I have found to work in practice. But it really does depend on your situation, one approach does not work in all situations. There are no best practices.

5. Can You Adopt Individual Techniques?

Yes.

Each technique offers value on its own. However, because they build on each other, as you saw in Figure 2, and in the previous section, they are more effective when adopted together.

 

6. Agile Database Techniques Stack and DataOps

Figure 3 depicts the DataOps lifecycle, the mobius loop, with a collection of data and database techniques mapped to its activities. The figure includes several techniques that are not called out by the agile database techniques stack, such as data lineage and data security. This is because the agile database techniques stack is a subset of the agile data techniques available to you, capturing the ones most critical to the development side of loop.

Figure 3. Mapping data techniques to the DataOps lifecycle (click to enlarge).

DataOps/Data DevOps