When it takes several days, and sometimes weeks, to perform data analytics,
how do you remain agile? The answer is to perform look-ahead data analysis.
What is Look-Ahead Data Analysis?
As the name implies, this is effectively the application of Agile Modeling's
look-ahead modeling practice to data analysis. Scrum teams refer to look-ahead modeling as backlog refinement or even backlog grooming.
The basic idea is that you do just enough analysis work
to explore and understand the data source(s) so that a data requirement,
likely captured in the form of a question story,
can be implemented. When you have existing, high-quality reporting data sources
(data warehouses (DWs), data marts, data lakes, ...) that contain the data you need, then
the data analysis work is fairly straightforward. When this is not the case, when
the data resides in legacy OLTP (online transaction processing) databases or in external data sources
(think big data), you may require significant effort to explore, understand, and document
the incoming source data. It is this effort that look-ahead data analysis focuses on.
Data analysis activities may include:
- Identification of potential data sources
- Profiling/exploring data sources
- Identification of multiple sources of specific data
- Assessment of data quality
- Selection of the best source(s) of specific data
- Formulation of data cleansing rules
- Mapping of source data to required data elements
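The profiling and quality-assessment activities above can be sketched in code. The following is a minimal, illustrative example in plain Python (the column names and extract are hypothetical, not from any specific tool or source system): for each column of a legacy extract it records the basics a data analyst captures when exploring a source, such as the null rate and the distinct-value count.

```python
def profile(rows, columns):
    """Summarize each column of a source extract:
    {column: {"null_pct": float, "distinct": int}}."""
    summary = {}
    n = len(rows)
    for col in columns:
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v is not None]
        summary[col] = {
            # Percentage of rows where this column is missing.
            "null_pct": round(100.0 * (n - len(non_null)) / n, 1),
            # Number of distinct non-null values observed.
            "distinct": len(set(non_null)),
        }
    return summary

# Hypothetical legacy customer extract with typical quality problems:
# missing emails and inconsistent region codes ("E" vs "EAST").
legacy = [
    {"cust_id": 101, "email": "a@x.com", "region": "E"},
    {"cust_id": 102, "email": None,      "region": "EAST"},
    {"cust_id": 103, "email": "c@x.com", "region": "W"},
    {"cust_id": 104, "email": None,      "region": "W"},
]
summary = profile(legacy, ["cust_id", "email", "region"])
```

A profile like this quickly surfaces candidates for data cleansing rules, for example that the region codes need to be standardized before the source can be mapped to the required data elements.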
Why Look-Ahead Data Analysis?
There are several reasons why you need to perform look-ahead data analysis:
- Data source documentation isn't trustworthy. Very often the artifacts
describing existing data sources, if there are any at all, are neither complete nor up-to-date.
As a result you need to analyze the legacy data source(s) to determine what is
actually there and what is potentially usable.
- Data engineers need to understand the source data. The data engineers, or in some cases developers,
need to understand what data they will use and how to manipulate it in order to
implement whatever is required to address a given question story.
- You want the best data available. You very often have many data sources available
to you capable of providing the data you need. One goal of data analysis is to
help you to identify the best available source for the data that you require, not just
a source. Be quality infected.
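One way to make "the best source, not just a source" concrete is to rank candidate sources of the same data element by a simple quality score. The sketch below is purely illustrative, with hypothetical source names, metrics, and weights; in practice you would base the metrics on your own profiling results.

```python
def quality_score(source):
    """Score a candidate source from 0 to 1, weighting completeness
    more heavily than freshness (illustrative weights)."""
    return 0.7 * source["completeness"] + 0.3 * source["freshness"]

# Three hypothetical sources capable of providing the same customer data.
candidates = [
    {"name": "legacy_crm.customer",  "completeness": 0.62, "freshness": 0.90},
    {"name": "billing_oltp.account", "completeness": 0.95, "freshness": 0.80},
    {"name": "marketing_extract",    "completeness": 0.70, "freshness": 0.40},
]

# Select the best available source rather than the first workable one.
best = max(candidates, key=quality_score)
```

The point is not the specific formula but the discipline: compare the candidate sources explicitly before committing to one.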
Look-Ahead Data Analysis on Agile Teams
Many teams choose to follow an agile lifecycle,
typically based on Scrum.
Figure 1 depicts the look-ahead data analysis work required for
three question stories
that are to be implemented during sprint #9 of an
agile DW/BI initiative.
Notice how each question story requires a different amount of data analysis effort
because every question has unique data needs.
Figure 1. Look-ahead data analysis on an agile team. Click to enlarge.
There are several interesting implications for look-ahead data analysis on agile/Scrum teams:
- You need to get good at guesstimating. To schedule the look-ahead data analysis
properly, the person(s) doing the work will need to guesstimate the amount of effort required for
each question story. This is required because the people with
agile data analysis skills
have a limited amount of capacity.
- The data analysis of several sprints will overlap. Figure 1 depicts the look-ahead data analysis efforts for the three question stories being implemented in sprint #9 only. Previous sprints would have also required similar efforts.
- You will be limited by availability of agile analytics skills. See the discussion of staffing below.
- The first few sprints will be rough. The data analysis work needs to get ahead of the implementation work. The implication is that it may be a few sprints before the team delivers sufficient functionality to address a single question story.
- Your definition of ready (DoR) must address data issues. Many agile/Scrum teams have a DoR that defines the minimum level of quality of work that needs to be put into a story before the team is willing to work on it. See DoRs for question stories for a detailed discussion.
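A DoR for question stories can even be made machine-checkable. The sketch below is illustrative only; the criteria are hypothetical examples drawn from the data analysis activities discussed above, not a standard DoR.

```python
# Hypothetical DoR criteria a team might agree on for question stories.
DOR_CRITERIA = (
    "sources_identified",        # potential data sources identified
    "data_quality_assessed",     # sources profiled and quality assessed
    "source_to_target_mapped",   # source data mapped to required elements
    "effort_estimated",          # guesstimate exists for scheduling
)

def is_ready(story):
    """A question story is ready when every DoR criterion is satisfied."""
    return all(story.get(criterion, False) for criterion in DOR_CRITERIA)

story = {
    "title": "What is monthly churn by region?",
    "sources_identified": True,
    "data_quality_assessed": True,
    "source_to_target_mapped": False,  # analysis still in flight
    "effort_estimated": True,
}
```

With this story, `is_ready` would report that it is not yet ready: the source-to-target mapping is still outstanding, so the team should not pull it into a sprint.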
Look-Ahead Data Analysis on Lean Continuous Delivery Teams
Advanced teams choose to follow a
lean continuous delivery lifecycle,
based on Kanban and DevOps strategies.
Figure 2 depicts the look-ahead data analysis work for the same three question stories
from Figure 1, the difference being that the work is done on
a just-in-time (JIT) basis rather than scheduled into fixed-length sprints. Note that the same amount
of data analysis is still required for each question story as in Figure 1, but that the implementation
time is no longer tied to a two-week sprint.
Figure 2. Look-ahead data analysis on a continuous delivery team. Click to enlarge.
There are several interesting implications for look-ahead data analysis on continuous delivery teams:
- There are no scheduling issues. The work is done on a JIT basis, removing
the need for the guesstimation and scheduling overhead. The team takes on the work
for a new question story once they have the capacity to do so.
- It's really just part of the implementation work. Because the work is done on a
JIT basis, there's less need to distinguish between analysis work and any other work required
to implement a question story.
- It will be easier to collaborate and learn together. While there is still
a challenge around having sufficient people with
agile analytics skills,
the people performing data analysis are more likely to be working closely with the
data engineers and developers and thus are more likely to share their skills with
them. As a result you are likely to grow more people, and do so faster, with
agile data analysis skills.
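The JIT pull described above can be sketched as a simple work-in-progress (WIP) limited queue, the core mechanism of Kanban. This is an illustrative toy model, with made-up story names and a made-up WIP limit of two:

```python
from collections import deque

WIP_LIMIT = 2  # the team's capacity for concurrent question stories

backlog = deque(["Q1", "Q2", "Q3", "Q4"])  # prioritized question stories
in_progress = []
done = []

def pull():
    """Pull the next story from the backlog while capacity remains."""
    while backlog and len(in_progress) < WIP_LIMIT:
        in_progress.append(backlog.popleft())

def finish(story):
    """Complete a story; freed capacity immediately pulls the next one."""
    in_progress.remove(story)
    done.append(story)
    pull()  # JIT: no sprint boundary to wait for

pull()  # the team starts by pulling up to its WIP limit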
Factors That Affect The Timing of Look-Ahead Data Analysis
There are several factors that will determine how far ahead you need to
perform look-ahead data analysis:
- The existing data in your data warehouse (DW). The more data that you have in your reporting data sources (DWs, data marts, data lakes, ...), and the higher its quality, the less data analysis you will need to perform against legacy data sources.
- The complexity of the data source(s). The more complex the data source, either in structure or in contents, the longer it will take to analyze.
- Your ability to gain access to the data source(s). You may not have easy access to a data source that you need to analyze, and it can take time to gain that access.
- The quality of the existing documentation. The higher the quality of the documentation describing the data source(s), if any exists at all, the easier it will be to understand the data source.
- The difficulty of the question being asked. The more difficult, or complex, the question to be answered, the longer it will likely take to analyze the data source(s) required to answer that question.
- The skill, experience, and knowledge of the data analyst(s). Highly capable data analysts are generally more effective than novices, and thus work faster and generally produce better results. In short, it really does depend on who is doing the work as to how long it will take.
- The availability of the data analyst(s). You will need to staff accordingly. As more people with data analytics skills become available, the shorter the lead/wait time will be for people to do the work.
- Your data profiling tools. The more effective your tools, the easier it will be to explore existing data source(s).
Staffing Look-Ahead Data Analysis
There are several challenges that you are likely to face when staffing
for look-ahead data analysis. These challenges, and potential solutions,
are summarized in Table 1.
Table 1. Staffing challenges for look-ahead data analysis.
Challenge: Lack of people with agile analytics skills. This is the primary limiting factor. Potential solutions:
- Hire agile data analysts. Unfortunately it is currently difficult to hire agile data professionals as the demand outstrips the supply.
- Motivate people to become generalizing specialists. A generalizing specialist has one or more specialties, such as data analysis, and a general knowledge of the domain that they are working in. If you can nudge people to pick up data analysis skills as one of the specialties that they are working on, or at least to pick up basic skills, then you will soon have a growing number of people capable of look-ahead data analysis.
- Adopt collaborative strategies. When people work closely together they pick up skills from one another, particularly when they follow non-solo work techniques such as pairing or mobbing. If traditional data analysts purposefully pair/mob with agilists then they will pick up agile ways of thinking (WoT) and ways of working (WoW), and the agilists will pick up valuable data analysis skills.
- Train and coach existing data analysts in agile. This is possible, although it can be difficult to find agile data coaches.
- Train and coach existing agile developers in data analysis. This is possible, although will likely be very difficult as few agile developers have a background in data. This is likely to require significant investment as a result.
- Adopt better tooling. There are great data profiling, data modeling, and data extraction tools available to you. Are you using them?
Challenge: Sprint scheduling conflicts. You can only do so much look-ahead data analysis at any given time. Potential solutions:
- Create smaller question stories. Although this depends on the needs of your stakeholders, if you are able to simplify question stories then the data analysis required to support their implementation will likely decrease. The less analysis there is to do, the less likely you'll have overlap between sprints.
- Adopt shorter sprints. By shortening (yes, shorten, not lengthen) your sprint length, usually from two weeks to one week, you will motivate three important behaviors: You will find ways to improve your way of working (WoW), reduce the size of question stories, and reduce the number of work items you address each sprint. All of these behaviors will help reduce the chance of schedule conflicts.
- Adopt a lean, continuous delivery approach. You significantly reduce your overall coordination/scheduling overhead by moving away from sprints/timeboxes, as you learned above.
- Increase the number of people with agile data analysis skills. The more people who are able to perform agile data analysis, the less likely sprint scheduling conflicts will occur.
Challenge: Centralized data teams. They become a bottleneck when multiple teams need help simultaneously. Potential solutions:
- Build whole teams. A team is whole when it has sufficient people, with the right skills, to achieve the outcome(s) they have taken on. The implication is that if a team needs to perform agile data analysis then they should have the people on the team with those skills and not rely on another team to do the work for them.
- Create an agile data community of practice (CoP). A CoP is a group of like-minded people who choose to learn together. One of the reasons why centralized teams exist is the belief that it's the only way to develop a group of people with those skills. The fact is that CoPs are another way to accomplish that goal.
- Increase the number of people with agile data analysis skills. The more people you have with agile data analysis skills, the easier it will be to rework the centralized data team.
Look-Ahead Data Analysis in Context
The following table summarizes the trade-offs associated with look-ahead data analysis
and provides advice for when to adopt it.
Table 2. Look-ahead data analysis in context.
Advantages:
- Provides data analysts with the opportunity to perform sufficient data analysis to explore the requisite data sources before implementation begins.
- Increases the chance that the data engineers will work with the best data in the most appropriate way.
- Ensures that the data engineers have the information they require to implement a question story.
- Supports a definition of ready (DoR) strategy popular with teams following a Scrum-based approach.
Disadvantages:
- Requires people with sufficient data analysis skills.
- The required effort can be hard to predict due to accessibility of, and quality issues with, source data. This analysis work may range from several hours to several weeks in effort.
- Data analysts new to agile will often fall into the trap of over-analyzing the source systems rather than focusing on just the data required to address the question story they are currently performing analysis for.
When to Adopt This Practice
Whenever you require data from legacy data sources that has not yet been brought into your existing
data warehouse/lake/... environment.