

Estimating Ambiguity

Scrum's story points rapidly turn into a game of Numberwang. What are we actually trying to do there, and can we do it without wasting everyone's time?

When working on an “agile” team, one of the key pieces is figuring out how many of the things you need to do you think you can get done in the next two weeks. One of the more common tools for this is called “story points”. Each task that needs to happen is turned into a ticket, and the team sits around and “sizes” the tickets by assigning each one a number from the Fibonacci sequence – 1, 2, 3, 5, 8, 13. These numbers are unit-less. This turns into a game of Numberwang pretty quickly.

What are we actually trying to do here, and is there a way to do it that feels less pointless and random? During a recent session of “agile” training I pressed the consultant on the purpose of these numbers. Why are these helpful? We know they are wrong, and will be wrong. We know that small things will be big things, and big things might be small things. How does playing Numberwang on a pile of tickets create helpful estimates for work timelines? Apparently, as long as the ticket sizes are clumped around a median within a standard deviation or two, it doesn’t matter where any one ticket falls. They all wash out to about right when you add them up. In other words, the amount of time a ticket will take is described by a probability distribution. And probability distributions are something I know about. Sizing tickets is a random number generator, but one that’s guided into a probability distribution by the team’s fingertip feel for their work.

If we take a step back and think about what we’re actually trying to model, can we find a different way of doing it?

In a project that is complex, uncertain, and ambiguous, we want to estimate the number of work units required to achieve an outcome. We don’t have any way of knowing whether achieving a particular outcome will require few or many units of work. If we look at how many units of work each of our previous outcomes took, we can determine the median and standard deviation of the collection. If the future outcomes we would like to realize are similar in scope, complexity, and ambiguity, we can use that probability distribution to estimate how likely it is that completing them all will take any given amount of work.
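As a rough sketch of that idea: the snippet below summarizes a pile of past task sizes and then resamples them (a simple bootstrap, which is my choice of technique here, not something the training prescribed) to see how the total for a batch of future tasks might spread out. The `history` numbers, the function names, and the batch size of 8 are all made up for illustration, and the whole thing assumes future tasks look like past ones.

```python
import random
import statistics

def summarize(history):
    """Return (median, sample standard deviation) of past work units per task."""
    return statistics.median(history), statistics.stdev(history)

def simulate_totals(history, n_tasks, trials=10_000, seed=0):
    """Resample past task sizes to simulate total work units for n_tasks future tasks."""
    rng = random.Random(seed)
    return sorted(sum(rng.choices(history, k=n_tasks)) for _ in range(trials))

def percentile(sorted_totals, p):
    """Return the value at or below which roughly a fraction p of the totals fall."""
    index = min(len(sorted_totals) - 1, int(p * len(sorted_totals)))
    return sorted_totals[index]

# Hypothetical history: work units consumed by ten finished tasks.
history = [2, 3, 3, 4, 5, 5, 6, 8, 9, 14]

median, spread = summarize(history)
totals = simulate_totals(history, n_tasks=8)

print(f"median={median}, stdev={spread:.1f}")
print(f"half of the simulated batches fit in {percentile(totals, 0.5)} units")
print(f"90% of the simulated batches fit in {percentile(totals, 0.9)} units")
```

The useful output is the pair of percentiles rather than a single number: "probably about this much, and very likely no more than that much".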

This requires a few changes from the traditional way of modeling this work, but the good news is that none of these changes need any data we don’t already have, and the entire team doesn’t need to participate. We just count up how many work units, defined as contiguous blobs of focused time, a given task consumed on its road to completion. We count these by tracking the working days the task was under active development (not waiting around for review or approval or whatever) and counting how many blobs of time we had in each day. That gives us work units per task. Once we get a good pile of these, we can calculate the median and standard deviation, and model away. As tasks get done, they add to the overall distribution, further refining it.
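The counting itself can be sketched in a few lines. The log format here, `(task, day, blobs)` tuples, is an assumption of mine, as are the task IDs and dates; any tracker that can tell you which days a task was actively worked on would do. Note that the log deliberately has no "who" column.

```python
from collections import defaultdict

def work_units_per_task(log):
    """Sum focused-time blobs per task, skipping days with no active development."""
    totals = defaultdict(int)
    for task, day, blobs in log:
        if blobs > 0:
            totals[task] += blobs
    return dict(totals)

# Hypothetical log: (task id, working day, team-wide blobs spent that day).
log = [
    ("TASK-1", "2024-03-04", 2),
    ("TASK-1", "2024-03-05", 3),
    ("TASK-1", "2024-03-06", 0),  # waiting on review, so nothing is counted
    ("TASK-2", "2024-03-05", 1),
    ("TASK-1", "2024-03-07", 1),
]

units = work_units_per_task(log)
print(units)  # {'TASK-1': 6, 'TASK-2': 1}
```

The values of `units` are exactly the per-task numbers that feed the median and standard deviation.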

This also explains the push toward many small tasks. The more data points in a model like this, the smoother the probability function and the better defined the estimates. Having lots of things that can swing a little in either direction creates a more predictable distribution (i.e., a tighter standard deviation) than a few things that can swing a lot in either direction (i.e., a broad standard deviation).
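A back-of-the-envelope way to see this, assuming tasks are independent and each has a finite mean and standard deviation: the mean of a total grows with the number of tasks, but its standard deviation only grows with the square root of that number. The function name and the figures below are hypothetical; both scenarios add up to the same expected 100 units, and in both each task's standard deviation is half its mean.

```python
import math

def relative_spread(mean, stdev, n_tasks):
    """Standard deviation of the total divided by its mean, for independent tasks."""
    total_mean = n_tasks * mean
    total_stdev = stdev * math.sqrt(n_tasks)
    return total_stdev / total_mean

# Same expected total (100 units), split two hypothetical ways.
many_small = relative_spread(mean=2, stdev=1, n_tasks=50)
few_large = relative_spread(mean=25, stdev=12.5, n_tasks=4)

print(f"50 small tasks: total swings by about {many_small:.0%} of its mean")
print(f"4 large tasks:  total swings by about {few_large:.0%} of its mean")
```

With these numbers the fifty small tasks come out at roughly 7% and the four large ones at 25%.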

This method operates differently than story points in a couple of key ways. We don’t need to deal with abstract unit-less numbers that only gain relevance to the real world once aggregated into a time period: N story points only make sense when you know your team averages X story points per sprint. Instead, we use real blocks of real time which you can count – if you have holidays or surprise meetings, or get sick, all of that can be counted and make its way into your estimate. It also means that aggregating can happen over any time period, not just the classic two-week cycle, and that it doesn’t matter if tasks stretch past the sprint. Messing with the periods won’t mess with the math, since the anchor to reality is not the time periods but the work.
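Because the units are countable time, turning an estimate into a calendar answer is just a running sum. This is a minimal sketch under my own assumptions: `daily_capacity` is a made-up list of team-wide blobs available on each upcoming working day, with zeros for holidays, and `days_to_finish` is a hypothetical helper, not part of any tool.

```python
from itertools import accumulate

def days_to_finish(estimated_units, daily_capacity):
    """Return the 1-based day on which cumulative capacity covers the estimate.

    Returns None if the listed days do not provide enough capacity.
    """
    for day, available in enumerate(accumulate(daily_capacity), start=1):
        if available >= estimated_units:
            return day
    return None

# Hypothetical team-wide blobs per working day; 0 marks a holiday.
capacity = [6, 6, 0, 4, 6, 6, 5, 6, 6, 6]

print(days_to_finish(30, capacity))   # 7
print(days_to_finish(100, capacity))  # None
```

Nothing here cares where a sprint boundary falls; the list can cover three days or three months.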

An essential caveat to this approach is that we must not track or count individuals’ work units. Beyond “these three people together put in 12 blobs of work”, disaggregating work into discrete “who did how many blobs when for what” will only create management pressure to attempt to “optimize” or “improve” numbers that are fundamentally not helpful. Comparing who did what work by looking at metrics misunderstands the fundamental nature of this kind of work, and will only encourage bad behaviors that optimize for defensive project management. This is why tracking your time for hourly billing in client projects sucks so much and burns people out in agency settings. Don’t do it. All we need to know is “this task took n total blobs from start to finish.”

Follow-up: Software Development is Not Normal

Bent Flyvbjerg has done a lot of thinking about this, looking at both construction mega-projects like the Olympics and large technology projects. He concludes, and I’ve seen support for this voiced among other experienced engineers (citation needed tho), that software development is a fat-tailed process, where black swan tasks occur far more often than one would expect from a normal probability distribution. The problem with sufficiently fat-tailed distributions is that they have no finite standard deviation. This means that when you generate a distribution to understand how a project might unfold, you are almost certain to be underestimating, and what’s worse, there’s no mechanism for generating any certainty about how wrong you are going to be. This tracks with my 10+ years of software development experience.
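To get a feel for how different the tails are, here is a small comparison. The Pareto distribution is my stand-in for "fat-tailed" (it is a standard textbook example, not necessarily the one Flyvbjerg uses), and every parameter below is invented: a normal model with mean 2 and standard deviation 1, against a Pareto with minimum 1 and shape 1.5. A Pareto with shape 2 or less has infinite variance, so it has no finite standard deviation to lean on.

```python
import math

def normal_tail(x, mean, stdev):
    """P(X > x) for a normal distribution."""
    return 0.5 * math.erfc((x - mean) / (stdev * math.sqrt(2)))

def pareto_tail(x, scale, alpha):
    """P(X > x) for a Pareto distribution whose minimum value is `scale`."""
    return 1.0 if x <= scale else (scale / x) ** alpha

# Hypothetical question: how likely is a task to blow past 20 work units?
threshold = 20
p_normal = normal_tail(threshold, mean=2, stdev=1)
p_pareto = pareto_tail(threshold, scale=1, alpha=1.5)

print(f"normal model: {p_normal:.3g}")
print(f"Pareto model: {p_pareto:.3g}")
```

Under the normal model a 20-unit task is effectively impossible; under the Pareto model it is roughly a one-in-ninety event, which on a backlog of a few hundred tickets means you should expect to meet several.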