Massively Useful

March 6, 2014

Root Cause Analysis: Asking Five Why’s

There’s a technique used in Root Cause Analysis called “5 Why’s.”

The idea is to define the problem and ask why it occurred. Get the answer to that question, then ask why THAT happened. Repeat until you’ve gone about five levels deep.

This trivial example comes from the link above:

You are on your way home from work and your car stops.

Why did your car stop? Because it ran out of gas.
Why did it run out of gas? Because you didn’t buy any gas on your way to work.
Why didn’t you buy any gas this morning? Because you didn’t have any money.
Why didn’t you have any money? Because you lost it all last night in a poker game.
Why did you lose all your money in a poker game? Because you’re no good at bluffing.

OK, that’s a fairly trivial example. Sometimes things really are that simple, or even simpler: you might need only two or three why’s to get to the true root cause. (If you find yourself referring to universal constants or physical laws, such as “because f = ma,” you have probably gone far enough.)

But sometimes things aren’t so linear. Sometimes there is more than one answer to a why. Sometimes, there are multiple why’s to ask of a because.

Five isn’t a magic number; it’s simply to encourage you not to stop at the first, second, or even third level of analysis. Forcing you to go five levels deep should make you go beyond the obvious.

Branching questions and multiple answers often arise when doing root cause analysis of complex systems, such as multi-tier software applications.

Don’t be discouraged. And don’t get hung up on finding a tidy linear flow of the questions. Just keep asking why. For example, let’s imagine a root cause analysis for an application that stopped responding:

Why did the application stop responding? Because it couldn’t write to the disk.
Why couldn’t the application write to the disk? Because there was no free disk space.
Why was there no free disk space? Because a process was crashing repeatedly, creating a large dump file each time.

Now, there are several questions which can be asked here. And sometimes, a single question can have multiple answers!

Why was the process crashing? The following could all be valid answers in the same case.

Because the programmer made a mistake that wasn’t caught by the compiler.
Because the programmer made a mistake that wasn’t caught by code review.
Because the environment the software was running in did not conform to assumptions made in the solution design.
Because the instance had not upgraded to a newer version of the software which included a fix for this bug.

And each of these could in turn have multiple why’s. That’s OK. Don’t get wrapped around the axle. Generate a bunch of these!
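One way to keep track of branching why’s is to record them as a tree, where every answer can carry its own follow-up answers. Here is a minimal sketch in Python. The `Cause` class and the `leaf_causes` helper are hypothetical names made up for illustration, not part of any standard tool, and the data is just the example above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cause:
    """One answer to a "why?" question (hypothetical structure for illustration)."""
    text: str
    # Answers to "why did THIS happen?"; may be empty or hold several entries.
    whys: list[Cause] = field(default_factory=list)


def leaf_causes(node: Cause) -> list[str]:
    """Return the deepest answers reached so far: the candidate root causes."""
    if not node.whys:
        return [node.text]
    leaves = []
    for child in node.whys:
        leaves.extend(leaf_causes(child))
    return leaves


# The crashing process has four possible answers, all valid in the same case.
crash = Cause("a process was crashing repeatedly, creating large dump files", [
    Cause("a mistake was not caught by the compiler"),
    Cause("a mistake was not caught by code review"),
    Cause("the environment did not conform to design assumptions"),
    Cause("the instance had not upgraded to a version with the fix"),
])

problem = Cause("the application stopped responding", [
    Cause("it couldn't write to the disk", [
        Cause("there was no free disk space", [crash]),
    ]),
])

# Each leaf is a place where you can keep asking why.
for cause in leaf_causes(problem):
    print("Still to ask why:", cause)
```

The leaves are the answers nobody has questioned yet, so they double as a to-do list for the next round of why’s.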

If you are having a hard time coming up with why’s, consider the system from a different perspective. If you have been looking at it from a system administrator’s perspective, for instance, try the point of view of:

an end user
an infosec engineer – or an intruder
a business stakeholder who doesn’t use the software but relies on people who do
a tech support engineer
a developer

Switching perspectives can also help if the why’s you generate lead only to insights that can’t practically be acted upon, for example because they are outside the control of any of the involved parties.

If you do successfully generate a large number of why’s, it can be helpful to arrange them in a fishbone or tree diagram to show how they relate to each other.
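If you don’t have a diagramming tool handy, even an indented outline shows the relationships. The following Python sketch is one possible approach, not a standard one: it assumes each node is a plain `(text, children)` tuple, and the `render` function is a hypothetical helper written for this post. The sample data is a shortened version of the application example.

```python
def render(node, depth=0):
    """Return the tree as indented lines, one answer per line."""
    text, children = node
    prefix = "Problem: " if depth == 0 else "because "
    lines = ["  " * depth + prefix + text]
    for child in children:
        # Each level of indentation is one more "why?" asked.
        lines.extend(render(child, depth + 1))
    return lines


# Each node is (text, list of child nodes); leaves have an empty list.
tree = ("the application stopped responding", [
    ("it couldn't write to the disk", [
        ("there was no free disk space", [
            ("a process kept crashing and writing large dump files", [
                ("a mistake was not caught by code review", []),
                ("the instance had not been upgraded to a fixed version", []),
            ]),
        ]),
    ]),
])

print("\n".join(render(tree)))
```

Siblings at the same indentation are alternative or contributing answers to the same why, which is the same information a tree diagram would show.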