Improving your Root Cause Analysis

I’m often asked to help people get more value from their root cause analysis (RCA). Here’s my most common advice, along with an RCA process you can use to improve your software development.

Summary

  1. When determining the problem, ask why till you hit a people issue.
  2. Ask what’s the earliest process that could have prevented this. You can apply this question to many bugs at once using a dropdown in your bug tracking system.
  3. Implement more than your first solution.

Background

Root Cause Analysis is a process you use after a production issue to get to the root of it, fix that root cause, and make sure it never happens again. The process can be formal or quick and dirty. It can be done for one issue or many at once. It should never be about finding blame. The ultimate point is to identify, fix, and prevent.

Identifying the Root Cause

My favorite technique for determining root cause is the 5 whys technique. You ask why something happened and then you ask why that happened, and so on like a 5-year-old. It would be annoying if it wasn’t so effective.

The number five is arbitrary. It’s meant to be high enough to keep people asking why rather than stopping at the first answer. Stopping the whys too soon is the biggest mistake people make with this technique.

My advice is to keep going till you get past the technical causes to a people cause. No, I don’t mean find someone to blame. While the technical causes are important, there’s often an underlying people issue.

Let me give the most recent example from a client. They had an outage and did a very nice investigation, which showed that, while they had been trying to implement Infrastructure as Code (IaC), they had also made changes in production outside of the code. When the deployment happened, production was in a state the code could not deal with, and the deployment broke things.
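To make that failure mode concrete, here is a minimal sketch of a drift check. The setting names and values are hypothetical and not taken from the client's system; real IaC tools have their own ways to report drift. The idea is only to compare what the code declares with what production actually has.

```python
# Hypothetical data: what the IaC code declares vs. what is running in production.
declared = {"instance_type": "m5.large", "min_nodes": 3, "tls": "1.2"}
actual = {"instance_type": "m5.xlarge", "min_nodes": 3, "tls": "1.2", "debug_port": 8080}


def find_drift(declared, actual):
    """Return {setting: (declared_value, actual_value)} for every mismatch.

    A setting that exists on only one side shows up with None on the other.
    """
    drift = {}
    for key in sorted(declared.keys() | actual.keys()):
        want = declared.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


drift = find_drift(declared, actual)
for setting, (want, have) in drift.items():
    print(f"{setting}: code says {want!r}, production has {have!r}")
```

Any non-empty result means someone changed production outside the code, which is exactly the condition the deployment could not handle.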

Their RCA found this issue, which is great, but they stopped there. My advice is to take it one step further. Why were people not using IaC in production?

  • Is it a training issue where the operational changes are made by people who don’t know the code?
  • Are the people making different changes under different reporting structures that have different processes?
  • Is there an incentive for making changes quickly rather than going through the code path?

There’s a categorization system called the 6 boxes approach which can help with identifying people issues.

The Lippitt-Knoster approach for change management is another nice model for identifying people issues.

Both systems provide a framework to think about what is missing in the way people are working together that could lead to the problem you are seeing.

Fixing the Root Cause

The biggest problem I see coming out of RCAs is not implementing enough of the fixes! Either the team implements the first or easiest fix, or none at all. I very often see the same root cause suggestions appear over and over on multiple RCAs, and that’s just a waste of time. If you’ve found it, fix it!

For each potential fix ask:

  • Is the fix a good idea, or will it just disguise a bad architecture or practice? Fixes of the latter kind do not address the root cause and should be avoided.
  • Are the fixes independent? If two fixes are for the same problem, you may want to just do the best fix and wait to see if that is enough.

Otherwise, my advice is to do as many of the fixes as you can that would have prevented the issue. It seems silly to have to say this, but fix it all! People have a strong tendency to grab the first or easiest solution. Take the time to do several.

Preventing the Issue

Let’s look at a method to evaluate a whole lot of issues at once, to find a common problem, and its prevention, in your organization.

My favorite question for this phase of the RCA is “What’s the earliest process that could have prevented this?” It might be the most important question in the RCA. Do you really need to know the exact cause if you can figure out a way to prevent it?

There’s a simple way to identify that root cause that gives you real data to fix your team or company’s software development.

  1. Add a dropdown to your bug tracking system (Jira, TFS, Salesforce, whatever). In the dropdown, put a list of preventions such as:
  • Requirements from PM
  • Acceptance demo and test
  • Unit test
  • Component/API test
  • Integration or Resource use test
  • A/B test with real customer data
  • Deployment automation
  • Deployment risk prevention (e.g., canary, blue/green)

You can change these or add your own but try to start with fewer than 10 choices.

2. Review about 100 to 200 high priority bugs in your system. Definitely look at issues in production and escalated from customers, but also consider ones found internally after merge into main that didn’t make it to production.

Categorize each bug with that dropdown. This is an ugly manual job and will take a couple of hours, but it’s worth doing once every six months to a year. Don’t farm it out to the dev teams. Just get one to three people to go through them all quickly.

3. Look for patterns. Most companies have one or two areas where they are having problems. This exercise gives you the data to prove that those areas need to be fixed.

I’ve seen problems in most of the areas mentioned in the list, except unit testing. Unit testing wasn’t shown to be a great bug preventative for P1 bugs in the few companies I looked at. I think that’s because it’s better at documenting the code and making it maintainable than preventing bugs.
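The pattern search in step 3 can be sketched as a simple tally. This is only an illustration: the bug IDs and the list below are made up, and it assumes your tracker can export each reviewed bug together with the dropdown value you picked for it.

```python
from collections import Counter

# Hypothetical export: (bug id, value chosen in the "earliest prevention" dropdown).
reviewed_bugs = [
    ("BUG-101", "Requirements from PM"),
    ("BUG-102", "Deployment automation"),
    ("BUG-103", "Requirements from PM"),
    ("BUG-104", "Component/API test"),
    ("BUG-105", "Requirements from PM"),
    ("BUG-106", "Deployment automation"),
]


def tally_preventions(bugs):
    """Count bugs per prevention category, most common first."""
    return Counter(prevention for _, prevention in bugs).most_common()


for prevention, count in tally_preventions(reviewed_bugs):
    share = 100 * count / len(reviewed_bugs)
    print(f"{prevention:<25} {count:>3}  ({share:.0f}%)")
```

With a real sample of 100 to 200 bugs, the one or two categories at the top of this list are the areas worth a closer look.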

4. Find subcategories and do another pass. Most companies seem to have one or two major areas where they need help. Unfortunately, the exercise above will have narrowed down the problem but not provided a root cause. It just tells you where to look.

For example, were requirements from PM not translated from epics to stories? Were they not in the acceptance criteria? Were they not validated with customers? Did they not include the customer, the problem from the customer’s point of view, and the outcome the customer would get?

You’ll need a new set of categories and another pass of just the bugs in that category to narrow down your root cause among the available issues. In the spirit of 5 whys, you may need several passes to get enough information to go all the way to root cause. This will sound like a lot of work, and it is, but you are doing a massive RCA for a lot of issues to get real data on where to fix your software development lifecycle.
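A second pass can be sketched the same way: keep only the bugs in the category you are drilling into and tally a new set of subcategories. The records and subcategory names below are hypothetical, loosely based on the requirements questions above.

```python
from collections import Counter

# Hypothetical second-pass data: (bug id, first-pass category, subcategory).
second_pass = [
    ("BUG-101", "Requirements from PM", "Not in acceptance criteria"),
    ("BUG-103", "Requirements from PM", "Not validated with customers"),
    ("BUG-105", "Requirements from PM", "Not in acceptance criteria"),
    ("BUG-102", "Deployment automation", None),  # outside the focus area, not re-reviewed
]


def tally_subcategories(bugs, focus):
    """Count subcategories for bugs whose first-pass category equals `focus`."""
    return Counter(
        sub for _, category, sub in bugs if category == focus and sub is not None
    )


subcounts = tally_subcategories(second_pass, "Requirements from PM")
for sub, count in subcounts.most_common():
    print(f"{sub}: {count}")
```

Repeat with a narrower set of subcategories each time until the answers stop being technical and start pointing at how people work together.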

Copyright © 2021, All Rights Reserved by Bill Hodghead, shared under creative commons license 4.0