Documenting Root Cause Analysis
Monday, June 18, 2012 at 8:00AM
Gary L Kelley in IT, Root Cause Analysis, incident

Inevitably in the world of systems, something will break, and a “Root Cause Analysis (RCA),” “Incident Analysis” or “After Actions” document will need to be written.  Many otherwise capable IT types freeze at the very thought of documenting an issue, so in this post we’ll cover an easy format to follow.

Documenting root cause analysis around an incident starts with keeping good notes during the incident.  I jot down the time and any facts I want to remember for later.  Any metrics pertinent to the issue should also be recorded (such as transaction volumes, CPU usage, throughput or impacted systems/users).
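For those who prefer to keep notes in a script rather than on paper, here is a minimal sketch of the idea.  The names IncidentNote, IncidentLog and jot are hypothetical, invented for this illustration, and the metric names are placeholders.  A plain text file with timestamps does the same job.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentNote:
    """One timestamped observation jotted down during an incident."""
    timestamp: datetime
    text: str
    metrics: dict = field(default_factory=dict)  # e.g. {"cpu_pct": 97}


class IncidentLog:
    """Hypothetical in-memory notebook for incident notes."""

    def __init__(self):
        self.notes = []

    def jot(self, text, when=None, **metrics):
        # Default to the current UTC time so notes from different people line up.
        if when is None:
            when = datetime.now(timezone.utc)
        note = IncidentNote(when, text, dict(metrics))
        self.notes.append(note)
        return note

    def in_order(self):
        # Notes are often entered out of order; sort by time for the write-up.
        return sorted(self.notes, key=lambda note: note.timestamp)
```

Calling log.jot("Database retries climbing", cpu_pct=97) records the fact, the time and the metric in one step, which is the raw material for the timeline later.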

There are four major sections to an RCA document.  We’ll explore each in detail.

Depending on the duration of the issue, the amount of detail included in the timeline will need to be adjusted.  A second-by-second analysis isn’t needed unless it is relevant to the issue.

Once the timeline is constructed, review it for any improvement opportunities.  Large incidents often take time to “declare” because the engineers are looking at individual symptoms and not gaining insight into overall patterns.  There are often very valuable learnings obtained from timeline analysis.
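One simple way to review a timeline is to measure the gaps between entries, especially the time from the first symptom to the point the incident was declared.  The sketch below assumes the timeline is a list of (time, event) pairs.  The times, event names and helper functions are all invented for illustration.

```python
from datetime import datetime


def gaps_in_minutes(timeline):
    """Return (minutes, earlier_event, later_event) for each consecutive pair."""
    ordered = sorted(timeline, key=lambda entry: entry[0])
    gaps = []
    for (t_prev, e_prev), (t_next, e_next) in zip(ordered, ordered[1:]):
        gaps.append(((t_next - t_prev).total_seconds() / 60, e_prev, e_next))
    return gaps


def minutes_until(timeline, event):
    """Minutes from the first timeline entry to the named event."""
    ordered = sorted(timeline, key=lambda entry: entry[0])
    start = ordered[0][0]
    for when, name in ordered:
        if name == event:
            return (when - start).total_seconds() / 60
    raise ValueError(f"event not in timeline: {event!r}")


# Hypothetical timeline; every entry here is made up.
timeline = [
    (datetime(2024, 1, 15, 8, 0), "First alert: slow responses"),
    (datetime(2024, 1, 15, 8, 25), "Second team reports timeouts"),
    (datetime(2024, 1, 15, 8, 55), "Incident declared"),
    (datetime(2024, 1, 15, 9, 40), "Service restored"),
]

time_to_declare = minutes_until(timeline, "Incident declared")
longest_gap = max(gaps_in_minutes(timeline), key=lambda gap: gap[0])
```

In this made-up case, 55 minutes passed between the first alert and the declaration.  That is the kind of number worth discussing in the review.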

On any given issue, engineers often provide a first-order analysis without having identified the root cause.  “High CPU” is rarely the root cause of a performance issue; it is usually a symptom of something deeper.

To get to the root cause, one technique is to ask “WHY” five (or more) times.

For example:

Problem:  poor performance

1 Why – High CPU

2 Why – The application was in a loop

3 Why – The database connection was lost, and the application kept retrying

4 Why – The network had an issue

5 Why – Switch supervisor failure

Only when the answers to the “whys” are exhausted will the root cause become apparent, and only then can a corrective action plan be put into place.
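The chain of “whys” above can be captured as a simple ordered list, which makes it easy to paste into the RCA document.  This is only a sketch: the function name is hypothetical, and it assumes the chain has been followed to the end, so the last answer is treated as the root cause.

```python
def format_why_chain(problem, whys):
    """Render a problem and its ordered "why" answers as plain text.

    Assumption: the chain is exhausted, so the last answer is the root cause.
    """
    if not whys:
        raise ValueError("at least one 'why' is needed to name a root cause")
    lines = [f"Problem: {problem}"]
    for number, answer in enumerate(whys, start=1):
        lines.append(f"{number} Why - {answer}")
    lines.append(f"Root cause: {whys[-1]}")
    return "\n".join(lines)


# The example chain from this post.
whys = [
    "High CPU",
    "The application was in a loop",
    "The database connection was lost, and the application kept retrying",
    "The network had an issue",
    "Switch supervisor failure",
]

print(format_why_chain("poor performance", whys))
```

If the chain stops at “High CPU,” the function will happily report that as the root cause.  The code cannot tell whether enough “whys” were asked; that judgment stays with the people doing the analysis.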

BTW, it’s my experience that the most common RCA from a communications carrier is NTF (No Trouble Found).

Tasks from Corrective Action Plans need to be managed like any other effort.
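Managing these tasks usually means giving each one an owner and a due date, then checking on them.  Here is a minimal sketch under those assumptions.  The CorrectiveAction fields, the overdue helper and the sample tasks are hypothetical, and most shops will use their existing ticketing or project tool instead.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CorrectiveAction:
    """One task from a corrective action plan (fields are assumptions)."""
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(actions, today):
    """Return open actions whose due date has passed."""
    return [a for a in actions if not a.done and a.due < today]


# Invented tasks, loosely based on the switch failure example in this post.
actions = [
    CorrectiveAction("Replace failed switch supervisor", "Network team",
                     date(2024, 1, 20), done=True),
    CorrectiveAction("Add a retry limit to the database reconnect logic",
                     "Application team", date(2024, 1, 31)),
    CorrectiveAction("Alert on lost database connections", "Operations",
                     date(2024, 2, 15)),
]

late = overdue(actions, today=date(2024, 2, 1))
```

With the sample data, only the retry limit task comes back as overdue on February 1, since the switch replacement is already done and the alerting task is not yet due.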

It’s very important that sufficient time be put into developing the RCA and associated corrective action plans.  These documents have a way of taking on a life of their own, and often find their way into the hands of internal or external auditors.

Be fully truthful, and not alarmist or inflammatory, in your analysis.

How an organization reacts to a crisis is very important, and the RCA is a big part of it.

Article originally appeared on Gary L Kelley (http://garylkelley.com/).
See website for complete article licensing information.