

Monday, June 18, 2012

Documenting Root Cause Analysis

Inevitably, in the world of systems, something will break, and a “Root Cause Analysis (RCA),” “Incident Analysis” or “After Actions” document will need to be written.  Many otherwise capable IT types freeze at the very thought of documenting an issue, so in this post we’ll cover an easy format to follow.

Documenting root cause analysis around an incident starts with keeping good notes during the incident.  I jot down the time and any facts I want to remember for later.  Any metrics pertinent to the issue should also be recorded (such as transaction volumes, CPU usage, throughput or impacted systems/users).

There are five major sections to an RCA document.  We’ll explore each in detail:

  • Executive Summary – This is the high-level version of what happened.  Since this goes to executives, and is often the only thing they’ll read, it needs to be clear, concise, and jargon-free.  I find it useful to assume the executive reading it may not have a technical background.
    • While this is always the first thing in an RCA document, I find it is often easier to write it last, once all the pertinent facts are understood.
  • Impact – Identify the impact in terms business people can relate to.  Some organizations count user outage minutes (number of users x length of outage); others state the effect directly, such as “not able to process any orders for 30 minutes.”  Some businesses will sustain only minor impact from an outage if their customers are captive (such as online banking being down for a bank).  Recurring issues, however, will impact the business.
  • Timeline – The timeline needs to show the major activities from the beginning of the issue to the resolution/mitigation.  The notes taken during the event are useful, and log entries in systems, notes in service desk systems, and emails often help with time stamping.

Depending on the duration of the issue, the amount of detail included in the timeline will need to be adjusted.  A second by second analysis isn’t needed unless relevant to the issue.

Once the timeline is constructed, review it for any improvement opportunities.  Large incidents often take time to “declare” because the engineers are looking at individual symptoms and not gaining insight into overall patterns.  There are often very valuable learnings obtained from timeline analysis.
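As a rough illustration of the Impact and Timeline sections, here is a minimal Python sketch.  It merges timestamped notes and log entries into one timeline, then derives the minutes taken to “declare” the incident and the user outage minutes (number of users x length of outage).  Everything in it is an assumption made for illustration: the names TimelineEntry, build_timeline, minutes_between and user_outage_minutes, the source labels, the timestamps, and the count of 250 users.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEntry:
    when: datetime
    source: str  # illustrative labels such as "notes" or "syslog"
    what: str


def build_timeline(*sources):
    """Merge entries from several sources into one chronological list."""
    merged = [entry for source in sources for entry in source]
    return sorted(merged, key=lambda e: e.when)


def minutes_between(start, end):
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


def user_outage_minutes(users_impacted, outage_minutes):
    """Number of users x length of outage."""
    return users_impacted * outage_minutes


# Hypothetical data, for illustration only.
notes = [
    TimelineEntry(datetime(2024, 1, 15, 9, 40), "notes", "Incident declared"),
    TimelineEntry(datetime(2024, 1, 15, 10, 15), "notes", "Service restored"),
]
logs = [
    TimelineEntry(datetime(2024, 1, 15, 9, 5), "syslog", "First error logged"),
]

timeline = build_timeline(notes, logs)
for entry in timeline:
    print(f"{entry.when:%H:%M}  [{entry.source}]  {entry.what}")

# In this sample data the second entry is the declaration.
time_to_declare = minutes_between(timeline[0].when, timeline[1].when)
outage = minutes_between(timeline[0].when, timeline[-1].when)
print("Minutes to declare:", time_to_declare)
print("User outage minutes:", user_outage_minutes(250, outage))
```

A gap like the one between the first logged error and the declaration is the kind of improvement opportunity a timeline review tends to surface.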

  • Issues – When a vendor is asked for a Root Cause Analysis, they often identify a single topic and the associated root cause.  While important, there are often many issues in a given incident, and executive management will look to the author (and/or team) to provide all issues.

On any given issue, engineers often provide a first-order analysis and stop before identifying the root cause.  “High CPU” may explain a performance issue, but it is rarely the root cause.

To get to the root cause, one technique is to ask “WHY” five (or more) times.

For example….

Problem:  poor performance

1 Why – High CPU

2 Why – The application was in a loop

3 Why – The database connection was lost, and the application kept retrying

4 Why – The network had an issue

5 Why – Switch supervisor failure

Only when the answers to the “whys” are exhausted will the root cause become apparent, and only then can a corrective action plan be put into place.
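The “why” chain above can also be captured in a small structure so it drops straight into the Issues section of the document.  This is only a sketch: the function name five_whys and the “Candidate root cause” label are my own assumptions, and the last answer is only a candidate until the whys really are exhausted.

```python
def five_whys(problem, answers):
    """Format a problem and its chain of 'why' answers.

    The last answer is reported as the candidate root cause.
    """
    if not answers:
        raise ValueError("at least one 'why' answer is needed")
    lines = [f"Problem: {problem}"]
    for depth, answer in enumerate(answers, start=1):
        lines.append(f"{depth} Why - {answer}")
    lines.append(f"Candidate root cause: {answers[-1]}")
    return lines


# The chain from the example above.
chain = [
    "High CPU",
    "The application was in a loop",
    "The database connection was lost, and the application kept retrying",
    "The network had an issue",
    "Switch supervisor failure",
]

report = five_whys("poor performance", chain)
print("\n".join(report))
```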

BTW…it’s my experience that the most common RCA from a communications carrier is NTF (No Trouble Found).

  • Corrective Action Plan/Mitigations – With root cause in hand and clarity around the issues, a corrective action plan can be devised.  As with any plan, the task, duration and resource should be identified for each item.  Sometimes the corrective action will already be completed; other times it will spawn a project (often because of a budget consideration).

Tasks from Corrective Action Plans need to be managed like any effort.
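One way to keep those tasks visible is to record each with its task, duration and resource, and to list what is still open.  The sketch below is illustrative only: the CorrectiveAction fields, the open_actions helper, and the sample tasks, durations and teams are all hypothetical, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass
class CorrectiveAction:
    task: str
    duration_days: int  # estimated effort; the unit is an assumption
    resource: str       # person or team responsible
    done: bool = False


def open_actions(plan):
    """Return the actions that still need to be tracked to completion."""
    return [action for action in plan if not action.done]


# Hypothetical plan, for illustration only.
plan = [
    CorrectiveAction("Replace failed switch supervisor", 2, "Network team", done=True),
    CorrectiveAction("Add retry back-off to the application", 10, "Application team"),
    CorrectiveAction("Budget for a redundant supervisor", 30, "IT management"),
]

for action in open_actions(plan):
    print(f"{action.task} | {action.duration_days} days | {action.resource}")
```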

It’s very important that sufficient time be put into developing the RCA and associated corrective action plans.  These documents have a way of taking on a life of their own, and often find their way into the hands of internal or external auditors.

Be fully truthful, and not alarmist or inflammatory, in your analysis.

How an organization reacts to a crisis is very important, and the RCA is a big part of it.