
Entries in incident (3)

Monday, May 6, 2013

The Importance of Staff & Shifts

In the course of our business, we see many data center and application migrations, as well as high-severity issues.  One observation we always share with our clients is to plan for staff rotation.  As you might expect, some listen and others do not. Here’s why it’s important.

Migrations often happen overnight…when the business sleeps or operates at a lower activity level.  Organizations without satisfactory disaster recovery plans often incur an outage to do a migration.  People are resilient for so many hours, and then they crash. 

What often happens in migrations is everyone wants to be at the starting line, and the adrenaline keeps them engaged.  If shifts are not “forced,” then there is often nobody left with “gas in their tank” to troubleshoot issues.  People simply have to disengage to be fresh.

We saw this at a large customer where the team had persevered, declared success, and then dragged themselves home.  There was an issue, and the on-call engineer was unwilling to make changes because he didn’t understand the changes that had taken place (a change management issue.)  NOBODY involved was responding to calls.  As it turned out, the group’s manager lived in my town, and I got to knock on his door at 10:00 AM on a Sunday morning.  His wife wasn’t happy (he had been up all night), but she did indeed get him up.  He resolved the issue, but a few months later he resigned and went to work at a different company. 

In this case, the team was not structured to handle a multi-day issue, and the response was poor.

In another case, a new virus definition in a client’s antivirus system determined the operating system was bad and quarantined it.  The client had a policy to delete quarantined files, so with the speed of automation thousands of operating systems were deleted.

The senior manager quickly determined this would require a sustained 24/7 response, and teams were “nominated” to cover 12-hour shifts.  We were asked to help on a sustained basis, providing process oversight and keeping the shift turnovers crisp.

To the credit of the senior manager, this approach allowed a sustained response as systems were recovered from (gasp!) tape.

Large IT shops often run with multiple shifts and a technical response is more organic.  Smaller shops tend to have an operational capability 24x7, and may lack the detailed technical response.

When planning or reacting to major events, think in terms of how to rotate your staff for a sustained time.
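
As a rough illustration of what a forced rotation can look like on paper, here is a minimal sketch that lays out fixed-length shifts across a set of teams. The function name `build_rotation`, the team names, and the start time are all made up for the example; a real plan would also need overlap for turnovers and named on-call contacts.

```python
from datetime import datetime, timedelta


def build_rotation(teams, start, shift_hours, total_hours):
    """Return (shift_start, shift_end, team) tuples that cycle through the teams."""
    if not teams:
        raise ValueError("at least one team is required")
    if shift_hours <= 0:
        raise ValueError("shift_hours must be positive")

    schedule = []
    elapsed = 0
    index = 0
    while elapsed < total_hours:
        # The last shift is shortened if the event ends mid-shift.
        length = min(shift_hours, total_hours - elapsed)
        shift_start = start + timedelta(hours=elapsed)
        shift_end = shift_start + timedelta(hours=length)
        schedule.append((shift_start, shift_end, teams[index % len(teams)]))
        elapsed += length
        index += 1
    return schedule


if __name__ == "__main__":
    # Hypothetical teams and start time, for illustration only.
    plan = build_rotation(["Team A", "Team B"], datetime(2013, 5, 4, 18, 0), 12, 48)
    for shift_start, shift_end, team in plan:
        print(f"{shift_start:%a %H:%M} to {shift_end:%a %H:%M}: {team}")
```

Writing the schedule down before the event starts is the point: everyone knows when they are expected to leave, not just when they are expected to show up.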

Monday, June 18, 2012

Documenting Root Cause Analysis

Inevitably in the world of systems something will break and a “Root Cause Analysis (RCA),” “Incident Analysis” or “After Actions” document will need to be written.  Many otherwise capable IT types often freeze at the very thought of documenting an issue, and in this post, we’ll cover an easy format to follow.

Documenting root cause analysis around an incident starts with keeping good notes during an incident.  I jot down the time and any facts I want to remember for later.  Any metrics pertinent to the issue should also be recorded (such as transaction volumes, CPU usage, throughput or impacted systems/users.)

There are five major sections to an RCA document.  We’ll explore each in detail:

  • Executive Summary – This is the high level version of what happened.  Since this goes to executives, and many times is the only thing they’ll read, it needs to be clear, concise, and jargon free.  I find it is useful to assume the executive reading this may not have a technical background, so keeping it high level helps.
    • While this is always the first thing in an RCA document, I find it is often easier to write this last…once all the pertinent facts are understood.
  • Impact – Identify the impact in terms business people can relate to.  Some organizations count user outage minutes (number of users x length of outage); others state it plainly, such as “not able to process any orders for 30 minutes.”   Some businesses will sustain only minor impact from an outage if their customers are captive (such as online banking being down for a bank.)  Recurring issues, however, will impact the business.
  • Timeline – The timeline needs to show the major activities from the beginning of the issue to the resolution/mitigation.  While the notes taken during the event are useful, any log entries in systems, notes in service desk systems, or emails are often useful for time stamping.

Depending on the duration of the issue, the amount of detail included in the timeline will need to be adjusted.  A second by second analysis isn’t needed unless relevant to the issue.

Once the timeline is constructed, review it for any improvement opportunities.  Large incidents often take time to “declare” because the engineers are looking at individual symptoms and not gaining insight into overall patterns.  There are often very valuable learnings obtained from timeline analysis.
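
To make the timeline and impact ideas concrete, here is a minimal sketch that merges timestamped entries from personal notes and system logs into one timeline, then derives time-to-declare and user outage minutes from it. The function names, the sample entries, and the user count are all invented for illustration; real entries would come from your own notes, service desk tickets, and logs.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge (timestamp, description) entries from several sources, oldest first."""
    merged = [entry for source in sources for entry in source]
    return sorted(merged, key=lambda entry: entry[0])


def minutes_between(earlier, later):
    """Elapsed minutes between two datetimes."""
    return (later - earlier).total_seconds() / 60


def user_outage_minutes(users_affected, outage_minutes):
    """Impact measure: number of users x length of outage."""
    return users_affected * outage_minutes


if __name__ == "__main__":
    # All entries below are invented for illustration.
    notes = [
        (datetime(2012, 6, 18, 9, 5), "Incident declared"),
        (datetime(2012, 6, 18, 9, 40), "Service restored"),
    ]
    logs = [
        (datetime(2012, 6, 18, 8, 50), "First error in application log"),
    ]

    timeline = build_timeline(notes, logs)
    for stamp, description in timeline:
        print(f"{stamp:%H:%M}  {description}")

    # Gap between the first symptom and the declaration.
    time_to_declare = minutes_between(timeline[0][0], timeline[1][0])
    # Total outage, from first symptom to restoration.
    outage = minutes_between(timeline[0][0], timeline[-1][0])
    print("Minutes to declare:", time_to_declare)
    print("User outage minutes:", user_outage_minutes(200, outage))
```

The gap between the first symptom and the declaration is usually where the improvement opportunities show up.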

  • Issues – When a vendor is asked for a Root Cause Analysis, they often identify a single topic and the associated root cause.  While important, there are often many issues in a given incident, and executive management will look to the author (and/or team) to provide all issues.

On any given issue, engineers often provide a first-order analysis and have not yet identified the root cause.  “High CPU” is rarely the root cause of a performance issue.

To get to the root cause, one technique is to ask “WHY” five (or more) times.

For example….

Problem:  poor performance

1 Why – High CPU

2 Why – The application was in a loop

3 Why – The database connection was lost, and the application kept retrying

4 Why – The network had an issue

5 Why – Switch supervisor failure

Only when the answers to the “whys” are exhausted will the root cause become apparent, and only then can a corrective action plan be put into place.
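
If you want to keep the chain of whys alongside your incident notes, one simple option is an ordered list where the last answer is the candidate root cause. This is only a sketch using the example above; the function names are hypothetical, and the code cannot tell you whether you have asked “why” enough times.

```python
def format_chain(problem, whys):
    """Render the problem and each answer, numbered in the order it was asked."""
    lines = [f"Problem: {problem}"]
    for number, answer in enumerate(whys, start=1):
        lines.append(f"{number} Why - {answer}")
    return "\n".join(lines)


def candidate_root_cause(whys):
    """The last answer in the chain is the candidate root cause."""
    if not whys:
        raise ValueError("no whys recorded yet")
    return whys[-1]


if __name__ == "__main__":
    whys = [
        "High CPU",
        "The application was in a loop",
        "The database connection was lost, and the application kept retrying",
        "The network had an issue",
        "Switch supervisor failure",
    ]
    print(format_chain("poor performance", whys))
    print("Candidate root cause:", candidate_root_cause(whys))
```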

BTW…it’s my experience the most common RCA from a communications carrier is NTF (No Trouble Found.) 

  • Corrective Action Plan/Mitigations – With root cause in hand and clarity around the issues, a corrective action plan can be devised.  As with any plan, the task, duration and resource should be identified.   Sometimes the corrective action will be completed; other times it will spawn a project (often related to a budget consideration.)

Tasks from Corrective Action Plans need to be managed like any effort.

It’s very important sufficient time be put into developing the RCA and associated corrective action plans.  These documents have a way of taking on a life of their own, and often find their way into internal or external auditor hands.

Be fully truthful, and not alarmist or inflammatory, in your analysis.

How an organization reacts to a crisis is very important, and the RCA is a big part of it.

Monday, December 5, 2011

When to Declare an Incident

I was sitting with a client recently having a project discussion when a problem with email was called in.

No big deal.  Large companies have problems every day.

A few minutes later another interruption, this time for phones in a remote office.  OK, this is why we have staff.

And then another issue, this time with remote access.

The client was calmly processing these facts, and continuing the conversation.  Perhaps they felt obligated with me being there.  I interrupted, and said, “All these problems.  Something larger must be going on.  Should you declare an incident?”

Large IT shops are very familiar with large scale incidents, and have well-honed incident management processes.  When there are major issues (like a total processing failure, or a major business impacting outage) the incident processes are automatically implemented.

It’s incidents in the grey where companies are sometimes hesitant to invoke the incident management process, which often involves many people and full notifications to the business.

While there are books written around how to do incident management, when to declare an incident isn’t uniformly understood.  Why?  Because in IT every day we have a multitude of issues dealt with in the normal course of business.  Declaring an incident is often viewed as a “big deal.”

It is a big deal.  The best and brightest drop what they are doing and focus on the issue. 

The catastrophic failures are easy…instant incident.

On others, tickets are often being opened at the help desk(s) and routed for resolution.  If you’ve ever been around a help desk, you know that if one minute they are busy and the next they are slammed, there is something going on.

When considering declaring an incident, POTENTIAL BUSINESS IMPACT is my metric.

So, one desktop down is important; 100 desktops down in a call center is a big deal.

A one person sales office having a phone issue is bad, and a 250 person HQ being without phones is really bad.
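
Here is one way the comparisons above could be written down as a rule of thumb. It is only a sketch: the `Issue` fields, the function names, and every threshold are assumptions I picked for the example, and each organization would need to set its own. It flags an incident when a single issue hits a large share of a sizable site, or when several different services are reporting problems at the same time.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    service: str          # e.g. "email", "phones", "remote access"
    users_affected: int   # people who cannot work because of this issue
    users_at_site: int    # total people at the affected site


def impact_fraction(issue):
    """Share of the site's users affected by this issue."""
    if issue.users_at_site <= 0:
        return 0.0
    return issue.users_affected / issue.users_at_site


def should_declare(issues, min_users=50, fraction_threshold=0.5, distinct_services=3):
    """Hypothetical rule: declare on widespread impact or many services failing at once."""
    widespread = any(
        issue.users_affected >= min_users
        and impact_fraction(issue) >= fraction_threshold
        for issue in issues
    )
    services_with_problems = {issue.service for issue in issues}
    return widespread or len(services_with_problems) >= distinct_services


if __name__ == "__main__":
    print(should_declare([Issue("desktop", 1, 100)]))     # one desktop down
    print(should_declare([Issue("desktop", 100, 100)]))   # a whole call center
    print(should_declare([Issue("phones", 1, 1)]))        # one person sales office
    print(should_declare([Issue("phones", 250, 250)]))    # 250 person HQ
```

A rule like this will never replace judgment, but having numbers agreed on in advance makes it easier for someone to raise their hand.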

The way to minimize business impacts is through contingency planning.  If one facility is down, the business/traffic is routed somewhere else. 

Large companies do this as a matter of course; smaller companies often don’t feel they have the mass to successfully pull it off.

Declaring an incident should be celebrated as a way to get others to help quickly mitigate business impacts.

When have you “declared an incident?”