Gary L Kelley - Journal

It’s 4:00AM and your primary storage array just failed. That’s particularly concerning in a financial services company, where Fixed Income traders start working at 7:00AM. You convene your incident management team and the decision is to fail over. You inform your system and database administrators, and their answer is that it might be riskier to fail over than to fix the problem.

Argh!

We invest millions of dollars in redundant systems to achieve high availability or disaster recovery, and in some companies we are too scared to use them. We spend years perfecting software configurations and hardware clusters for failover but never feel comfortable enough to initiate a failover. Hardware and software companies sell us new technologies with the promise of “5-nines” reliability, but nobody has factored in human emotion and fear of failure.

I was always very proud of my messaging (email, instant messaging, unified communication, etc.) team with regard to their use of clusters and failover. They convinced me the email system could be made highly available with minimal time to failover, data loss, and time to failback, but what really impressed me was how they used these capabilities on a daily basis. They changed their release management processes to include failover of the active server, thus eliminating all downtime. They practiced failing over under different scenarios and always knew exactly how much time it would take (3 minutes) to bring up the passive node. Every year, they found ways to enhance those capabilities and to apply what they had learned to improving our disaster recovery capabilities. When we decided to distribute our centralized email system to our global offices, they gave management the option of local failover, failover to our primary data center, or both. In short, they got it!
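To put the “5-nines” promise next to that 3-minute figure, here is a rough back-of-the-envelope sketch in Python. It assumes a 365-day year and, pessimistically, that every minute of a failover counts as full downtime; the function names are my own illustration, not part of any product.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability)


def failovers_within_budget(availability: float, failover_minutes: float) -> int:
    """Whole failovers of the given length that fit in the yearly budget.

    Assumption: the service is completely unavailable for the whole failover.
    """
    return int(downtime_budget_minutes(availability) // failover_minutes)


if __name__ == "__main__":
    budget = downtime_budget_minutes(0.99999)  # "5-nines"
    print(f"Yearly downtime budget: {budget:.2f} minutes")
    print("3-minute failovers that fit:", failovers_within_budget(0.99999, 3.0))
```

Under those assumptions, 5-nines leaves a little over five minutes a year, so a single 3-minute failover fits but a second one does not. A team that does not know its failover time cannot even make that comparison.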

As a manager, you want to believe in your people, your vendors, and the technology choices you have made for your infrastructure. The truth is, you can believe, so long as you help people get over their fear. The messaging team used the technologies as part of their operational routines. Things matured to the point where they might fail over a server or virtual server at their own discretion. They earned the trust of the management team. I remember the day the entire IT management team was on a conference call counting down the minutes for the first production failover of the server. After that, it was no longer an event.

Years ago, we ran our investment business on two large (for their time) super mini-computers. All data, programs, and systems management tools (schedulers, backup, monitoring) were fully replicated on each system, and our computer operators balanced the load between the two. Even though it required manual intervention, it was our earliest form of high availability, and it worked.

High availability architectures are beneficial, so long as they are used frequently. Allowing them to sit idle deepens the fear within the support staff. Using them gives your staff the opportunity to improve. Making failover part of your operational procedures reduces the need for disaster recovery testing, improves uptime, and can ease the pain of release management.

The goal of all Infrastructure & Operations organizations is 100% uptime. Exercising your high-availability architectures will bring your systems close to this goal and improve the performance of your staff.