Journal - Gary L Kelley


Entries in Resiliency (3)


Can Systems Be Made Resilient?

Monday, October 11, 2010 at 5:07PM

I recently read an article about Google’s new self-driving car, and I was intrigued by a reference to the requirement that the computer hardware and software running the car be completely resistant to failure. Put another way, a “blue screen of death” while in motion could turn out to be literally deadly.

I believe hardware and software can be made 100% resilient to failure, so why, as infrastructure professionals, do we never witness a truly resilient system?

Well, we do. A few devices with resilient hardware and software systems are:

Apollo 11
My car
TiVo systems (note, I did not say Comcast or FiOS DVRs)
Calculators

When something works well, we should take the time to examine how to replicate the processes that led to the better deliverable. Everything we do should include “lessons learned,” and every lesson learned should result in a project to make things better. Problems will diminish and our focus will shift to developing value-added business processes, rather than fixing what is broken.

In an article called “The Infrastructure Economics Breakthrough,” in this month’s Wall Street & Technology, Howard Rubin posits that infrastructure professionals have yet to deliver high-quality infrastructure for less money. The panacea of “scale” we all talk about has not been attained, and we [infrastructure folks] prevent investment in business deliverables that could drive higher profits.

Maybe one of the reasons we can’t optimize the infrastructure is that we are too busy fixing the hardware and software designed to give us an optimized infrastructure. When was the last time you implemented something that didn’t require a fix, patch, or an “enhancement” to get it to work correctly? We all know not to implement a “zero point release” of a product. We will let someone else shake out the bugs, and only then will we consider planning for an implementation. That doesn’t sound productive, and you know someone is going through the pain of being the early adopter.

I have a solution. Let’s ask Sam Palmisano, Larry Ellison, Eric Schmidt, Steve Ballmer, and other hardware/software CEOs to drive cars whose hardware and software are built to the same level of quality as the products they release. I bet they would think twice, and maybe we would see an improved focus on resilient systems.

Gary L Kelley



Robust Software

Wednesday, May 26, 2010 at 8:31AM

We use Microsoft Exchange as our email platform and purchase it as a service from a leading (and very large) hosting company. In general, it works well, but they do seem to have many small outages. Our clients also use Microsoft Exchange, but internally hosted. They do not seem to have as many outages.

As we see more and more people working from home (be they employees or consultants), we are challenged with maintaining a productive work environment. Downtime from telecom vendors and home PCs, software, and devices is high. Each of those is a single point of failure and therefore a potential source of outages. Consumers have accepted a lower standard of reliability than companies. How much time is wasted because software does not work properly?

Why should we be concerned with single points of failure?

Businesses invest a large amount of money in high availability to eliminate single points of failure. They do this to avoid user impact from a hardware or software failure. I fully appreciate the need for redundancy for hardware failures, but I can’t understand why we should have to pay for software companies’ inability to build software that will not fail.
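A rough way to see why redundancy is bought in the first place: if we assume component failures are independent (a simplifying assumption, not something measured here), components that must all work multiply their availabilities, while redundant copies multiply their unavailabilities. The function names and the 99% figures below are hypothetical, chosen only for illustration.

```python
from math import prod


def series_availability(availabilities):
    """Every component is a single point of failure: all must be up."""
    return prod(availabilities)


def parallel_availability(availability, copies):
    """Redundant, independent copies: the service is up if any one copy is up."""
    return 1 - (1 - availability) ** copies


# Hypothetical figures: a home PC, a home router, and a telecom link,
# each assumed to be available 99% of the time.
chain = series_availability([0.99, 0.99, 0.99])
redundant_pair = parallel_availability(0.99, 2)

print(f"chain of three single points of failure: {chain:.4f}")
print(f"one component with a redundant twin:     {redundant_pair:.4f}")
```

Under these assumptions the chain of three drops to roughly 97% while the redundant pair rises to 99.99%. Note that this arithmetic only helps with independent faults; a software bug present in both copies fails both at once, which is the complaint above.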

Why build robust software?

I would postulate that software companies have no incentive to prevent software from failing. We have come to accept a level of failure from all software and have also come to accept our software vendors’ weaknesses. We also accept that we should pay software vendors an annual maintenance fee (around 20%) to fix problems in their software. Shouldn’t they incur the cost of fixing those problems? Is it that they don’t make enough money?

In addition to maintenance fees, we (as buyers) are not willing to pay for more robust software, but we are willing to pay for more features. If you are a software company, where are you going to invest?

What can be done?

This is a wide-open question and one that is very difficult. In many areas quality standards are high or there is regulation protecting us. You don’t hear (too often) about medical devices failing or having bugs. That would be bad. There is now talk of more rigid testing standards for on-board computers in cars (post Toyota debacle). This is a good thing.

Software defects are only tracked internally by vendors. Wouldn’t it be great to see the count posted on their web sites? Wouldn’t it also be great for software companies to admit when they have a problem? When was the last time Microsoft admitted it had a bug in a product? I remember working for a hardware vendor (years ago) where the acceptable level of bugs for release of an operating system was 10,000! Bringing transparency and awareness to the issue will help to make software companies accountable.

As consumers, all we can do is vote with our dollars. Unfortunately, our choices are limited. There is little real competition in the software industry.

There is one solution: Linux. Linux software is very robust, efficient, and functional. The reason for this is simple. Development and selection of features is a community process driven by consumers. Who wouldn’t put robustness, efficiency (small footprint), and cost as high priorities? They can’t be captured in a screen shot, but they make our lives more productive.

Matt Ferm


Fear Factor

Tuesday, October 20, 2009 at 6:09PM

It’s 4:00AM and your primary storage array just failed. That’s particularly concerning in a financial services company. Fixed Income traders start working at 7:00AM. You convene your incident management team and the decision is to fail over. You inform your system and database administrators, and their answer is that it might be riskier to fail over than to fix the problem.

Argh!

We invest millions of dollars in redundant systems to achieve high availability or disaster recovery, and in some companies we are too scared to use them. We spend years perfecting software configurations and hardware clusters for failover but never feel comfortable enough to initiate a failover. Hardware and software companies sell us new technologies with the promise of “5-nines” reliability, but nobody has factored in human emotion and fear of failure.
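To put the “5-nines” promise in perspective, here is a minimal sketch of the arithmetic, assuming “n nines” means availability of 1 - 10^-n measured over a 365-day year. The function name is hypothetical and the definition is the common convention, not any particular vendor’s contract terms.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_budget_minutes(nines):
    """Minutes of downtime per year allowed at 'nines' nines of availability."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {downtime_budget_minutes(n):.1f} minutes of downtime per year")
```

Five nines leaves a little over five minutes of downtime for the whole year, which is why a team that hesitates for an hour over whether to fail over has already spent the budget many times over.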

I was always very proud of my messaging (email, instant messaging, unified communication, etc.) team with regard to their use of clusters and failover. They convinced me the email system could be made highly available with minimal time to failover, data loss, and time to failback, but what really impressed me was how they used these capabilities on a daily basis. They changed their release management processes to include failover of the active server, thus eliminating all downtime. They practiced failing over under different scenarios and always knew exactly how much time it would take (3 minutes) to bring up the passive node. Every year, they found ways to enhance those capabilities and leverage what they had learned towards improving our disaster recovery capabilities. When we decided to distribute our centralized email system to our global offices, they gave management the option of local failover, failover to our primary data center, or both. In short, they got it!

As a manager, you want to believe in your people, vendors, and the technology choices you have made for your infrastructure. The truth is, you can believe, so long as you help people get over their fear. The messaging team used the technologies as part of their operational routines. Things matured to the point where they might fail over a server or virtual server at their own discretion. They earned the trust of the management team. I remember the day the entire IT management team was on a conference call counting down the minutes for the first production failover of the server. After that, it was no longer an event.

Years ago, we ran our investment business on two large (for their time) super minicomputers. All data, programs, and systems management tools (schedulers, backup, monitoring) were fully replicated on each system, and our computer operators load balanced between the systems. Even though it required manual intervention, it was our earliest form of high availability, and it worked.

High availability architectures are beneficial, so long as they are used frequently. Allowing them to sit idle deepens the level of fear within the support staff. Using them gives your staff the opportunity for improvement. Making failover part of your operational procedures reduces the need for Disaster Recovery testing, improves uptime, and can ease the pain of release management.

The goal of all Infrastructure & Operations organizations is 100% uptime. Exercising your high-availability architectures will bring your systems close to this goal and improve the performance of your staff.

Matt Ferm
