Entries in Resiliency (3)

Monday
Oct 11, 2010

Can Systems Be Made Resilient?

I recently read an article about Google’s new self-driving car, and I was intrigued by a reference to the requirement that the computer hardware and software running the car be completely resistant to failure. In so many words, a “blue screen of death” while in motion would probably lead to a deadly blue screen of death.



I believe hardware and software can be made 100% resilient to failure, so why, as infrastructure professionals, do we never witness a truly resilient system?

Well, we do. A few devices with resilient hardware and software systems are:


  • Apollo 11

  • My car

  • TiVo systems (note, I did not say Comcast or FiOS DVRs)

  • Calculators

When something works well, we should take the time to examine how to replicate the processes that led to a better deliverable. Everything we do should include “lessons learned,” and every lesson learned should result in a project to make things better. Problems will diminish and our focus will shift to developing value-added business processes, rather than fixing what is broken.

In an article called “The Infrastructure Economics Breakthrough,” in this month’s Wall Street & Technology, Howard Rubin posits that infrastructure professionals have yet to deliver high quality infrastructure for less money. The panacea of “scale” we all talk about has not been attained and we [infrastructure folks] prevent investment in business deliverables that could drive higher profits.

Maybe one of the reasons we can’t optimize the infrastructure is that we are too busy fixing the hardware and software designed to give us an optimized infrastructure. When was the last time you implemented something that didn’t require a fix, patch, or an “enhancement” to get it to work correctly? We all know not to implement a “zero point release” of a product. We will let someone else shake out the bugs, and only then will we consider planning for an implementation. That doesn’t sound productive, and you know someone is going through the pain of being the early adopter.

I have a solution. Let’s ask Sam Palmisano, Larry Ellison, Eric Schmidt, Steve Ballmer, and other hardware/software CEOs to drive cars whose hardware and software are at the same level of quality at which they release their products. I bet they would think twice, and maybe we would see an improved focus on resilient systems.

Wednesday
May 26, 2010

Robust Software

We use Microsoft Exchange as our email platform and purchase it as a service from a leading (and very large) hosting company. In general, it works well, but they do seem to have many small outages. Our clients also use Microsoft Exchange, but internally hosted. They do not seem to have as many outages.

As we see more and more people working from home (be they employees or consultants), we are challenged with maintaining a productive work environment. Downtime from telecom vendors and home PCs, software, and devices is high. All of those represent single points of failure and therefore a potential for outages. Consumers have accepted a lower standard of reliability than companies. How much time is wasted because software does not work properly?

Why should we be concerned with single points of failure?

Businesses invest a large amount of money in high availability to eliminate single points of failure. They do this to avoid user impact from a hardware or software failure. I fully appreciate the need for redundancy for hardware failures, but I can’t understand why we should have to pay for software companies’ inability to build software that will not fail.
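The cost of a single point of failure can be sketched with the standard availability arithmetic. This is a rough illustration that assumes component failures are independent; the 99% figures and the function names are made up for the example and are not measurements from any vendor.

```python
from math import prod

def series_availability(components):
    """Availability of a chain in which every component must be up."""
    return prod(components)

def parallel_availability(replicas):
    """Availability of redundant replicas where any one of them is enough."""
    return 1 - prod(1 - a for a in replicas)

# Hypothetical chain of single points of failure:
# telecom link, home PC, software stack, each assumed 99% available.
chain = [0.99, 0.99, 0.99]
single = series_availability(chain)

# The same chain with every component duplicated (assumed independent).
redundant = series_availability(
    [parallel_availability([a, a]) for a in chain]
)

print(f"single chain:    {single:.4%}")
print(f"redundant chain: {redundant:.4%}")
```

Under these assumptions, three 99% components in a row deliver only about 97% availability, while duplicating each one lifts the chain to roughly 99.97%. That gap is what the high-availability spend buys, whether the underlying failures come from hardware or from software.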

Why build robust software?

I would postulate that software companies have no incentive to prevent software from failing. We have come to accept a level of failure from all software and have also come to accept our software vendors’ weaknesses. We also accept that we should pay software vendors an annual maintenance fee (around 20%) to fix problems in their software. Shouldn’t they incur the cost of fixing those problems? Is it that they don’t make enough money?

In addition to maintenance fees, we (as buyers) are not willing to pay for more robust software, but we are willing to pay for more features. If you are a software company, where are you going to invest?

What can be done?

This is a wide-open question and one that is very difficult. In many areas quality standards are high or there is regulation protecting us. You don’t hear (too often) about medical devices failing or having bugs. That would be bad. There is now talk of more rigid testing standards for on-board computers in cars (post Toyota debacle). This is a good thing.

Software defects are only tracked internally by vendors. Wouldn’t it be great to see the count posted on their web site? Wouldn’t it also be great for software companies to admit when they have a problem? When was the last time Microsoft admitted it had a bug in a product? I remember working for a hardware vendor (years ago) where the acceptable level of bugs for release of an operating system was 10,000! Bringing transparency and awareness to the issue will help to make software companies accountable.

As consumers, all we can do is vote with our dollars. Unfortunately, our choices are limited. There is little real competition in the software industry.

There is one solution: Linux. Linux software is very robust, efficient, and functional. The reason for this is simple. Development and selection of features is a community process driven by consumers. Who wouldn’t put robustness, efficiency (small footprint), and cost as high priorities? They can’t be captured in a screenshot, but they make our lives more productive.

Tuesday
Oct 20, 2009

Fear Factor

It’s 4:00 AM and your primary storage array just failed. That’s particularly concerning in a financial services company, where fixed income traders start working at 7:00 AM. You convene your incident management team and the decision is to fail over. You inform your system and database administrators, and the answer is that it might be riskier to fail over than to fix the problem.

Argh!


We invest millions of dollars in redundant systems to achieve high availability or disaster recovery, and in some companies we are too scared to use them. We spend years perfecting software configurations and hardware clusters for failover but never feel comfortable enough to initiate a failover. Hardware and software companies sell us new technologies with the promise of “5-nines” reliability, but nobody has factored in human emotion and fear of failure.
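For a sense of what a “5-nines” promise means in practice, here is a minimal sketch that converts a number of nines into an annual downtime budget. It assumes a 365-day year and counts all downtime equally; the function name is invented for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_budget_minutes(nines):
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability

# Print the budget for two through five nines.
for n in range(2, 6):
    availability = 1 - 10 ** -n
    budget = downtime_budget_minutes(n)
    print(f"{availability:.3%} -> {budget:8.1f} min/year")
```

Five nines works out to a little over five minutes of downtime per year. A team that spends the first hour of an incident debating whether it is safe to fail over has already spent many years' worth of that budget.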

I was always very proud of my messaging (email, instant messaging, unified communication, etc.) team with regard to their use of clusters and failover. They convinced me the email system could be made highly available with minimal time to failover, data loss, and time to failback, but what really impressed me was how they used these capabilities on a daily basis. They changed their release management processes to include failover of the active server, thus eliminating all downtime. They practiced failing over under different scenarios and always knew exactly how much time it would take (3 minutes) to bring up the passive node. Every year, they found ways to enhance those capabilities and apply what they had learned to improving our disaster recovery capabilities. When we decided to distribute our centralized email system to our global offices, they gave management the option of local failover, failover to our primary data center, or both. In short, they got it!

As a manager, you want to believe in your people, your vendors, and the technology choices you have made for your infrastructure. The truth is, you can believe, so long as you help people get over their fear. The messaging team used the technologies as part of their operational routines. Things matured to the point where they might fail over a server or virtual server at their own discretion. They earned the trust of the management team. I remember the day the entire IT management team was on a conference call counting down the minutes for the first production failover of the server. After that, it was no longer an event.

Years ago, we ran our investment business on two large (for their time) super mini-computers. All data, programs, and systems management tools (schedulers, backup, monitoring) were fully replicated on each system, and our computer operators load balanced between the systems. Even though it required manual intervention, it was our earliest form of high availability, and it worked.

High availability architectures are beneficial, so long as they are used frequently. Allowing them to sit idle deepens the level of fear within the support staff. Using them gives your staff the opportunity for improvement. Making failover part of your operational procedures reduces the need for Disaster Recovery testing, improves uptime, and can ease the pain of release management.

The goal of all Infrastructure & Operations organizations is 100% uptime. Exercising your high-availability architectures will bring your systems close to this goal and improve the performance of your staff.