
Entries in Disaster Recovery (4)

Saturday
Nov 3, 2012

IT Operations – The Unsung Heroes

This is a story of how one company and its operations staff kept the lights on in New York City in the aftermath of Hurricane Sandy.

Three blogging websites are under my general control:

  • Garylkelley.com – a site about technology, fatherhood and restaurant reviews
  • Curriculotta.com – an “alter ego” site for the properly starched shirts over at Harvard Partners
  • Markfidrychfoundation.org – a site for furthering the community work of the late ball player Mark Fidrych.

Squarespace is used to “host” these sites.  WordPress was our original choice, and served us for a while.  I was just never a fan of how WordPress “thinks.”  Personally, I prefer Squarespace.

Squarespace uses PEER1 as their co-location provider, located in New York at 75 Broad Street.

You can imagine my personal dismay when Tuesday, October 30 at 11:34AM I got the following message from Squarespace:

I have some unfortunate news to share. Our primary data center, Peer1, in Lower Manhattan lost power yesterday at about 4:30PM local time. At that time, we smoothly made the transition to generator power and took comfort over the fact that we had enough fuel to last three to four days. (Peer1 stayed online during the last 3 major natural disasters in the area, including a blackout that lasted for days.)

At 8:30PM yesterday, we received reports that the lobby in the data center’s building was beginning to take on water. By 10:30PM, as is sadly the case in most of Lower Manhattan, Peer1’s basement had experienced serious flooding. At 5AM, we learned our data center’s fuel pumps and fuel tanks were completely flooded and unable to deliver any more fuel. At 8AM, they reported that the generators would be able to run for a maximum of four more hours.

Unfortunately, this means that Squarespace will be offline soon (our estimate being at 10:45 AM today).

I then did what any IT ops person would do…and notified my users of this outage:


Of course, I then did what any user would do, and emailed Squarespace support (like they had time for me).

Can you guys toss up a graphic of some kind so people accessing my sites won’t get a dns error?

(Also, when you’re back there is nothing to do)?

An amazingly fast 26 minutes later, I had a response:

Great question! We will have a holding page up (hosted outside of Squarespace) that will provide messaging about the downtime. Any customers trying to access sites during that time will see that message. Once we are able to bring the system back up, there will be nothing required of you in order for your website to come back online. We expect sites to be available for another 45 minutes at least and please keep an eye for updates on Twitter (@Squarespace, @Squarespacehelp) as we will be providing updates as regularly as possible from there. 

Hope this helps!

Shaun H

Of course, being rather chatty, I responded with:

I will expect something very creative….

Like an overhead view of Sandy going down a toilet bowl. (Trying to bring a smile to your face during this tough time.)

Think the Squarespace version of the Twitter flying whale.

And again, I had a quick response:

Hey Gary,

That’s a great and hilarious suggestion, thank you :) We will definitely keep you updated on our Twitter accounts and Blog page (for as long as we can):

https://twitter.com/squarespace

and

http://blog.squarespace.com/

Hope this helps.

Paulina V.

What then followed was something I find speaks to the spirit of a team focused on service. 

They carried fuel to the generator on the roof.  17 floors.  All by hand.  Squarespace, their co-lo provider PEER1, another company Fog Creek (an online project management firm for collaborative software development) and some hired contractors carried the fuel to the 17th floor where it could be pumped up to the generator on the roof (18th floor).

Ok, let’s do some math.  According to PEER1’s Meredith Eaton, a company spokeswoman, the generator’s consumption rate was about 40 gallons/hour.  That’s eight 5 gallon pails an hour.  At 7.15 pounds/gallon diesel, that’s 286 pounds an hour up 17 floors.  And they did this for a couple days…so at 48 hours this is 13,728 pounds of fuel, or nearly 7 US tons of fuel.
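For anyone who wants to check the arithmetic, here is a minimal sketch in Python.  The consumption rate, pail size, diesel weight and 48 hour duration come from the paragraph above; the function and variable names are my own, made up purely for illustration.

```python
# Back-of-the-envelope check of the bucket brigade math.
# Figures are the ones quoted in the post; names are illustrative only.

GALLONS_PER_HOUR = 40        # generator consumption, per PEER1's spokeswoman
PAIL_GALLONS = 5             # size of each pail carried up the stairs
POUNDS_PER_GALLON = 7.15     # approximate weight of diesel used in the post
POUNDS_PER_US_TON = 2000


def fuel_haul(hours=48):
    """Return pails/hour, pounds/hour, total pounds and US tons for the haul."""
    pails_per_hour = GALLONS_PER_HOUR / PAIL_GALLONS
    pounds_per_hour = GALLONS_PER_HOUR * POUNDS_PER_GALLON
    total_pounds = pounds_per_hour * hours
    return {
        "pails_per_hour": pails_per_hour,
        "pounds_per_hour": round(pounds_per_hour, 2),
        "total_pounds": round(total_pounds, 2),
        "us_tons": round(total_pounds / POUNDS_PER_US_TON, 2),
    }


if __name__ == "__main__":
    for name, value in fuel_haul().items():
        print(f"{name}: {value}")
```

Change the `hours` argument to see how quickly the load grows if the outage had dragged on.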

The following pictures are used with permission of Squarespace:

Thirsty generator, on the roof above 17 floors

Basement level, where the fuel is supposed to be stored

 

Diesel fuel on street waiting for a lift

Part of the bucket brigade

Now, many would argue Squarespace would be better off with a second data center somewhere with automated failover.  That would carry an increased cost, something this author wouldn’t be willing to pay for.  Disaster recovery desires must be analyzed in light of the costs.  It’s almost laughable to consider a Recovery Time Objective or Recovery Point Objective for these blogs.  If they are down for days, frankly it wouldn’t matter.  These blogs are not time sensitive; the closest thing to a financial impact is the Mark Fidrych Foundation’s ability to take donations (which I encourage you to use!).

So due to the heroic efforts of the unsung IT Operations and associated people, PEER1 stayed up, and you are able to enjoy reading this post.

My hat goes off to these people who persevered, with determination and grit, to keep the site going.  In a word, amazing.  I find IT organizations do this often to keep the ship afloat, often without complaint.

Will we continue hosting on Squarespace?  You betcha.

What stories do you have of heroic IT efforts?

One midnight shot of a total bucket brigade

Monday
Oct 22, 2012

Disaster Recovery vs. SHTF planning

Maybe it is the economy, or maybe the election.  Increasingly I am seeing people talking about “prepping” for when we descend into anarchy.

From SHTFPLAN - 8 Reasons Why The Great Depression Is The Best Case Scenario

The number of websites for this is impressive…here is a sample:

 There are also people analyzing where to go and publishing books on it.

Far be it from me to criticize these people.  Heck, they might be right.  That said, if you really believe the end is coming, you’d buy a piece of land and go “off the grid.”

It’s hard to fathom what off the grid would be like.  While it may be fun to flirt with the idea of living simply, the truth is the vast majority of us would struggle to do so.

Where is this rambling post going?

It’s about what we can do to make sure we are not thrown back into the dark ages. 

So many companies still believe Disaster Recovery for Systems is optional.   That’s just irresponsible.

DR plans need to be put together, and tested, so outages are minimized.  Plans need to meet the needs of the business in terms of a Recovery Point Objective (RPO) and a Recovery Time Objective (RTO).  This all starts with a Business Impact Analysis…understanding the impacts of outages.
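To make RPO and RTO concrete, here is a rough sketch in Python of how a DR test result might be scored against them.  It assumes the usual reading of the terms: RPO caps how much data you can afford to lose (time since the last good copy), and RTO caps how long you can afford to be down.  The class, the function and the sample times are all hypothetical, not taken from any real plan.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    """Targets the business signs up for (hypothetical structure)."""
    rpo: timedelta  # maximum tolerable data loss, measured in time
    rto: timedelta  # maximum tolerable downtime


def evaluate_recovery(objectives, last_good_copy, outage_start, service_restored):
    """Compare one outage (or DR test) against the RPO and RTO."""
    data_loss = outage_start - last_good_copy    # work done since the last copy
    downtime = service_restored - outage_start   # how long users were without service
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


if __name__ == "__main__":
    # Made-up example: 4 hour RPO, 8 hour RTO.
    targets = Objectives(rpo=timedelta(hours=4), rto=timedelta(hours=8))
    result = evaluate_recovery(
        targets,
        last_good_copy=datetime(2024, 1, 1, 0, 0),
        outage_start=datetime(2024, 1, 1, 3, 0),
        service_restored=datetime(2024, 1, 1, 13, 0),
    )
    print(result)
```

In this made-up run the backups were recent enough (3 hours of loss against a 4 hour RPO) but the restore took too long (10 hours against an 8 hour RTO), which is exactly the kind of gap a tested plan surfaces before a real outage does.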

Depending upon your business, the recovery plan can be very simple or very elaborate.  As a consultant, I live on my laptop and so have a backup (old) laptop.  My documents are stored in the cloud and are backed up.  Simple stuff.

If you run a trading business, your needs are for real time replication and activation.  More elaborate and certainly costly…and “cheap” compared to being out of the market during a market swing.

Do I have some MREs (Meals Ready to Eat) at home?  Yes, enough for a couple weeks as a simple storm can take me personally off the grid.  I’d be fine for a couple weeks.  So for me DR preparedness is about common sense, spending wisely, and knowing you can recover.

If the end of the world does come, let me apologize now for asking you to prepare for an event.  You would have been better off digging a hole for your bunker.

What are your thoughts?

Monday
Mar 26, 2012

When to Update Production and DR

Some companies run a “production only” environment.  Think of a restaurant, which can use paper as a backup system.  The chances are excellent they buy packaged software, and are looking to the software provider to have proper systems management.

Other companies can take environments to another extreme…having multiple development, certification, integration, performance, staging, production and disaster recovery environments, steadfastly promoting code through each environment.  Many organizations take “release management” to levels comparable to large software firms.

Of course, the age-old debate of few formal releases vs quick regular releases is always in play.

This post tackles a simple question.  Assuming you have both a production and a disaster recovery environment, what is the upgrade order?

 

Obviously the circumstances in your environment will drive what you do.  One might argue in a truly active:active environment, there is no such concept as disaster recovery.  For purposes of this post, production and DR are separate, failover is possible, and failover is tested regularly.

Some people argue the natural upgrade approach is to upgrade disaster recovery first, then complete the upgrade process by touching production.  In this approach, every stage of the promotion process is viewed as a test, preserving production for the “final” change.

We posit Disaster Recovery should be upgraded after ensuring Production is stable.

It’s not that we don’t think Production is important.  To the contrary, we revere the Production environment.

By upgrading DR after Production, you assure the business can failover if the upgrade proves untenable in production.  There is always a known working copy available.

IT professionals following this approach have to determine when Production is stable.  Is it an hour?  A day?  A cycle?

We suggest it is after a day’s stable processing.  A day is arbitrary; as a practical matter once changes are in place there is a point of no return where any fixes will be made in the new environment and not after failing back.

IT professionals must remember a promotion cycle is not complete until DR is upgraded.  When organizations neglect to upgrade disaster recovery, they lose their failover ability.
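One way to picture the production-first order is as a small decision function.  This is only a sketch of the approach argued above, not a real deployment tool: the function name, the version strings, the health flag and the 24 hour default soak period are all assumptions for illustration.

```python
def next_step(prod_version, dr_version, target_version,
              prod_healthy, hours_stable, soak_hours=24):
    """Suggest the next action in a production-first promotion cycle.

    DR stays on the previous, known working release until production has
    run the new release cleanly for `soak_hours` (an arbitrary default).
    """
    if prod_version != target_version:
        return "upgrade production"
    if dr_version == target_version:
        return "promotion complete"
    # Production is on the new release; DR is still the known working copy.
    if not prod_healthy:
        return "fail over to DR"
    if hours_stable < soak_hours:
        return "keep watching production"
    return "upgrade DR"


if __name__ == "__main__":
    # Hypothetical walk through one release, 1.0 -> 1.1.
    print(next_step("1.0", "1.0", "1.1", True, 0))
    print(next_step("1.1", "1.0", "1.1", True, 6))
    print(next_step("1.1", "1.0", "1.1", False, 6))
    print(next_step("1.1", "1.0", "1.1", True, 24))
    print(next_step("1.1", "1.1", "1.1", True, 30))
```

Note the cycle only reports "promotion complete" once DR matches production, which is the point of the paragraph above: skip that last step and the failover option quietly disappears.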

Is this a one-size-fits-all recommendation?  No, you need to look at your environment and the changes underway.  Database schema changes and core functionality changes may preclude a phased approach.  Since we fundamentally don’t subscribe to big bang, we suggest always trying to maintain a failover ability.

How does your organization deal with upgrades/migrations to minimize risk?

Monday
Jan 9, 2012

Every Problem has a Solution

It’s Sunday as I write this.  Years of early morning alarms had me wide awake at 5AM.  The house was still quiet, and it seemed natural to catch up on some TV.

Years ago, my daughter and I would watch Grey’s Anatomy together.  She’s since moved off the show, and I still enjoy it.  My DVR had the most recent episode…what better thing to catch at 5AM?

Grey’s Anatomy, courtesy ABC.GO.COM

So, yes world, I watch Grey’s.  And I found myself in full tears during this episode (if you watch online, I was in tears at the 28:50 mark, as an eighteen year old makes life and death decisions for her father.)

In IT, we generally don’t have life and death situations.  In healthcare IT, at an extreme our systems may impact healthcare…and yet as IT staff we are not faced with it.

Certainly many IT types have uttered the words, “I’m going to get killed if…. <fill in the blank>”

  • This system doesn’t come back up
  • This project is late
  • This runs over budget

The truth is, worst case is someone losing their job…and that only happens in extreme cases.

Recently I was having a conversation with the VP of Operations for a major hosting company and commenting on how calm he (always) is.  He laughed…and shared there are nights he doesn’t get sleep.  And that in IT, every problem has a solution.

At first I wanted to challenge the point, and as I thought about it more…I agreed with him.  By the time something is through the development cycle and into production, the “unsolvable problems” have been addressed.

Sure, hardware may break and can be repaired, software may fail and need to be patched…but IT always rises to the challenge.

Management might suggest issues sometimes take too long to address…and that’s where management’s commitment to real, tested, and usable disaster recovery (DR) often comes into play.  With “real” DR, business impacts are often minimized.

Has your business ever died due to IT?