
Entries in Operations (5)

Saturday
Nov 3, 2012

IT Operations – The Unsung Heroes

This is a story of how one company and its operations staff kept the lights on in the aftermath of Hurricane Sandy in New York City.

Three blogging websites are under my general control:

  • Garylkelley.com – a site about technology, fatherhood and restaurant reviews
  • Curriculotta.com – an “alter ego” site for the properly starched shirts over at Harvard Partners
  • Markfidrychfoundation.org – a site for furthering the community work of the late ball player Mark Fidrych.

Squarespace is used to “host” these sites.  WordPress was our original choice, and served us for a while.  I was just never a fan of how WordPress “thinks.”  Personally, I prefer Squarespace.

Squarespace uses PEER1 as their co-location provider, located in New York at 75 Broad Street.

You can imagine my personal dismay when Tuesday, October 30 at 11:34AM I got the following message from Squarespace:

I have some unfortunate news to share. Our primary data center, Peer1, in Lower Manhattan lost power yesterday at about 4:30PM local time. At that time, we smoothly made the transition to generator power and took comfort over the fact that we had enough fuel to last three to four days. (Peer1 stayed online during the last 3 major natural disasters in the area, including a blackout that lasted for days.)

At 8:30PM yesterday, we received reports that the lobby in the data center’s building was beginning to take on water. By 10:30PM, as is sadly the case in most of Lower Manhattan, Peer1’s basement had experienced serious flooding. At 5AM, we learned our data center’s fuel pumps and fuel tanks were completely flooded and unable to deliver any more fuel. At 8AM, they reported that the generators would be able to run for a maximum of four more hours.

Unfortunately, this means that Squarespace will be offline soon (our estimate being at 10:45 AM today).

I then did what any IT ops person would do…and notified my users of this outage:


Of course, I then did what any user would do, and emailed Squarespace support (like they had time for me.)

Can you guys toss up a graphic of some kind so people accessing my sites won’t get a dns error?

(Also, when you’re back there is nothing to do)?

An amazingly fast 26 minutes later, I had a response:

Great question! We will have a holding page up (hosted outside of Squarespace) that will provide messaging about the downtime. Any customers trying to access sites during that time will see that message. Once we are able to bring the system back up, there will be nothing required of you in order for your website to come back online. We expect sites to be available for another 45 minutes at least and please keep an eye for updates on Twitter (@Squarespace, @Squarespacehelp) as we will be providing updates as regularly as possible from there. 

Hope this helps!

Shaun H

Of course, being rather chatty, I responded with:

I will expect something very creative….

Like an overhead view of Sandy going down a toilet bowl. (Trying to bring a smile to your face during this tough time.)

Think the Squarespace version of the Twitter flying whale.

And again, I had a quick response:

Hey Gary,

That’s a great and hilarious suggestion, thank you :) We will definitely keep you updated on our Twitter accounts and Blog page (for as long as we can):

https://twitter.com/squarespace

and

http://blog.squarespace.com/

Hope this helps.

Paulina V.

What then followed was something I find speaks to the spirit of a team focused on service. 

They carried fuel to the generator on the roof.  17 floors.  All by hand.  Squarespace, their co-lo provider PEER1, another company Fog Creek (an online project management firm for collaborative software development) and some hired contractors carried the fuel to the 17th floor where it could be pumped up to the generator on the roof (18th floor).

Ok, let’s do some math.  According to PEER1’s Meredith Eaton, a company spokeswoman, the generator’s consumption rate was about 40 gallons/hour.  That’s eight 5 gallon pails an hour.  At 7.15 pounds/gallon diesel, that’s 286 pounds an hour up 17 floors.  And they did this for a couple days…so at 48 hours this is 13,728 pounds of fuel, or nearly 7 US tons of fuel.
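The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function and constant names are mine, and the inputs are the figures quoted in the paragraph (40 gallons/hour, 5 gallon pails, 7.15 pounds/gallon, 48 hours).

```python
# Figures quoted in the post; the names are illustrative only.
GALLONS_PER_HOUR = 40     # generator consumption rate
PAIL_GALLONS = 5          # capacity of one pail
LB_PER_GALLON = 7.15      # approximate weight of diesel
LB_PER_US_TON = 2000


def fuel_haul(hours, gallons_per_hour=GALLONS_PER_HOUR):
    """Return pails/hour, pounds/hour, total pounds and US tons for a haul."""
    pounds_per_hour = gallons_per_hour * LB_PER_GALLON
    total_pounds = pounds_per_hour * hours
    return {
        "pails_per_hour": gallons_per_hour / PAIL_GALLONS,
        "pounds_per_hour": pounds_per_hour,
        "total_pounds": total_pounds,
        "us_tons": total_pounds / LB_PER_US_TON,
    }


if __name__ == "__main__":
    for name, value in fuel_haul(hours=48).items():
        print(f"{name}: {value:,.2f}")
```

Changing `hours` shows how quickly the load grows if the outage runs longer than a couple of days.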

The following pictures are used with permission of Squarespace:

Thirsty generator, on the roof above 17 floors

Basement level, where the fuel is supposed to be stored

Diesel fuel on street waiting for a lift

Part of the bucket brigade

Now, many would argue Squarespace would be better off with a second data center somewhere with automated failover.  That would carry an increased cost, something this author wouldn’t be willing to pay for.  Disaster recovery desires must be analyzed in light of the costs.  It’s almost laughable to consider a Recovery Time Objective or Recovery Point Objective for these blogs.  If they are down for days, frankly it wouldn’t matter.  These blogs are not time sensitive; the closest thing to a financial impact is the donation capability on the Mark Fidrych Foundation site (which I encourage you to use!)

So due to the heroic efforts of the unsung IT Operations and associated people, PEER1 stayed up, and you are able to enjoy reading this post.

My hat goes off to these people who persevered, with determination and grit, to keep the site going.  In a word, amazing.  I find IT organizations do this often to keep the ship afloat, often without complaint.

Will we continue hosting on Squarespace?  You betcha.

What stories do you have of heroic IT efforts?

One midnight shot of a total bucket brigade

Monday
Nov 2, 2009

Morning Operations Meeting

“Nothing productive ever happened in a meeting,” a friend once stated. He is a thoughtful guy, and his comment was not one to be idly dismissed. As you ponder this during the next meeting you attend, consider the value of a daily touch base on operational issues.


DAILY? Surely you jest.

Whether in crisis or not, a daily session is imperative in any well run operations area. And believe it or not, the meeting can be accomplished in under 10 minutes! It’s all about predictability and preparation.


Predictability

When running meetings like this, use a conference bridge with the same ID each day. Attendees shouldn’t have to search around for the contact information. Use an acronym if you can (the Morning Operations Meeting can be referenced as MOM). A conference bridge of CALLMOM (2255666) is easy to remember.
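If you want to check what digits a memorable word maps to, the standard telephone keypad lettering makes it a short lookup. A minimal sketch (the function name is my own):

```python
# Standard telephone keypad lettering (2=ABC ... 9=WXYZ).
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def vanity_to_digits(word):
    """Translate letters to keypad digits; other characters pass through."""
    return "".join(LETTER_TO_DIGIT.get(ch, ch) for ch in word.upper())


if __name__ == "__main__":
    print(vanity_to_digits("CALLMOM"))  # 2255666
```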

If there’s a critical mass of people at one location, try to use a conference room at that location to run the meeting. Far-flung attendees participating by conference bridge is one thing, but “locals” can attend the meeting in person (rather than sitting at their desks reading emails!)

Pick a time when everyone can attend, based on your business day. Financial services companies will want to have the meeting well before the US stock market opens at 9:30AM (8:00 AM is a good time). If you are a retailer with stores opening at 8:00AM, an earlier time may be more appropriate.

Start the meeting on time each day. Nothing ruins attendance and contributes to time creep more than a meeting where the start time waffles. To do this, a backup chairperson should be in place to start the meeting if the chair is delayed.

The meeting should have the same agenda each day:


  • Roll call

  • Area by area review of any major (customer impacting) issues over the past 24 hours, with an emphasis on any active issues

  • Follow up on prior action items

Minutes should be captured, and emailed to each of the areas.

Preparation

Preparation is another key to this meeting. Since the agenda is the same each day, the “areas” for review can be pre-populated in a draft email. Over the 24 hours since the last meeting, Operations and the Help Desk should “contribute” major customer impacting issues to the draft. So when the meeting actually happens, the Chair is following a script of the meeting (literally reviewing a draft of the “minutes”).

As the meeting is held, the chair can “prompt” speakers if certain issues are glossed over or missed. In this manner, major issues are not missed.

Details are not covered in this status meeting. If the issue is still active, it is placed on “follow up,” and brought back to the meeting. The chair has discretion for cutting off a discussion.

Once the meeting is completed, a brief Summary should be added to the email (suitable for reading on a BlackBerry) and the send key pressed. A wiki can also be used for this.
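One way to keep the draft predictable is to generate it from a fixed template. The sketch below is an illustration only: the area names, the function name, and the sample issue text are all hypothetical, and it produces plain text you would paste into the draft email or wiki page.

```python
from datetime import date

# Hypothetical review areas; substitute the areas your organization covers.
AGENDA_AREAS = ["Operations", "Help Desk", "Network"]


def build_draft_minutes(meeting_date, issues_by_area, follow_ups, summary=""):
    """Return the draft minutes as plain text, in the fixed agenda order."""
    lines = [f"MOM minutes for {meeting_date.isoformat()}", ""]
    if summary:
        # The brief summary goes first so it reads well on a small screen.
        lines += [f"Summary: {summary}", ""]
    lines.append("Major customer impacting issues, past 24 hours:")
    for area in AGENDA_AREAS:
        lines.append(f"{area}:")
        issues = issues_by_area.get(area, [])
        if issues:
            lines += [f"  - {issue}" for issue in issues]
        else:
            lines.append("  - Nothing to report")
    lines += ["", "Follow up on prior action items:"]
    lines += [f"  - {item}" for item in follow_ups] or ["  - None"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Sample data, invented for the example.
    print(build_draft_minutes(
        date(2009, 11, 2),
        {"Help Desk": ["Email slow for remote users (active)"]},
        ["Confirm generator test date"],
        summary="One active issue; placed on follow up.",
    ))
```

Because every area is listed even when it has nothing to report, the chair can walk the draft top to bottom and prompt any area that was glossed over.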


With predictability and preparation, the meeting will flow smoothly. Plan for the meeting to run long the first week or so as people adapt to the meeting style.

Once the minutes start being read, it’s common for people to start wanting to “edit” the minutes after the fact. Some will want immediate retractions issued. My recommendation is to offer to add a “correction” section at the bottom of the minutes and issue it as a part of the daily cycle. Do not get into multiple MOM minutes.

Savvy areas will want to review the “script” in advance. Why not? It allows the overall product to be stronger provided the information is factual.

And one last fun suggestion. Play into the MOM (as Mother) theme. “It’s OK to tell MOM anything. MOM is here to help.” It allows a subtle mindset shift.

And remember, you can fool some of the people all of the time, all of the people some of the time, but you can’t fool MOM.

Saturday
Oct 24, 2009

Data Center Disciplines

I have been teased my entire career about my nearly obsessive behavior around keeping data center rooms neat and tidy. While I’d love to blame my Mother for my neatness, the truth is keeping a data center clean is about one word: discipline.


Having a neat and tidy data center environment sends a reinforcing message to everyone entering about the gravity of the work performed by the systems in the area. This is important for staff, vendors, and clients.

The observational characteristics I look at when I walk in a data center are:


  • Life Safety – are the aisles generally clear? Are there Emergency Power Off switches and fire extinguishers by the main doors, is there a fire suppression system in place, is the lighting all working….

  • Cleanliness – Forty years ago data centers were kept spotless to prevent disk failures. A speck of dust might make a hard drive disk head fail. These days, disks are generally sealed, and can operate in pretty rough environments (consider the abuse of a laptop disk drive.)
    While disk drives are generally sealed, why should data centers be dirty? Look for dust, dirty floors, and filthy areas under raised flooring. One data center I went in had pallets of equipment stored in the space…was the data center for computing or warehousing?

  • Underfloor areas – are the underfloor areas, assuming use as an HVAC plenum, generally unobstructed? More than one data center I’ve been in had so much cable (much abandoned in place) under the floor that the floor tiles wouldn’t lie flat. This impacts airflow, and makes maintenance a challenge.
    I also like to see if the floor tiles are all in place, and if some mechanism is used to prevent cold air escaping through any penetrations. 30% of the cost of running a data center is in the cooling, and making sure the cooling is getting where it needs to be is key. (While at the opposite end of the space, I like to see all ceiling tiles in place. Why cool the area above the ceiling?)

  • HVAC – are the HVAC units working properly? Go in enough data centers, and you’ll learn how to hear if a bearing is failing, or observe if the HVAC filters are not in place. As you walk the room, you can simply feel whether there are hot spots or cold spots. Many units have on board temperature and humidity gauges – are the units running in an acceptable range?

  • Power Distribution Units – are the PDUs filled to the brim, or is there space available? Are blanks inserted into removed breaker positions, or are there “open holes” to the power? When on-board metering is available, are the different phases running within a small tolerance of each other? If not, outages can occur when hot legs trip.

  • Hot Aisle/Cold Aisle – Years ago all equipment in data centers was lined up like soldiers. This led to all equipment in the front of the room being cool, and all the heat cascading to the rear of the room. Most servers today will operate as high as 90 degrees before they shut themselves down or fry. With a hot aisle/cold aisle orientation, including blanks in empty shelves on servers, cooling is used most effectively. Some organizations have moved to in-rack cooling as a designed alternative.

  • Cable plant – the power and communications cable plants are always an interesting telltale sign of data center disciplines. Cables should always be run with 90 degree turns (no transcontinental cable runs, no need for “cable stretching”). Different layers of cables under a raised floor are common (power near the floor, followed by copper communications, then fiber). (A pet peeve of mine in looking at the cable plant is how much of the data center space is occupied with cables. Cables need to get to the equipment, but the cable plant can be outside the cooled footprint of the data center. Taking up valuable data center space for patch panels seems wasteful. One data center devoted 25% of the raised floor space to cable patch panels. All this could have been in unconditioned space.)

  • Error lights – As you walk around the data center, look to see what error lights are illuminated. One can argue that because servers are often monitored electronically, the utility of error lights is lessened. That said, error lights on servers, disk units, communications units, HVAC, Power Distribution Units and the like are just that: errors. The root cause of the error should be eliminated.

  • Leave Behinds – what’s left in the data center is often an interesting archeological study. While most documentation is available online, manuals from systems long since retired are often found in the high priced air and humidity controlled data center environment. Tools from completed projects lying around are a sign that technicians aren’t being thoughtful (I’ll bet their own tools are where they belong).

  • Security – data centers should be locked, and the doors should be kept closed. Access should be severely limited to individuals with Change or Incident tickets. This helps eliminate the honest mistakes.
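The phase check mentioned under Power Distribution Units is easy to turn into a number once you have read the per-phase amperage off the PDU meter. The sketch below assumes one common way of expressing imbalance (largest deviation from the average, divided by the average) and a 10% tolerance; both are my assumptions for illustration, so use whatever limit your electrician or PDU vendor recommends.

```python
def phase_imbalance(amps):
    """Largest deviation from the average load, as a fraction of the average."""
    average = sum(amps) / len(amps)
    if average == 0:
        return 0.0  # nothing is drawing power, so nothing is out of balance
    return max(abs(a - average) for a in amps) / average


def phases_balanced(amps, tolerance=0.10):
    """True when the phases are within the (assumed) tolerance of each other."""
    return phase_imbalance(amps) <= tolerance


if __name__ == "__main__":
    # Readings below are made up for the example.
    for reading in ([20, 21, 19], [30, 15, 15]):
        status = "ok" if phases_balanced(reading) else "rebalance"
        print(reading, f"{phase_imbalance(reading):.0%}", status)
```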

While far from an exhaustive list, this article is to help silence my lifelong critics about my data center obsessions. These are simple things anyone can do to form a point of view on data center disciplines. Obviously, follow-ons with reporting, staff discussions, etc. are appropriate.

Tuesday
Oct 20, 2009

Wanted: Technology to Drive Process

“Technology driving process; It’s not supposed to work that way.”

Anyone with formal training in process engineering knows you start with defining and optimizing your processes, and then use technology to streamline those processes. In an ideal world, this works perfectly. Layer on organizational structures, system ownership and governance models, and personalities, and people naturally gravitate towards what they know best: technology.

A number of years ago I began moving my Infrastructure & Operations division to more of a process-based organization. I, and my management team, attended a series of seminars given by the late Dr. Michael Hammer. We received our certificates in “Process Mastery” and, with our newfound evangelical powers, were ready to transform the organization.

Barbara, having a reputation as an overachiever, volunteered her group, the Desktop Support, Engineering, and Help Desk department, as the first to make the move into the world of process. Over the period of a year, Barbara documented existing processes, designed new processes where they were missing, created a process map, developed Service Level Agreements with users and other IT groups, and conducted training sessions with her team. The results were better delegation of tasks to the right individuals, managing to metrics, happier employees, and, most importantly, improved service levels.

As Barbara’s manager, I pushed for more improvement. Barbara responded by identifying Help Desk requests that could not be resolved on the first call and required assistance from others in the IT organization. In reviewing the list, Barbara realized 30% of the tasks could be shifted to the Help Desk, driving down costs while dramatically improving resolution time for the user. Requests such as resetting passwords, granting access to network file shares, provisioning user logins, email accounts, and printers, distributing remote access security tokens and instructions, and creating new user profiles were all currently being performed by senior systems administrators. The management team thought Barbara’s idea was great, and she was empowered to make it happen.

Barbara thought this would be simple. She had achieved what she thought was buy-in from the entire infrastructure and operations organization. What she didn’t expect was resistance around what people believed gave them power. Employees in other groups were fine giving over these mundane tasks so long as they still had control over approving each transaction. They felt this authority (and the trust that went along with it) is what made them special to me and others in the IT organization. What they failed to see was the erosion in the level of respect they received when they complained about not having enough resources, but failed to seize this opportunity.

The solution came through implementing technologies enabling Help Desk personnel to grant user privileges without being server administrators. Once the systems administrators saw they had not lost any control, were able to delegate tasks to the Help Desk, and were able to focus on more high value projects, they began to think about methods to offload other processes to the Help Desk. Success was achieved.

The lesson from this story is the need to discover what drives people before reengineering their processes. In this situation, the sense of control and the ability to manage the technology were the drivers for the Systems Administrators. Empowering the Systems Administrators to use technology to enable the transfer of some of their processes made them supporters and eventually advocates.

In the case of the Systems Administrators, they defined their processes by technology. Therefore, technology became their driver for re-engineering their processes.

Some of the “squishier” skills, such as process, project management, client management, and budget management, can be difficult for administrators and operators to understand or appreciate, and can be in conflict with their priorities. In a production IT shop, keeping systems up and running and eliminating any user downtime is the top priority. Asking people to take time away from their priorities, particularly in these challenging times, may be counter-productive and may produce defensive behavior. Take care in understanding your audience.

Saturday
Oct 17, 2009

The Implications of Being “On-Call”

Dear family and friends,

I have been paged by the Operators at work because there’s an issue requiring attention and I’m on-call.

Being on call is important. Something is broken and needs to be fixed. Since I work in a larger team, I have to cover every once in a while, although it seems my “on-call” periods fall when important things happen. When you ask me, “Isn’t there someone else they can call?” the answer is simply that it is my turn.

You need to know I see the disappointment on your face when you hear the paging tone on my smart phone.

The truth is I feel the same way; the page is an intrusion into our lives and often it comes at inopportune times.

You see, part of my job is fixing issues, and the other part is making sure we don’t have issues in the first place. That said, things happen.

Yes, I remember being on a conference call Christmas Eve. I haven’t forgotten leaving the concert so I could get to the closest PC (I guess the wireless card improves that!) Looking at your eighth grade “graduation” pictures, snapped while I was in the hallway talking someone through an issue, makes me sad. That special weekend in Nantucket was ruined with me on the phone Saturday night.

Carrying a laptop around every few weeks isn’t my idea of a good time either. It’s heavy, and I can’t have the freedom to ride the rides, go down the water slide, or just be playful with you.

Some people say, “it’s my job, it pays the bills, get used to it.” While true at some level, a “job” pales by comparison to the times I’ve missed. Systems people get paged, and have to fix things.

Other professions use on-call rotations, too. When you are ill, and want to talk to your Doctor, they get a call. Stock traders watch the markets around the world, some even changing their sleep pattern to be “up” for other markets. The plumber was with his family, too, on Thanksgiving when the drain backed up.

When I get paged, there’s often emptiness in my heart. If fixing the problem takes a long time, I really do miss you and often hear you continuing the fun on the other side of the door. And while I’m happy to solve an issue, I also feel really bad when we can’t just pick up where we left off. You see, to me our time together “freezes” when I go into problem solving mode, while you move on to the next thing.

While I’m away, take extra pictures and save me dessert. I’m not being rude; on the contrary, I am very torn.

As soon as I get back, let’s try picking up where we left off.