Steve Francis is Founder and Chief Product Officer of LogicMonitor.
A recent issue of The New Yorker carried an interesting article by Malcolm Gladwell called "The Engineer’s Lament." Gladwell revisits the 1970s Ford incident in which the top-selling Pinto exploded, culminating in the indictment of the Ford Motor Company for reckless homicide. He discusses the range of perceptions that arose after the accident, and how drastically the public's viewpoint differed from that of Ford engineers and the National Highway Traffic Safety Administration (NHTSA).
The public saw an isolated “worst-case outcome” that resulted in catastrophic deaths, believing that engineering adaptations could have reduced the risk of these events.
The engineers saw cars that were performing within specifications. Pintos accounted for 1.9 percent of fatal fire accidents, exactly matching their share of the cars on the road (1.9 percent). The Pinto's construction, design and incident rates were all very similar to those of its competitors: marginally better in some areas, marginally worse in others.
So, why did the Pinto become the poster car for flawed design? I suggest it was because people were looking to make a name for themselves – to become heroes. Journalists, attorneys and politicians heard about the edge cases (such as a Pinto exploding into flames when hit by a van at 50 miles per hour), then investigated, found more edge cases, and decided that something must be done. The result, though well-meaning, diverted resources from an approach that could have had a far greater impact on automobile safety, and it probably would have had zero impact on the cases that triggered the publicity.
So what does this have to do with your IT infrastructure?
This scenario illustrates a trap that even the best executives can fall into: focusing on heroic actions instead of best engineering practices. We all want to reward and recognize team members who go above and beyond expectations, but it’s sometimes hard to differentiate between the good and the not-so-good: activities that are heroic because they address an issue that could not have been foreseen, and heroic-seeming reactions that actually divert resources from more meaningful work.
Here's an example of the latter. After I left a company to move elsewhere, I heard that a staff member who used to report to me had been commended at an all-company meeting. This person drove 90 minutes in the middle of the night to reboot a misbehaving server in the data center and recover from the resulting outage. While it’s great that he was willing to do that, it is not at all heroic. First, the data center had remote hands capability (staff that can be used for exactly these kinds of tasks). Why not call them and have the server rebooted in five minutes (thus reducing the time of the outage by about 95 percent)? Second, why wasn’t every server reachable out-of-band, by ILOM (Integrated Lights Out Manager) or console, or able to be hard-cycled by managed power strips? Under those circumstances, the staff member could have rebooted the server from the comfort of his own bedroom. Lastly, why did the failure of a single server cause an outage in the first place? That's another issue that should be investigated.
Another example is when IT administrators perform a lot of work to achieve a big result that has no impact on the business. For example, tuning kernel I/O schedulers and reducing logged messages may improve CPU efficiency by a few percentage points. However, if all the systems use less than 50 percent of CPU, while storage latency is high and no one has tuned mount options, then the heroic work to tune CPU was a misplaced effort, regardless of how much it improved things.
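One way to avoid this kind of misplaced effort is to check which resource is actually constrained before tuning anything. The following is only a minimal sketch: the metric names, sample readings and thresholds are all hypothetical, chosen to mirror the scenario above, and in practice the readings would come from whatever monitoring system is in use.

```python
# Hypothetical fleet-wide readings (illustrative values, not real measurements).
metrics = {
    "cpu_utilization_pct": 45.0,
    "storage_latency_ms": 38.0,
}

# Hypothetical thresholds above which a resource counts as constrained.
thresholds = {
    "cpu_utilization_pct": 80.0,
    "storage_latency_ms": 20.0,
}


def pressure(metrics, thresholds):
    """Return (metric, reading / threshold) pairs, most constrained first."""
    ratios = {name: metrics[name] / thresholds[name] for name in metrics}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


ranked = pressure(metrics, thresholds)
for name, ratio in ranked:
    status = "over threshold" if ratio > 1.0 else "within threshold"
    print(f"{name}: {ratio:.2f}x threshold ({status})")
```

With these made-up numbers, storage latency ranks first and CPU sits well within its threshold, which is the signal that CPU tuning is not where the effort belongs.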
It’s far more valuable for a company not to have situations that require heroics by making sure IT infrastructure works. Since not all projects are worth unlimited budgets, it’s not realistic to have 100 percent uptime planned for everything. Instead, focus on the most common causes of service disruption, and devise reasonable plans for dealing with them.
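As a rough illustration of what focusing on the most common causes might look like, the sketch below tallies an incident log by cause. The incident list, cause names and downtime figures are entirely hypothetical, and ranking by frequency first is just one possible choice, not a prescribed method.

```python
from collections import defaultdict

# Hypothetical incident log: (cause, minutes of downtime). Not real data.
incidents = [
    ("disk failure", 40),
    ("bad deploy", 25),
    ("disk failure", 55),
    ("data center power loss", 240),
    ("bad deploy", 30),
    ("bad deploy", 20),
    ("disk failure", 35),
]


def rank_causes(incidents):
    """Return (cause, count, total_minutes) rows, most frequent cause first.

    Ties on count are broken by total downtime, largest first.
    """
    counts = defaultdict(int)
    minutes = defaultdict(int)
    for cause, downtime in incidents:
        counts[cause] += 1
        minutes[cause] += downtime
    rows = [(cause, counts[cause], minutes[cause]) for cause in counts]
    return sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)


ranked_causes = rank_causes(incidents)
for cause, count, total in ranked_causes:
    print(f"{cause}: {count} incidents, {total} minutes of downtime")
```

In this invented log the rare power loss has the largest single outage but ranks last by frequency. Whether it deserves a plan of its own is then a question of budget and priorities rather than something the tally can answer.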
If the budget and other constraints given to an IT department are limited, do not expect every application design to tolerate rare but catastrophic events, such as someone hitting the Emergency Power Off button in the data center. Just as importantly, before lauding someone as a hero, understand whether the problem addressed was something that basic engineering would have already solved, or whether the work was something that a rational assessment of priorities would have argued should not be done at all.
IT operations is a team sport, and the best teams will be adding value regularly – not providing opportunities for heroism.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.