It’s typical for hyper-scale data center operators like Amazon to build their own infrastructure technology when it isn’t available on the market or when they feel they can make it cheaper on their own.
One piece of technology Amazon built in-house is meant to circumvent what one of the company’s top infrastructure engineers described as misplaced priorities in the way electrical switchgear vendors design their products.
It is this problem that likely caused last summer’s Delta data center outage that ultimately cost the airline $150 million, as well as the infamous 2013 power outage during the Super Bowl. And James Hamilton, VP and distinguished engineer at Amazon Web Services, has seen this type of failure in data centers he has overseen during his career.
“Operating at much higher scale, I’ve personally encountered it twice in my working life,” he wrote in a post to his personal blog. It’s unclear where he was working when those failures happened, but the engineer spent about a decade at Microsoft before joining Amazon.
Hamilton did not reference Delta specifically in his blog post, but only one major airline data center outage last summer led the airline to later disclose nine-figure losses.
The piece of technology Amazon designed to avoid this type of outage is the firmware that decides what electrical switchgear should do when a data center loses utility power. Typical vendor firmware prioritizes preventing damage to expensive backup generators over preventing a full data center outage, according to Hamilton. Amazon (and probably most other large-scale data center operators) would rather risk losing a sub-$1 million piece of equipment than risk widespread application downtime.
When everything happens as expected during a utility outage (which is the case most of the time), the switchgear waits a few seconds in case utility power comes back (also the most common scenario). If it doesn’t, the switchgear fires up the generators while the data center runs on energy stored by UPS systems. Once the generators are stabilized, the switchgear makes them the primary source of power to the IT systems.
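The normal sequence described above can be sketched as a small decision function. This is only an illustration of the steps as the article describes them, not Amazon's or any vendor's firmware; the `Readings` type, the action names, and the 5-second default for "a few seconds" are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Readings:
    """Hypothetical snapshot of what a switchgear controller can see."""
    utility_ok: bool           # is utility power present?
    seconds_since_loss: float  # time elapsed since utility power dropped
    generators_stable: bool    # are the generators running at stable output?


def next_action(r: Readings, wait_seconds: float = 5.0) -> str:
    """Return the next step in the normal (no-fault) transfer sequence.

    wait_seconds stands in for the article's "a few seconds"; the real
    value is not given in the source.
    """
    if r.utility_ok:
        return "stay_on_utility"
    if r.seconds_since_loss < wait_seconds:
        # Brief hold in case utility power returns; UPS carries the load.
        return "wait_on_ups"
    if not r.generators_stable:
        # Generators are being started; UPS still carries the load.
        return "start_generators"
    # Generators are stable, so they become the primary power source.
    return "transfer_load_to_generators"


if __name__ == "__main__":
    print(next_action(Readings(False, 2.0, False)))
    print(next_action(Readings(False, 30.0, True)))
```

The sketch deliberately leaves out fault detection, which is where vendor firmware and Amazon's firmware part ways.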
Last year’s Delta data center outage was attributed to switchgear “locking out” the generators at the airline’s facility in Atlanta. That’s what most switchgear is designed to do when it senses a major voltage anomaly either in the data center or on the incoming utility feed. Plugging a live generator into a shorted circuit will usually fry the generator, and switchgear locks generators out to avoid that.
In most cases, the fault is outside of the building, so this scheme achieves nothing other than causing a data center outage, Hamilton wrote. (The two events he’s witnessed were caused by cars knocking over aluminum poles, which fell across electrical transmission cables.) In the rare event that there’s a short inside the data center, either a branch circuit breaker opens and the servers it feeds switch to a secondary source of power, or (if the fault is higher in the power distribution system or if a breaker fails to open) a generator may get damaged if it’s not locked out.
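To make the cases in this argument easier to compare, here is a rough sketch that enumerates them under each policy. The function and field names are hypothetical, and it simplifies by assuming a lockout policy trips on any major voltage anomaly. The article also does not say whether the load is lost when an uncleared internal fault reaches a generator, so that value is an assumption too.

```python
def outcome(fault_inside: bool, breaker_clears: bool, lockout: bool) -> dict:
    """Illustrative outcome of a major voltage anomaly under each policy."""
    if lockout:
        # Vendor-style behavior: generators are protected, load is dropped.
        return {"load_dropped": True, "generator_at_risk": False}
    if not fault_inside:
        # Fault is on the utility side: generators simply pick up the load.
        return {"load_dropped": False, "generator_at_risk": False}
    if breaker_clears:
        # Branch breaker opens; affected servers use their secondary feed.
        return {"load_dropped": False, "generator_at_risk": False}
    # Fault high in the distribution system, or a breaker failed to open.
    # Assumption: the load is lost here as well as the generator being at risk.
    return {"load_dropped": True, "generator_at_risk": True}


if __name__ == "__main__":
    for inside in (False, True):
        for clears in (True, False):
            for lock in (True, False):
                print(inside, clears, lock, outcome(inside, clears, lock))
```

Under these assumptions, removing the lockout makes things worse in only one branch, the uncleared internal fault, and in that branch the load is lost under either policy.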
“I would rather put just under $1 million at risk than be guaranteed that the load will be dropped. If just one customer could lose $100 million, saving the generator just doesn’t feel like the right priority,” he wrote.
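The trade-off in that quote can be put as back-of-the-envelope arithmetic. The sketch below uses the two figures from the quote (a generator rounded up to $1 million and a hypothetical $100 million customer loss). The probability parameter and the function names are assumptions introduced for illustration, and the model assumes the worst case where an uncleared internal fault costs both the generator and the load.

```python
GENERATOR_COST = 1_000_000   # "just under $1 million" in the quote, rounded up
OUTAGE_COST = 100_000_000    # the quote's hypothetical single-customer loss


def expected_loss(p_uncleared_internal_fault: float, lockout: bool) -> float:
    """Expected loss per lockout-triggering event, in dollars.

    p_uncleared_internal_fault is a hypothetical probability that the
    anomaly is an internal fault that no breaker clears.
    """
    if lockout:
        # With lockout the load is dropped every time; the generator is safe.
        return float(OUTAGE_COST)
    # Without lockout, assume the worst case costs generator and load.
    return p_uncleared_internal_fault * (GENERATOR_COST + OUTAGE_COST)


def break_even_probability() -> float:
    """Probability at which both policies have the same expected loss."""
    return OUTAGE_COST / (GENERATOR_COST + OUTAGE_COST)


if __name__ == "__main__":
    print(expected_loss(0.1, lockout=True))
    print(expected_loss(0.1, lockout=False))
    print(break_even_probability())
```

With these inputs the break-even point is 100/101, or roughly 99 percent. Under this simple model, the lockout would only pay off if nearly every such event were an uncleared internal fault, which is the opposite of what Hamilton describes.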
When Amazon engineers asked their switchgear manufacturer to eliminate the lockout condition from their firmware, with the understanding that they were willing to accept the potential equipment failure, the vendor declined, so Amazon decided to write its own firmware in-house.
“I’m lucky enough to work at a high-scale operator where custom engineering to avoid even a rare fault still makes excellent economic sense, so we solved this particular fault mode some years back,” Hamilton wrote.