A Storm of Servers: How the Leap Second Led Facebook to Build DCIM Tools
August 6th, 2013 By: Rich Miller
ASHBURN, Va. - For data centers filled with thousands of servers, it’s a nightmare scenario: a huge, sudden power spike as CPU usage soars on every server.
Last July 1, that scenario became real as the “Leap Second” bug caused many Linux servers to get stuck in a loop, endlessly checking the date and time. At the Internet’s busiest data centers, power usage almost instantly spiked by megawatts, stress-testing both the facilities’ power infrastructure and their users’ capacity planning.
The experience is yielding insights into the operations of Facebook’s data centers, and may result in new tools to help hyper-scale companies manage workloads. The Leap Second “server storm” has prompted the company to develop new software for data center infrastructure management (DCIM) to provide a complete view of its infrastructure, spanning everything from the servers to the generators.
For Facebook, the incident also offered insights into the value of flexible power design in its data centers, which kept the status updates flowing as the company nearly maxed out its power capacity.
The Leap Second: What Happened
The leap second bug is a time-handling problem that is a distant relative of the Y2K date issue. A leap second is a one-second adjustment that is occasionally applied to Coordinated Universal Time (UTC) to account for variations in the speed of Earth’s rotation. The 2012 leap second was inserted at the end of June 30, just before midnight UTC on July 1.
A number of web sites immediately experienced problems, including Reddit, Gawker, Stumbleupon and LinkedIn. More significantly, air travel in Australia was disrupted as the IT systems for Qantas and Virgin Australia experienced difficulty handling the time adjustment.
What was happening? The additional second caused particular problems for Linux systems that use the Network Time Protocol (NTP) to synchronize their systems with atomic clocks. The leap second caused these systems to believe that time had “expired,” triggering a loop condition in which the system endlessly sought to check the date, spiking CPU usage and power draw.
When the leap second arrived, power usage spiked dramatically in Facebook’s data centers in Virginia as tens of thousands of servers spun up, frantically trying to sort out the time and date. The Facebook web site stayed online, but the bug created some challenges for the data center team.
“We did lose some cabinets when row level breakers tripped due to high load,” said Tom Furlong, the VP of Site Operations for Facebook. “The number of cabinets brought down was not significant enough for it to impact our users.”
Facebook wasn’t the only one seeing a huge power surge. German web host Hetzner AG said its power usage spiked by 1 megawatt – the equivalent of the power usage of about 1,000 households. The huge French web host OVH, which was running more than 140,000 servers at the time, also reported a huge power spike.
Electric power is the most precious commodity in a server farm. The capacity and cost of power guide most decisions made by data center customers. The Leap Second bug raised a key question: what happens if a huge company unexpectedly maxes out its available power?
The Power Perspective
It’s the type of question that brings wonky debates about power system design into sharp relief. Fortunately, it wasn’t an academic question for the team at DuPont Fabros Technology (DFT).
DuPont Fabros builds and manages data centers that house many of the Internet’s premier brands, including Apple, Microsoft, Yahoo and Facebook. When the leap second bug struck, power usage surged within one of DFT’s huge data centers in Ashburn, Virginia.
Hossein Fateh, the President and CEO of DuPont Fabros, said one of the tenants in the building suddenly saw its power load surge from 10 megawatts to 13 megawatts, and stay there for about 5 hours.
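For a rough sense of scale, the figures Fateh cites can be worked through in a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the load held flat at 13 megawatts for the full five hours (the article says only "about 5 hours"), and the function name is made up for illustration.

```python
def surge_summary(base_mw: float, peak_mw: float, hours: float):
    """Summarize a power surge, assuming the load stays flat at peak_mw."""
    extra_mw = peak_mw - base_mw              # additional draw during the surge
    pct_increase = extra_mw * 100 / base_mw   # surge relative to the normal load
    extra_mwh = extra_mw * hours              # extra energy over the whole event
    return extra_mw, pct_increase, extra_mwh


# Figures from the article: 10 MW rising to 13 MW for about 5 hours.
extra_mw, pct_increase, extra_mwh = surge_summary(10.0, 13.0, 5.0)
print(f"Extra load: {extra_mw:.1f} MW ({pct_increase:.0f}% above normal)")
print(f"Extra energy over the surge: {extra_mwh:.1f} MWh")
```

Under those assumptions, the tenant was drawing nearly a third more power than usual for the length of the event.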
“The building can take it, and that tenant knew it,” said Fateh. “We encourage tenants to go to 99 and 100 percent (of their power capacity). Some have gone over.”
Fateh’s confidence is rooted in DFT’s approach to power infrastructure. It builds its data centers using what’s known as an ISO-parallel design, a flexible approach that offers both the redundancy of parallel designs and the ability to isolate problems.
Multi-tenant data centers like DFT’s contain multiple data halls, which provide customers with dedicated space for their IT gear. The ISO-parallel design employs a common bus to conduct electricity for the building, but also has a choke that can isolate a data hall experiencing electrical faults, protecting other users from power problems. At the same time, the shared bus lets a data hall “borrow” spare capacity from other data halls.
“The power infrastructure available to the customers in the ISO parallel design allows for some over-subscription inside the room,” said Furlong. “In other words, if every row went to max power, you could exceed the contracted capacity available for the room. The rooms have double the distribution board capacity, which means this over-subscription doesn’t trip the room level breaker. Because the rooms are fed by a ring bus, and some customers may never exceed their contracted capacity, in theory there is capacity available if you exceed the capacity of your room.”
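The logic Furlong describes can be illustrated with a toy model. The Python sketch below is a simplification for illustration, not DFT's or Facebook's actual tooling: the class, the function and every megawatt figure are hypothetical, and real electrical systems involve far more than this arithmetic. It captures only the two conditions in the quote: the room's load must stay under its distribution board capacity (double the contracted figure), and any overage must be covered by headroom other rooms on the ring bus are not using.

```python
from dataclasses import dataclass


@dataclass
class Room:
    name: str
    contracted_mw: float  # capacity the tenant has contracted for (hypothetical)
    load_mw: float        # current power draw (hypothetical)

    @property
    def board_capacity_mw(self) -> float:
        # Per the quote, rooms have double the distribution board capacity.
        return 2 * self.contracted_mw

    @property
    def spare_mw(self) -> float:
        # Contracted capacity this room is not currently using.
        return max(0.0, self.contracted_mw - self.load_mw)

    @property
    def overage_mw(self) -> float:
        # Draw beyond the room's contracted capacity.
        return max(0.0, self.load_mw - self.contracted_mw)


def can_ride_through(room: Room, others: list[Room]) -> bool:
    """True if the room's overage fits within both limits in this toy model."""
    if room.load_mw > room.board_capacity_mw:
        return False  # the room-level breaker would trip
    borrowable = sum(r.spare_mw for r in others)  # headroom on the ring bus
    return room.overage_mw <= borrowable


# Illustrative numbers only.
surging = Room("Hall A", contracted_mw=10.0, load_mw=13.0)
quiet = [Room("Hall B", 10.0, 7.0), Room("Hall C", 10.0, 9.0)]
busy = [Room("Hall B", 10.0, 9.5), Room("Hall C", 10.0, 9.5)]

print(can_ride_through(surging, quiet))  # neighbors have 4 MW of headroom
print(can_ride_through(surging, busy))   # neighbors have only 1 MW of headroom
```

In the first case the surging hall's 3 megawatt overage fits within what its neighbors leave unused; in the second it does not, which is why Furlong describes the extra capacity as available only "in theory."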
Like any good landlord, Fateh doesn’t identify the tenant in the leap second incident. Like any cautious data center manager, Furlong doesn’t get into details about his company’s power agreements. But it doesn’t take a lot of imagination to conclude that Facebook benefited from the ISO-parallel design during the server storm last July 1.
S.U. Posted August 7th, 2013
Is this article suggesting fb found a way to not have facilities fight operations every step of the way, even though both teams work for the same company? Maybe I should stop ignoring the fb recruiters.