A Storm of Servers: How the Leap Second Led Facebook to Build DCIM Tools

Cabinets filled with servers in a Facebook data center. The company is developing its own DCIM software, based on insights gained during last year's Leap Second bug. (Photo: Facebook)

ASHBURN, Va. - For data centers filled with thousands of servers, it's a nightmare scenario: a huge, sudden power spike as CPU usage soars on every server.

Last July 1, that scenario became real as the "Leap Second" bug caused many Linux servers to get stuck in a loop, endlessly checking the date and time. At the Internet's busiest data centers, power usage almost instantly spiked by megawatts, stress-testing each facility's power infrastructure and its users' capacity planning.

The experience is yielding insights into the operations of Facebook's data centers, and may result in new tools to help hyper-scale companies manage workloads. The Leap Second "server storm" has prompted the company to develop new software for data center infrastructure management (DCIM) to provide a complete view of its infrastructure, spanning everything from the servers to the generators.

For Facebook, the incident also offered insights into the value of flexible power design in its data centers, which kept the status updates flowing as the company nearly maxed out its power capacity.

The Leap Second: What Happened

The leap second bug is a time-handling problem that is a distant relative of the Y2K date issue. A leap second is a one-second adjustment that is occasionally applied to Coordinated Universal Time (UTC) to account for variations in the speed of Earth's rotation. The 2012 Leap Second was observed at midnight on July 1.

A number of web sites immediately experienced problems, including Reddit, Gawker, Stumbleupon and LinkedIn. More significantly, air travel in Australia was disrupted as the IT systems for Qantas and Virgin Australia experienced difficulty handling the time adjustment.

What was happening? The additional second caused particular problems for Linux systems that use the Network Time Protocol (NTP) to synchronize their systems with atomic clocks. The leap second caused these systems to believe that time had "expired," triggering a loop condition in which the system endlessly sought to check the date, spiking CPU usage and power draw.
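The link between CPU usage and power draw can be illustrated with a rough linear model, in which a server's draw rises from an idle level toward a peak level as utilization climbs. The sketch below is only an illustration: the wattages, the utilization figure and the fleet size are hypothetical values chosen for the example, not numbers reported by any of the companies involved.

```python
# Hypothetical per-server figures, chosen only for illustration.
IDLE_WATTS = 100.0   # assumed draw of an idle server
PEAK_WATTS = 250.0   # assumed draw of a server with its CPUs fully busy


def server_watts(utilization: float) -> float:
    """Approximate a server's power draw with a simple linear model.

    utilization is the CPU utilization as a fraction between 0.0 and 1.0.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    return IDLE_WATTS + (PEAK_WATTS - IDLE_WATTS) * utilization


def fleet_spike_mw(servers: int, before: float, after: float) -> float:
    """Extra power, in megawatts, when every server moves from one
    utilization level to another at the same moment."""
    extra_watts = servers * (server_watts(after) - server_watts(before))
    return extra_watts / 1_000_000


# A hypothetical fleet of 20,000 servers jumping from 25 percent
# utilization to 100 percent.
spike = fleet_spike_mw(servers=20_000, before=0.25, after=1.0)
print(f"Extra load: {spike:.2f} MW")
```

Under these assumed numbers, the jump adds a little more than 2 megawatts of demand. The lower the normal utilization, the larger the gap a fleet-wide CPU loop has to fill.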

As midnight arrived in the Eastern time zone, power usage spiked dramatically in Facebook's data centers in Virginia, as tens of thousands of servers spun up, frantically trying to sort out the time and date. The Facebook web site stayed online, but the bug created some challenges for the data center team.

"We did lose some cabinets when row level breakers tripped due to high load," said Tom Furlong, the VP of Site Operations for Facebook. "The number of cabinets brought down was not significant enough for it to impact our users."

Facebook wasn't the only one seeing a huge power surge. German web host Hetzner AG said its power usage spiked by 1 megawatt - the equivalent of the power usage of about 1,000 households. The huge French web host OVH, which was running more than 140,000 servers at the time, also reported a huge power spike.

Electric power is the most precious commodity in a server farm. The capacity and cost of power guide the decisions of most data center customers. The Leap Second bug raised a key question: what happens if a huge company unexpectedly maxes out its available power?

The Power Perspective

It's the type of question that brings wonky debates about power system design into sharp relief. Fortunately, it wasn't an academic question for the team at DuPont Fabros Technology (DFT).

DuPont Fabros builds and manages data centers that house many of the Internet's premier brands, including Apple, Microsoft, Yahoo and Facebook. As the leap second bug triggered, the power usage surged within one of DFT's huge data centers in Ashburn, Virginia.

Hossein Fateh, the President and CEO of DuPont Fabros, said one of the tenants in the building suddenly saw its power load surge from 10 megawatts to 13 megawatts, and stay there for about 5 hours.
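For a sense of scale, the figures Fateh cites can be worked through in a few lines. This is simple arithmetic on the quoted numbers, treating "about 5 hours" as exactly five and assuming the load held steady at 13 megawatts for the whole period.

```python
def surge_summary(baseline_mw: float, peak_mw: float, hours: float):
    """Return the extra load (MW), the percentage increase over the
    baseline, and the extra energy consumed (MWh) during a surge."""
    extra_mw = peak_mw - baseline_mw
    percent_increase = 100.0 * extra_mw / baseline_mw
    extra_mwh = extra_mw * hours
    return extra_mw, percent_increase, extra_mwh


# Figures quoted by Fateh: 10 MW rising to 13 MW for about 5 hours.
extra_mw, percent_increase, extra_mwh = surge_summary(10.0, 13.0, 5.0)
print(f"Extra load: {extra_mw} MW ({percent_increase}% above baseline)")
print(f"Extra energy over the surge: {extra_mwh} MWh")
```

That works out to a 30 percent jump in load, or roughly 15 megawatt-hours of additional energy over the course of the surge.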

"The building can take it, and that tenant knew it," said Fateh. "We encourage tenants to go to 99 and 100 percent (of their power capacity). Some have gone over."

Fateh's confidence is rooted in DFT's approach to power infrastructure. It builds its data centers using what's known as an ISO-parallel design, a flexible approach that offers both the redundancy of parallel designs and the ability to isolate problems.

Multi-tenant data centers like DFT's contain multiple data halls, which provide customers with dedicated space for their IT gear. The ISO-parallel design employs a common bus to conduct electricity for the building, but also has a choke that can isolate a data hall experiencing electrical faults, protecting other users from power problems. At the same time, the system can "borrow" spare capacity from other data halls.

"The power infrastructure available to the customers in the ISO parallel design allows for some over-subscription inside the room," said Furlong. "In other words, if every row went to max power, you could exceed the contracted capacity available for the room. The rooms have double the distribution board capacity, which means this over-subscription doesn’t trip the room level breaker. Because the rooms are fed by a ring bus, and some customers may never exceed their contracted capacity, in theory there is capacity available if you exceed the capacity of your room."

Like any good landlord, Fateh doesn't identify the tenant in the leap second incident. Like any cautious data center manager, Furlong doesn't get into details about his company's power agreements. But it doesn't take a lot of imagination to conclude that Facebook benefited from the ISO-parallel design during the server storm last July 1.
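The over-subscription Furlong describes can be sketched as a toy model. Everything in it is an assumption made for illustration: the class and function names are hypothetical, the kilowatt figures are invented, and "double the distribution board capacity" is read here as twice the room's contracted capacity. It is not a description of DFT's actual electrical design.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Room:
    name: str
    contracted_kw: float       # capacity the tenant has contracted for
    row_breaker_kw: float      # rating of each row-level breaker
    row_loads_kw: List[float]  # current demand of each row

    @property
    def board_capacity_kw(self) -> float:
        # Assumption: the distribution boards carry twice the contracted load.
        return 2 * self.contracted_kw

    def tripped_rows(self) -> List[int]:
        """Indexes of rows whose demand exceeds the row breaker rating."""
        return [i for i, kw in enumerate(self.row_loads_kw)
                if kw > self.row_breaker_kw]

    def load_kw(self) -> float:
        """Room load, counting only rows whose breaker has not tripped."""
        return sum(kw for kw in self.row_loads_kw
                   if kw <= self.row_breaker_kw)


def assess(room: Room, neighbours: List[Room]) -> dict:
    """Compare a room's load with its contract and with the unused
    contracted capacity of the other rooms on the shared bus."""
    load = room.load_kw()
    overage = max(0.0, load - room.contracted_kw)
    spare = sum(max(0.0, n.contracted_kw - n.load_kw()) for n in neighbours)
    if load > room.board_capacity_kw:
        status = "exceeds distribution board capacity"
    elif overage == 0:
        status = "within contract"
    elif overage <= spare:
        status = "over contract, covered by spare capacity on the bus"
    else:
        status = "over contract, not enough spare capacity"
    return {"tripped_rows": room.tripped_rows(), "load_kw": load,
            "overage_kw": overage, "spare_kw": spare, "status": status}


# Ten rows with 150 kW breakers could draw 1,500 kW in total, which is
# more than the 1,000 kW contract: the room is over-subscribed.
surging = Room("hall A", 1000.0, 150.0, [130.0] * 9 + [160.0])
quiet = Room("hall B", 1000.0, 150.0, [70.0] * 10)
report = assess(surging, [quiet])
print(report)
```

In this example one row trips its breaker, the remaining nine rows still pull 170 kW more than the room's contract, and the shortfall is covered by the 300 kW that the quieter hall is not using.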


Rows of cabinets inside one of the data halls in a Facebook data center. (Photo: Rich Miller)

Facebook's New Focus: DCIM

One outcome of the Leap Second is that Facebook is focused on making the best possible use of both its servers and its power - known in IT as "capacity utilization." When Furlong looked at the spike in Facebook's data center power usage, he saw opportunity as well as risk.

"Most of our machines aren’t at 100 percent of capacity, which is why the power usage went up dramatically," said Furlong. "It meant we had done good, solid capacity planning. But it left us wondering if we were leaving something on the table in terms of utilization."

That has led Facebook to develop its own tools for DCIM software, which packages information about IT systems (servers, storage and network) with data about building operations, including the data center's power and cooling systems. The goal is to provide a seamless, real-time view of all elements of data center operations, allowing data center managers to quickly assess a range of factors that impact efficiency and capacity.

"A lot of this plays into our efficiency initiatives," said Furlong, who said his team must balance the need for both flexibility and efficiency. "Through the leap second, we found that we were good on flexibility. One of the efficiencies we now want to drill into is how we use the building."

Efficiency, CapEx and the Single Pane of Glass

Efficiency has been a particular focus for Facebook, which has designed its own servers, storage and data centers to optimize nearly every element of its operations. With its DCIM initiative, which is still in beta, Facebook is focused on harnessing data to improve how it models its data center operations. Better modeling will allow Facebook to be more granular in its management of cluster operations, power-efficient load balancing, and power capping of servers - all of which could improve performance while reducing energy use. The way Facebook sees it, better utilization translates into lower capital expenditures.

This quest for a "single pane of glass" has long been the Holy Grail of data center management software. But no vendor has yet created a true killer product - which is one reason that there are more than 70 players in the market for DCIM software, including software companies, vendors of power equipment and monitoring tools, and even colocation providers. The lack of a dominant product is largely due to the complexity of the data center ecosystem, and the many existing IT packages and building management systems (BMS) that must work together.

"No BMS is truly off the shelf," said Furlong. "As we went through this process, we realized that each has capabilities that the other doesn’t."

The best solution wound up being a combination of third-party software and in-house tools developed by the Facebook team. "We are still in beta and have a lot of pieces coming together," said Furlong.

Furlong shared an overview of Facebook's DCIM initiative at the recent DataCenterDynamics event in San Francisco, and plans to share more details at the Open Compute Summit in January 2014 in San Jose. The Open Compute Project was formed to share Facebook's server and data center designs and create an "open hardware" ecosystem to boost innovation in hyper-scale computing.

It's not yet clear whether Facebook will open source its DCIM software, which ties together components that may be proprietary. But the plans to discuss the project at the Open Compute Summit suggest that some type of DCIM tools will emerge from the effort.

Which would serve as evidence that sometimes, the landscape can be changed in a single second.
