Facebook Open Sources Data Center Network Fault Detection Tools

NetNORAD system detects and localizes device failure in social network’s global data center infrastructure within seconds

Yevgeniy Sverdlik

February 18, 2016

Facebook’s Six Pack switch is a 7RU chassis that includes eight of its Wedge switches and two fabric cards (Photo: Facebook)

Several years ago Facebook shut down an entire data center to test the resiliency of its application. According to Jay Parikh, the company’s head of engineering, the test went smoothly. The data center going offline did not disrupt anybody’s ability to mindlessly scroll through their Facebook feed instead of spending time being a contributing member of society.

Facebook and other web-scale data center operators, companies that built global internet services that make billions upon billions of dollars, have shifted the data center resiliency focus from redundancy and automation of the underlying infrastructure – the power and cooling systems – to software-driven failover. A globally distributed system that consists of so many servers can easily lose some of those servers without any significant impediment to the application’s performance.

That’s not to say they’ve abandoned backup generators, UPS systems, and automatic transfer switches. You’ll still see all of those things in Facebook data centers; it’s just that they are no longer the single line of defense.

Today, Facebook open sourced some of the software tools it has built in-house that help its engineers detect the location of an outage within its infrastructure down to a single cluster of servers within a matter of seconds, isolate the fault, and avoid a wider-scale issue.

The tools are parts of a system called NetNORAD, which constantly monitors the entire Facebook data center infrastructure for packet loss rates and latency. Using data analytics, it detects abnormal patterns and triggers alarms, usually within 30 seconds of a fault.
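To make the alarm step concrete, here is a minimal Python sketch of one way to flag abnormal packet loss. It is an illustration only, not Facebook's analytics: the `LossAlarm` class, its rolling window, and its threshold values are all hypothetical choices made for this example.

```python
import statistics
from collections import deque


class LossAlarm:
    """Flag packet-loss samples that stand out from a rolling baseline.

    Hypothetical illustration; the window size, sigma multiplier, and
    absolute floor are assumed values, not NetNORAD settings.
    """

    def __init__(self, window=30, sigmas=3.0, floor=0.01):
        self.history = deque(maxlen=window)  # recent "normal" loss ratios
        self.sigmas = sigmas                 # deviations above the mean that count as abnormal
        self.floor = floor                   # never alarm below this loss ratio

    def observe(self, loss):
        """Record one loss ratio (0.0 to 1.0); return True if it looks abnormal."""
        abnormal = False
        if len(self.history) >= 5:  # wait for a few samples before judging
            mean = statistics.mean(self.history)
            spread = statistics.pstdev(self.history)
            abnormal = loss > max(self.floor, mean + self.sigmas * spread)
        if not abnormal:
            # Only normal samples feed the baseline, so a sustained
            # fault does not quietly become the new normal.
            self.history.append(loss)
        return abnormal
```

In this sketch one `LossAlarm` would be kept per monitored cluster and fed a fresh loss ratio every few seconds. A real system would also need to account for latency and for correlated failures across clusters.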

“Our scale means that equipment failures can and do occur on a daily basis, and we work hard to prevent those inevitable events from impacting any of the people using our services,” Petr Lapukhov, a network engineer at Facebook, wrote in a blog post. “The ultimate goal is to detect network interruptions and automatically mitigate them within seconds. In contrast, a human-driven investigation may take multiple minutes, if not hours.”

Facebook is open sourcing two components of NetNORAD. The first is pinger and responder, a system in which a set of servers (the pingers) continuously probes all servers in Facebook data centers and generates packet loss and latency data from the responses received. The second is fbtracert, a tool that automatically determines the exact location of a fault.
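The pinger and responder idea can be sketched in a few lines of Python. This is a simplified illustration, not Facebook's released code: the choice of UDP echo probes, the function names (`run_responder`, `ping`, `summarize`), and the timeout values are assumptions made for this example.

```python
import socket
import statistics
import threading
import time
from typing import List, Optional, Tuple


def run_responder(sock: socket.socket, stop: threading.Event) -> None:
    """Echo every datagram back to its sender until `stop` is set."""
    sock.settimeout(0.1)
    while not stop.is_set():
        try:
            payload, sender = sock.recvfrom(2048)
        except socket.timeout:
            continue  # nothing arrived; check the stop flag again
        sock.sendto(payload, sender)


def ping(target: Tuple[str, int], count: int = 10,
         timeout: float = 0.2) -> List[Optional[float]]:
    """Send `count` probes; return one RTT in seconds per probe, or None if lost."""
    rtts: List[Optional[float]] = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for seq in range(count):
            payload = seq.to_bytes(4, "big")  # sequence number identifies the probe
            sent_at = time.monotonic()
            try:
                sock.sendto(payload, target)
                while True:
                    reply, _ = sock.recvfrom(2048)
                    if reply == payload:  # skip late replies to earlier probes
                        break
                rtts.append(time.monotonic() - sent_at)
            except OSError:  # a timeout or socket error counts as a lost probe
                rtts.append(None)
    return rtts


def summarize(rtts: List[Optional[float]]) -> dict:
    """Turn raw probe results into a loss ratio and a median latency."""
    answered = [r for r in rtts if r is not None]
    lost = len(rtts) - len(answered)
    return {
        "loss": lost / len(rtts) if rtts else 0.0,
        "median_rtt_ms": statistics.median(answered) * 1000 if answered else None,
    }
```

To try it on one machine, bind a UDP socket to `("127.0.0.1", 0)`, start `run_responder` in a thread, and pass the socket's address to `ping`. Feeding the result to `summarize` yields the kind of per-target loss and latency figures the article describes.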

For more details on how NetNORAD works, read Lapukhov’s blog post.
