Facebook Open Sources Data Center Network Fault Detection Tools

Several years ago Facebook shut down an entire data center to test the resiliency of its application. According to Jay Parikh, the company’s head of engineering, the test went smoothly. The data center going offline did not disrupt anybody’s ability to mindlessly scroll through their Facebook feed instead of spending time being a contributing member of society.

Facebook and other web-scale data center operators, companies that built global internet services that make billions upon billions of dollars, have shifted the data center resiliency focus from redundancy and automation of the underlying infrastructure – the power and cooling systems – to software-driven failover. A globally distributed system that consists of so many servers can easily lose some of those servers without any significant impediment to the application’s performance.

That’s not to say they’ve abandoned backup generators, UPS systems, and automatic transfer switches. You’ll still see all of those things in Facebook data centers; it’s just that they are no longer the only line of defense.

Today, Facebook open sourced some of the software tools it has built in-house that help its engineers detect the location of an outage within its infrastructure down to a single cluster of servers within a matter of seconds, isolate the fault, and avoid a wider-scale issue.

The tools are parts of a system called NetNORAD, which constantly monitors the entire Facebook data center infrastructure for packet loss rates and latency. Using data analytics, it detects abnormal patterns and triggers alarms, usually within 30 seconds of a fault.
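To make the idea of flagging abnormal loss and latency concrete, here is a minimal Python sketch. It is not NetNORAD's analytics, which the article does not describe in detail. The function names (`summarize`, `is_abnormal`), the baseline comparison, and the thresholds (`loss_margin`, `rtt_factor`) are all assumptions chosen for illustration.

```python
from statistics import median


def summarize(probes):
    """Summarize one batch of probe results.

    `probes` is a list of round-trip times in milliseconds,
    with None standing in for a probe that got no response.
    """
    sent = len(probes)
    rtts = [r for r in probes if r is not None]
    loss = (sent - len(rtts)) / sent if sent else 0.0
    return {"loss": loss, "median_rtt_ms": median(rtts) if rtts else None}


def is_abnormal(summary, baseline, loss_margin=0.01, rtt_factor=2.0):
    """Return True if loss or latency is well above the baseline.

    The margin and factor are illustrative assumptions, not values
    taken from Facebook's system.
    """
    if summary["loss"] > baseline["loss"] + loss_margin:
        return True
    rtt, base_rtt = summary["median_rtt_ms"], baseline["median_rtt_ms"]
    if rtt is None or base_rtt is None:
        return False
    return rtt > base_rtt * rtt_factor


if __name__ == "__main__":
    baseline = summarize([0.4, 0.5, 0.5, 0.6])
    current = summarize([0.5, None, None, 0.6])
    print(current, is_abnormal(current, baseline))
```

A production system would compare many such batches across clusters and time windows before raising an alarm; a single fixed threshold like this one is only the simplest possible starting point.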

“Our scale means that equipment failures can and do occur on a daily basis, and we work hard to prevent those inevitable events from impacting any of the people using our services,” Petr Lapukhov, a network engineer at Facebook, wrote in a blog post. “The ultimate goal is to detect network interruptions and automatically mitigate them within seconds. In contrast, a human-driven investigation may take multiple minutes, if not hours.”

Facebook is open sourcing two pieces of NetNORAD. The first is pinger and responder, a system in which a set of servers (pingers) continuously probe all servers in Facebook data centers and generate packet loss and latency data based on the responses they receive. The second is fbtracert, a tool that automatically determines the exact location of a fault.
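The pinger and responder pattern can be sketched in a few lines of Python. This is not Facebook's code and does not reflect its wire format. The choice of UDP echo probes, the payload layout, and the names `run_responder`, `ping`, and `demo` are all assumptions made for illustration, and everything runs over the loopback interface.

```python
import socket
import struct
import threading
import time


def run_responder(sock, stop_event):
    """Echo every UDP datagram back to its sender until stop_event is set."""
    sock.settimeout(0.1)
    while not stop_event.is_set():
        try:
            data, addr = sock.recvfrom(1024)
        except socket.timeout:
            continue
        sock.sendto(data, addr)


def ping(target, count=20, timeout=0.2):
    """Send `count` UDP probes to `target`.

    Returns (loss_rate, rtts), where rtts holds the round-trip time
    in seconds of each probe that was answered in time.
    """
    rtts = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for seq in range(count):
            # Payload: sequence number plus send timestamp (hypothetical layout).
            payload = struct.pack("!Id", seq, time.monotonic())
            sock.sendto(payload, target)
            try:
                while True:
                    data, _ = sock.recvfrom(1024)
                    echoed_seq, sent_at = struct.unpack("!Id", data)
                    if echoed_seq == seq:
                        rtts.append(time.monotonic() - sent_at)
                        break
            except (socket.timeout, struct.error):
                pass  # no valid reply in time: count this probe as lost
    return 1 - len(rtts) / count, rtts


def demo(count=10):
    """Start a responder on a free loopback port and ping it."""
    responder_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    responder_sock.bind(("127.0.0.1", 0))
    stop = threading.Event()
    thread = threading.Thread(
        target=run_responder, args=(responder_sock, stop), daemon=True
    )
    thread.start()
    try:
        return ping(responder_sock.getsockname(), count=count)
    finally:
        stop.set()
        thread.join()
        responder_sock.close()


if __name__ == "__main__":
    loss, rtts = demo()
    print(f"loss={loss:.0%}, answered={len(rtts)}")
```

In a real deployment the pingers and responders would run on separate machines, and the per-target loss and latency figures would be fed into an aggregation layer rather than printed.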

For more details on how NetNORAD works, read Lapukhov’s blog post.
