SAN ANTONIO – Facebook has been an industry leader in building its Internet infrastructure for scalability. That includes the scalability of the people who work in the company’s data centers.
Each Facebook data center operations staffer can manage at least 20,000 servers, and for some admins the number can be as high as 26,000 systems, according to Delfina Eberly, Director of Data Center Operations at Facebook. Eberly was the keynote speaker Tuesday morning at the 7×24 Exchange 2013 Fall Conference, speaking on “Operations at Scale.”
Facebook’s performance appears to break new ground in the server-to-admin ratio, which has rarely exceeded 10,000 to 1 (see High Scalability for more). The company’s success affirms the potential of using an integrated approach in which the operations team works closely with other teams in IT and facilities.
Data center operations is a critical skill at Facebook, which now has 1.15 billion users, including 720 million who log in daily. Each day, Facebook users share 4.75 billion content items and “like” 4.5 billion items. The company now stores more than 240 billion photos, and adds 7 petabytes of photo storage each month.
To manage all that activity, Facebook has developed software to automate many aspects of data center operations. That includes software known as CYBORG, which detects problems with servers and attempts to fix them. If CYBORG exhausts its automated repair options, it sends an alert to the ticketing system to dispatch a data center staffer to investigate the issue.
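The detect, repair, then escalate flow described for CYBORG can be illustrated with a minimal sketch. This is not Facebook's code: the `Server`, `TicketQueue`, and `remediate` names, and the idea of passing repair steps as a list of functions, are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Server:
    """Hypothetical record for a machine flagged as unhealthy."""
    hostname: str
    healthy: bool = False


@dataclass
class TicketQueue:
    """Hypothetical stand-in for a ticketing system."""
    tickets: List[str] = field(default_factory=list)

    def open_ticket(self, hostname: str, reason: str) -> None:
        self.tickets.append(f"{hostname}: {reason}")


# A repair step takes a server and reports whether it fixed the problem.
RepairStep = Callable[[Server], bool]


def remediate(server: Server, repairs: List[RepairStep], tickets: TicketQueue) -> bool:
    """Try each automated repair in order; open a ticket only if all fail."""
    for repair in repairs:
        if repair(server):
            server.healthy = True
            return True
    # Automated options are exhausted, so a person has to look at the machine.
    tickets.open_ticket(server.hostname, "automated repair exhausted")
    return False
```

In this sketch a technician is dispatched only when every automated step has returned `False`, which mirrors the stated goal of keeping staff off the floor unless hands-on work is needed.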
“Our goal is not to deploy a technician to the data center floor unless they actually have to physically handle a server,” said Eberly.
“We want to hang onto our talent,” she said. “The way you do that is to give them the opportunity to work on high-value tasks. We want them to stay and improve. This matters to us.”
Eberly is a veteran of the data center industry. She began her career at McKesson in 1998 and went on to stints at colocation pioneer Exodus Communications and Critical Path.
Design Supports Serviceability
At Facebook, the operations team’s time and workload are considered during hardware design. An example: all servers are designed to be serviced from the front, so data center staffers have no need to enter the hot aisle. The server is designed so drives and components can be replaced without tools. The result: Facebook has reduced the time needed to repair servers by 54 percent.
The Facebook operations team carefully tracks equipment failure rates, and the data is reviewed when the company makes supply chain decisions, Eberly said. The company’s asset management and ticketing systems track hard drives and other components by serial number, providing complete insight into the life cycle of each piece of hardware.
Eberly said these systems are sophisticated, but didn’t require an army of developers. Facebook has three software engineers dedicated to the operations team. “They’re absolutely vital to the work we do in the data centers,” she said.
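As a rough illustration of how serial-number tracking can feed failure-rate reviews, the sketch below joins an inventory list against the serial numbers that appear in failure tickets. The `Component` type, the `failure_rates` function, and the per-model grouping are hypothetical; the article does not describe how Facebook's systems actually compute these figures.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Component:
    """Hypothetical inventory record keyed by serial number."""
    serial: str
    model: str


def failure_rates(inventory: Iterable[Component],
                  failed_serials: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of installed units per model that have failed."""
    failed = set(failed_serials)
    installed: Counter = Counter()
    failures: Counter = Counter()
    for component in inventory:
        installed[component.model] += 1
        if component.serial in failed:
            failures[component.model] += 1
    # Every model in `installed` has at least one unit, so no division by zero.
    return {model: failures[model] / installed[model] for model in installed}
```

A per-model rate like this is the kind of figure that could inform a supply chain decision, for example by comparing two drive models deployed in similar numbers.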