Web caching: Facebook’s Problem of a Thousand Servers

Distributed computing systems at extreme scale are governed by a different set of rules than small-scale systems. One of the differences Facebook has discovered is the role of web caching.

Traditionally used to take some load off database servers and make websites load faster, web caching has become a necessity for the site rather than a nice-to-have optimization. And Facebook isn’t the only one. Caching has taken on a key infrastructure role at Twitter, Instagram and Reddit as well.

Facebook infrastructure engineers have built a tool called mcrouter to manage caching. Earlier this month, the company open sourced the code for mcrouter, making the announcement at its @Scale conference in San Francisco.

Mcrouter is essentially a memcached protocol router that manages all traffic to, from and between thousands of cache servers in dozens of clusters in Facebook data centers. It makes memcached workable at Facebook's scale.

Memcached is a system for caching data in server memory in distributed infrastructure. It was originally developed for LiveJournal but is now an indispensable part of infrastructure at many major Internet companies, including Zynga, Twitter, Tumblr and Wikipedia.

Instagram adopted mcrouter for its infrastructure when it was running on Amazon Web Services, before it was moved to Facebook data centers. Reddit, which runs on AWS, has been testing mcrouter and plans to implement it at scale in the near future.

Formalizing open source software for web scale

Facebook uses and creates a lot of open source software to manage its data centers. At the same conference the company announced a new initiative, formed together with Google, Twitter, Box and GitHub, among others, to make open source tools, such as mcrouter, easier to adopt.

Together, the companies formed an organization called TODO (Talk Openly, Develop Openly) which will act as a clearinghouse of sorts. Details about its plans are scarce, but in general, the organization will develop best practices and endorse certain open source projects so users can have certainty that what they are adopting has been used in production by one or more of its members.

Where “Likes” live

Mcrouter became necessary at Facebook as the site added certain features. One of these features was the social graph, the application that tracks connections between people, their tastes and the things they do when using Facebook, said Rajesh Nishtala, a software engineer at Facebook.

The social graph contains people’s names, their connections and objects, which are photos, posts, Likes, comment authorship and location data. “These small pieces of data … are what makes caching important,” Nishtala said.

These bits of data are stored on Facebook’s cache servers and get pulled from them every time a user’s device loads a Facebook page. The cache tier performs more than 4 billion operations per second, and mcrouter is what ensures the infrastructure can handle the volume.

From load balancing to failover

Mcrouter is a piece of middleware that sits between a client and a cache server, communicating on the cache’s behalf, Nishtala explained. It has a long list of functions, three of the most important ones being cache connection pooling, splitting of workloads into “pools” and automatic failover.

Pooling cache connections helps maintain site performance. If every client connected directly to a cache server on its own, the cache server would get easily overloaded. Mcrouter runs as a proxy that allows clients to share connections, preventing such overloads.
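The effect of sharing connections can be illustrated with a little arithmetic. This is a minimal sketch, not a description of mcrouter's internals: it assumes one proxy per application host and one shared connection from each proxy to each cache server, and the host and client counts are hypothetical.

```python
def connections_per_cache_server(hosts, clients_per_host, via_proxy):
    """Count the inbound connections a single cache server would see.

    Assumption (not from the article): each host runs one proxy that
    holds a single shared connection to every cache server.
    """
    if via_proxy:
        # All clients on a host share the proxy's one connection.
        return hosts
    # Every client process opens its own connection.
    return hosts * clients_per_host


# Hypothetical figures: 300 application hosts, 20 client processes each.
direct = connections_per_cache_server(300, 20, via_proxy=False)
shared = connections_per_cache_server(300, 20, via_proxy=True)
print(direct, shared)
```

Under these assumptions the number of connections per cache server grows with the number of hosts rather than with the total number of client processes.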

When a variety of workloads are competing for cache memory space, the middleware splits them into pools and distributes those pools across multiple servers.
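One way to picture this kind of pool splitting is to route each cache key by a prefix and then hash it onto a server within the chosen pool. The sketch below is an illustration under stated assumptions, not mcrouter's configuration format or API: the key prefixes, pool layout and server names are all hypothetical, and simple modulo hashing stands in for whatever hashing scheme a production router would use.

```python
import hashlib

# Hypothetical pools: each workload gets its own set of cache servers.
POOLS = {
    "photo:": ["cache-a1:11211", "cache-a2:11211"],
    "like:": ["cache-b1:11211", "cache-b2:11211", "cache-b3:11211"],
}
# Keys that match no prefix fall through to a shared default pool.
DEFAULT_POOL = ["cache-c1:11211"]


def pick_pool(key):
    """Select a pool by key prefix, so workloads do not share memory."""
    for prefix, servers in POOLS.items():
        if key.startswith(prefix):
            return servers
    return DEFAULT_POOL


def pick_server(key):
    """Hash the key onto one server in its pool (stable across runs)."""
    servers = pick_pool(key)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

Because each prefix maps to its own servers, a burst of traffic for one workload cannot evict another workload's entries. Modulo hashing remaps most keys when a pool changes size, which is why real deployments tend to prefer a consistent hashing scheme.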

If a cache server goes down, mcrouter automatically switches to a backup cache. Once it has, it continuously checks whether the connection to the primary server has been restored.

It can also do “pool-level” failover. When an entire pool of cache servers is inaccessible, mcrouter automatically shifts the workload to a pool that’s available.
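Both kinds of failover can be sketched as an ordered search over destinations. This is a simplified illustration, not mcrouter's actual logic: the function name, the pool layout and the `is_up` health check are hypothetical, and a real router would track health with timeouts and probes rather than a simple callback.

```python
def choose_destination(pools, is_up):
    """Return the first reachable server, or None if nothing is up.

    pools: ordered list of pools, preferred pool first; each pool is an
           ordered list of servers, primary first, backups after.
    is_up: callable taking a server name and returning True if reachable.
    """
    for pool in pools:
        for server in pool:
            if is_up(server):
                return server
        # Every server in this pool is down: fall through to the next pool.
    return None


# Hypothetical layout: a primary pool and a failover pool.
pools = [["a1", "a2"], ["b1", "b2"]]
down = {"a1"}
print(choose_destination(pools, lambda s: s not in down))
```

Because health is re-evaluated on every call, traffic returns to the primary server as soon as the check reports it reachable again.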

Reddit expects mcrouter to make AWS easier

Reddit has been testing mcrouter on one production pool, Ricky Ramirez, systems administrator at Reddit, said.

The site scales from about 170 to 300 application servers on AWS every day and has about 70 backend cache nodes with about 1TB of memory. Ramirez’ team (three operations engineers) relies on memcached for a lot of things.

The pain point they are addressing with mcrouter is their inability to switch easily to the new instance types that Amazon constantly cranks out. “It’s very stressful and takes a lot of time out of the operations engineers,” Ramirez said.

After a successful test run on one production pool, doing about 4,200 operations per second, the team plans to use mcrouter a lot more. The next pool that involves Facebook’s middleware will do more than 200,000 operations per second.

The engineers plan to use it to test new cloud VM instance types and to replace servers seamlessly, without downtime. By offloading some of the complexity of managing changes in the infrastructure, Ramirez said he expects to see significant performance gains.

More details on mcrouter on the Facebook Engineering Blog
