Chaos Kong is Coming: A Look At The Global Cloud and CDN Powering Netflix
October 17th, 2013 By: Rich Miller
NEW YORK - Netflix is continuing to expand its infrastructure, both on the Amazon Web Services cloud and in data centers around the world. As part of this expansion, the famous Chaos Monkey and Chaos Gorilla are getting a beefy new relative: Chaos Kong.
The streaming video titan’s infrastructure was the focus of a presentation at the O’Reilly Velocity 2013 NYC conference by Jeremy Edberg, who heads the site reliability team at Netflix. Edberg, who was also the first paid employee at Reddit, gave a wide-ranging talk on how Netflix manages its huge operation and the role developers play in the process.
Netflix sees about 2 billion requests per day to its API, which serves as the “front door” for devices requesting videos, and routes the requests to the back-end services that power Netflix. That activity generates about 70 to 80 billion data points each day that are logged by the system.
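A quick back-of-envelope calculation puts those daily figures in per-second terms (assuming even load over the day, which real traffic is not; peaks run well above the average):

```python
# Back-of-envelope rates from the figures above.
requests_per_day = 2_000_000_000
data_points_per_day = 75_000_000_000  # midpoint of the 70-80 billion range
seconds_per_day = 86_400

print(f"{requests_per_day / seconds_per_day:,.0f} API requests/s on average")
print(f"{data_points_per_day / seconds_per_day:,.0f} logged data points/s on average")
```

Even averaged out, that is tens of thousands of API requests and the better part of a million log events every second.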
“We like to say that we’re a logging system that also plays movies,” said Edberg. “We pretty much automate everything. That’s really the key. When there’s an operations task, we try to figure out how to automate it.”
Simian Army Keeps Growing
That includes the Chaos Monkey, a resiliency tool that randomly disables virtual machine instances that are in production on the Amazon cloud. The goal is to engineer applications so they can tolerate random instance failures. It’s one of a suite of Netflix tools known as the Simian Army, which also includes the Chaos Gorilla, which disables an entire AWS Availability Zone. Each of Amazon’s regions includes a number of Availability Zones (AZs) to allow users to create failover options in the event of a local outage.
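The core idea is simple enough to sketch. The snippet below is an illustrative toy, not Netflix’s actual tool (the real Chaos Monkey is a service the company has open sourced): pick one random in-service instance and terminate it, so every service must be built to tolerate losing any single instance.

```python
import random

def chaos_monkey(instances, terminate):
    """Terminate one randomly chosen in-service instance.

    `instances` and `terminate` are stand-ins for a cloud API call;
    this is a sketch of the idea, not Netflix's implementation.
    """
    candidates = [i for i in instances if i["state"] == "in-service"]
    if not candidates:
        return None  # nothing running, nothing to break
    victim = random.choice(candidates)
    terminate(victim["id"])
    return victim["id"]

# A four-instance fleet; the terminate callback just logs the kill.
fleet = [{"id": f"i-{n}", "state": "in-service"} for n in range(4)]
killed = []
chaos_monkey(fleet, killed.append)
print(killed)  # a list holding one randomly chosen instance id
```

The point of running this in production, rather than in a test environment, is that it forces instance failure to be an ordinary, continuously exercised event instead of a rare surprise.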
Since Netflix is now running in three different Amazon regions (Virginia, Oregon and EU-West in Dublin), it has developed Chaos Kong, a tool that simulates an outage affecting an entire Amazon region and then shifts traffic to the remaining regions. Netflix uses Amazon’s Reserved Instances to ensure that it will have capacity available for a wholesale shift of traffic from one region to another.
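At its simplest, a Chaos Kong-style evacuation amounts to redistributing a failed region’s traffic share across the survivors. The sketch below is purely illustrative; the region names and weights are assumptions, not Netflix’s routing logic:

```python
def evacuate_region(weights, failed):
    """Drop the failed region and rescale the survivors' traffic shares
    proportionally so they still sum to 1.0. An illustrative sketch of
    region failover, not Netflix's actual traffic-steering code.
    """
    survivors = {r: w for r, w in weights.items() if r != failed}
    total = sum(survivors.values())
    return {r: w / total for r, w in survivors.items()}

# Three regions, as in the article; suppose Virginia (us-east-1) goes dark.
weights = {"us-east-1": 0.5, "us-west-2": 0.25, "eu-west-1": 0.25}
print(evacuate_region(weights, "us-east-1"))
# the two remaining regions split the evacuated traffic between them
```

This is why the Reserved Instances matter: the surviving regions suddenly absorb a much larger share of traffic, and the capacity to serve it has to already be guaranteed.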
Here’s a roundup of some of the other key points from Edberg’s talk:
Powered by Cloud (and a Huge CDN): Netflix has become the poster child for Amazon Web Services and the cloud-driven company. But much of its content is served from data centers. “We say we run everything in the cloud, but that’s really just the control plane,” said Edberg. “All the video bits are coming from a CDN we’ve built. We have servers running all around the world in remote data centers.” The Netflix CDN, known as Open Connect, is housed in 21 data centers around the world – including facilities from Equinix, Telecity, Telx, Telehouse, CoreSite, Verizon/Terremark and Global Crossing – as well as ISPs and networks. Proxy services handle the “conversations” between AWS and the data centers. Open Connect uses a 4U appliance designed and built by Netflix, using components from Supermicro, Intel, Hitachi and Seagate.
DevOps At Netflix: The developer teams at Netflix deploy upwards of 100 releases per day. The company follows a “DevOps” model in which developers both write and deploy code. You build it, deploy it, and if you break it, you fix it, said Edberg. “We hire responsible adults and trust them to do what they’re supposed to be doing,” he said. “It works pretty well. The developers get to deploy into production whenever they want. If something breaks, you also have to fix it, even if it’s 4 am.”
Not that this process doesn’t get interesting at times. “(Developers) are good at knowing the risk to their service,” said Edberg. “One of the downsides with this distributed infrastructure is that you may not always know how your changes will affect downstream or upstream dependencies.” While many services are limited in their scope – and hence the amount of trouble a wayward deployment can create – a configuration tool known as Fast Properties allows developers to broaden the scope of their system changes. “A non-trivial amount of outages are due to Fast Properties changes where someone deploys globally or doesn’t understand a dependency,” said Edberg. “We’re trying to make it smarter.”
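The risk Edberg describes comes from scoping: a property change can target a single instance or apply everywhere. The sketch below shows one plausible way scoped resolution can work, where the most specific matching scope wins; the shapes and names here are assumptions for illustration, not the actual Fast Properties API:

```python
def resolve(props, context):
    """Return the value whose scope matches the context most specifically.

    `props` is a list of (scope_dict, value) pairs. A scope matches when
    every key it constrains agrees with the context; more constrained
    scopes win. A global property (empty scope) matches everything --
    which is how one bad global push can reach every server at once.
    """
    best, best_rank = None, -1
    for scope, value in props:
        if all(context.get(k) == v for k, v in scope.items()):
            if len(scope) > best_rank:
                best, best_rank = value, len(scope)
    return best

props = [
    ({}, "default"),                        # global: applies everywhere
    ({"region": "us-west-2"}, "regional"),  # scoped to a single region
]
print(resolve(props, {"region": "us-west-2"}))  # regional
print(resolve(props, {"region": "eu-west-1"}))  # default
```

The hazard is visible even in this toy: pushing a change with an empty scope silently overrides behavior in every region at once.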
Redundancy and the Rule of Three: “We never ever save data on a single machine,” said Edberg. “We always try to make sure we have three of everything. We’re going to ask you to run things in three availability zones, so they run in three data centers.” The company’s Cassandra database architecture runs in three different regions.
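A toy version of the rule of three can be sketched as replica placement across distinct zones. In practice Netflix gets this from Cassandra’s replication across three availability zones; the function below is only an illustration of the property that matters, that no single zone holds the only copy:

```python
def place_replicas(zones, key, n=3):
    """Assign n replicas of a key to distinct zones, walking round-robin
    from a position derived from the key. A toy model of 'three of
    everything'; the real mechanism is Cassandra's replication strategy.
    """
    start = hash(key) % len(zones)
    return [zones[(start + i) % len(zones)] for i in range(n)]

zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
replicas = place_replicas(zones, "user:42")
print(replicas)  # three distinct zones: losing any one zone loses no data
```

With three replicas in three zones, any single zone (or data center) can disappear and every piece of data still has two live copies.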
For all of Netflix’s technical accomplishments, Edberg noted that its business model creates a challenge: the actual cost of downtime is hard to calculate. The company’s revenue is based on monthly subscriptions, rather than daily or hourly transactions. Cancellations are the key metric, and they can’t be neatly attributed to downtime.
Want to use the Netflix tools mentioned in this article? Check out Netflix on GitHub to see open source versions of some of these tools.