Just as there are many reasons to move applications from internal data centers into the cloud, there are many reasons to move the other way. The recent migration of Instagram’s entire backend stack from Amazon Web Services’ public cloud into Facebook’s data centers was a good example of the latter.
As they made the move, Instagram engineers hit several unexpected snags and had to devise a few sophisticated workarounds. Their story is a good reminder that workload mobility in the cloud remains a difficult problem that cloud service providers and the huge ecosystem of vendors building solutions for cloud infrastructure users have yet to solve.
In 2012, following the social networking giant’s acquisition of the online photo sharing service, Facebook founder and CEO Mark Zuckerberg indicated that Instagram would eventually take advantage of Facebook’s engineering resources and infrastructure.
Not as easy as it first seemed
The team decided to migrate in order to make integration with internal Facebook systems easier and to use the tooling Facebook’s infrastructure engineering team had built to manage its large-scale server deployments. Following the acquisition, the engineering team found a number of integration points with Facebook’s infrastructure that they thought could help accelerate product development and increase security.
The migration project did not turn out to be as straightforward as one would expect. “The migration seemed simple enough at first: set up a secure connection between Amazon’s Elastic Compute Cloud (EC2) and a Facebook data center and migrate services across the gap piece by piece,” Instagram engineers Rick Branson, Pedro Canahuati and Nick Shortway wrote in a blog post published Thursday.
Forced to move to private Amazon cloud first
But they quickly learned that it was not quite so simple. The main problem at this first stage was that Facebook’s private IP space conflicted with EC2’s IP space. The solution was to move the stack into Amazon’s Virtual Private Cloud first and then migrate to Facebook using Amazon Direct Connect.
Direct Connect is a service Amazon provides at colocation data centers. It is essentially a direct private network pipe between a customer’s servers and Amazon’s public cloud. Targeted primarily at enterprises, it is designed to bypass the public Internet to avoid performance and security issues.
“Amazon’s VPC offered the addressing flexibility necessary to avoid conflicts with Facebook’s private network,” the engineers wrote.
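The kind of address conflict described above can be checked mechanically before a migration. The following is a minimal sketch using Python’s standard `ipaddress` module. The function name `find_conflicts` and all of the address ranges are hypothetical placeholders chosen for illustration, not the actual Facebook or Amazon address plans.

```python
import ipaddress


def find_conflicts(ranges_a, ranges_b):
    """Return every (a, b) pair of CIDR blocks that share at least one address."""
    conflicts = []
    for a in ranges_a:
        net_a = ipaddress.ip_network(a)
        for b in ranges_b:
            net_b = ipaddress.ip_network(b)
            if net_a.overlaps(net_b):
                conflicts.append((a, b))
    return conflicts


# Hypothetical address plans, invented for this example.
DATA_CENTER_RANGES = ["10.0.0.0/8"]
PUBLIC_CLOUD_PRIVATE_RANGES = ["10.32.0.0/11"]  # provider-assigned, not configurable
VPC_RANGES = ["172.20.0.0/16"]                  # chosen by the customer

# The provider-assigned range collides with the data center range ...
print(find_conflicts(DATA_CENTER_RANGES, PUBLIC_CLOUD_PRIVATE_RANGES))
# ... while a customer-chosen VPC range can be picked to avoid it.
print(find_conflicts(DATA_CENTER_RANGES, VPC_RANGES))
```

In this sketch the first check reports an overlap and the second reports none, which mirrors why a network with customer-controlled addressing was needed as an intermediate stop.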
EC2 not exactly best buds with Amazon’s VPC
But moving applications from Amazon’s public cloud infrastructure into a private cloud is also not as simple as it sounds. Instagram had many thousands of EC2 instances running, with more spinning up daily. To minimize downtime and simplify operations as much as possible, the team wanted EC2 and VPC instances to act as instances on the same network – and therein lay the problem.
“AWS does not provide a way of sharing security groups nor bridging private EC2 and VPC networks,” they wrote. “The only way to communicate between the two private networks is to use the public address space.” They took to Python and ZooKeeper to write a “dynamic IP table manipulation daemon” called Neti, which provided the security group functionality they needed and a single address for every instance, regardless of which cloud it was running in.
After about three weeks, the migration into the private cloud was complete, which the three engineers claim was the fastest VPC migration of this scale ever. The stack was ready for departure to its next destination: Facebook data centers.
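To make the idea of a daemon that rewrites firewall tables from a shared registry more concrete, here is a rough sketch in Python. It is not Neti’s actual code or rule set. The `Instance` class, the `build_rules` function, the field names, and the addresses are all assumptions made for illustration, and a plain list stands in for whatever the real registry in ZooKeeper held. The sketch only builds `iptables` command lines and does not run them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    stable_ip: str     # hypothetical single address that peers always use
    reachable_ip: str  # address that actually routes to the instance today


def build_rules(registry):
    """Build iptables commands (as argument lists) for every known instance.

    For each instance, one NAT rule rewrites traffic sent to the stable
    address so it goes to the currently reachable address, and one filter
    rule accepts inbound traffic from that instance, approximating a
    security group. A default-deny policy is assumed to be set elsewhere.
    """
    rules = []
    for inst in sorted(registry, key=lambda i: i.name):
        rules.append([
            "iptables", "-t", "nat", "-A", "OUTPUT",
            "-d", inst.stable_ip,
            "-j", "DNAT", "--to-destination", inst.reachable_ip,
        ])
        rules.append([
            "iptables", "-A", "INPUT",
            "-s", inst.reachable_ip,
            "-j", "ACCEPT",
        ])
    return rules


# Hypothetical registry contents using documentation address ranges.
registry = [
    Instance("web-1", "192.168.0.5", "203.0.113.10"),
    Instance("db-1", "192.168.0.6", "198.51.100.20"),
]

for rule in build_rules(registry):
    print(" ".join(rule))
```

A real daemon would also have to watch the registry for changes and replace stale rules atomically, which this sketch leaves out.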
Linux containers make custom tools portable
This step of the process was made more complex because the Instagram team wanted to keep all the management tools it had built for its production systems while running on EC2. These were things like configuration management scripts, Chef for provisioning and a tool called Fabric, which did everything from application deployment to database master promotion.
To port the tools into Facebook’s highly customized Linux-based environment, the team enclosed all of its provisioning tools in Linux Containers, which is how they now run on Facebook’s homegrown servers. “Facebook provisioning tools are used to build the base system, and Chef runs inside the container to install and configure Instagram-specific software,” they wrote.
One migration wiser
A project like this does not end without the team learning a thing or two, and the Instagram team walked away with a few takeaways. Some of the newly gained wisdom is to plan to change as little as possible to support the new environment; go for “crazy” ideas, such as Neti, because they might just work; make your own tools to avoid unexpected “curveballs”; and reuse familiar concepts and workflows to keep complexity to a minimum.