“Undifferentiated heavy lifting.” This is how Amazon Web Services has characterized the work normally conducted by IT departments in enterprise data centers since the launch of its public cloud service.
Move to the cloud and “you get to take that scarce resource — your software development engineers — and instead of having them work on the undifferentiated heavy lifting of infrastructure, you get to work on what differentiates your business,” remarked AWS CEO Andy Jassy during his company’s 2015 re:Invent keynote. “And then you get to deploy your application on worldwide infrastructure.”
Earlier that year, Dropbox — a cloud data storage service founded on AWS — launched its effort to move in the other direction. It foresaw a time when it could no longer scale its services in sync with customer demand, despite Jassy’s promises that Amazon’s resources were effectively infinite. Between February and October of 2015, Dropbox successfully relocated 90 percent of an estimated 600 petabytes of its customer data to its in-house network of data centers, dubbed Magic Pocket.
“It was clear to us from the beginning that we’d have to build everything from scratch,” wrote Dropbox infrastructure VP Akhil Gupta on his company blog in 2016, “since there’s nothing in the open source community that’s proven to work reliably at our scale. Few companies in the world have the same requirements for scale of storage as we do.”
Statements like that provoked some to see Dropbox’s reverse migration as a kind of one-off event. Unlike the everyday enterprise in an industry like petroleum, healthcare, or insurance, Dropbox had an interest in directly engineering service advantages into its own system. Its move out of the cloud was not widely perceived as a journey from which the rest of the IT world could learn any pertinent lessons.
Five years later, many more organizations, including those outside the IT industry, are discovering that the public cloud’s supposedly infinite scalability does have limits for them. There could be valid reasons for, to put it romantically, a journey home. Now the Magic Pocket journey is starting to look more like a pioneer expedition.
“Dropbox made a plan. We looked at what our capacity is, what we anticipated our growth to be,” Latane Garetson, Dropbox’s head of data center physical infrastructure, told Data Center Knowledge. As is customary, the team built a model for data center capacity planning.
That Dropbox went through this exercise shouldn’t come as a shock to anyone who reads a site like DCK on a regular basis. But the company’s approach to capacity planning today might be a surprise. Garetson’s team operates in granular, near-term planning windows, often six months, sometimes three, much shorter than common practice.
“It’s kind of a backwards approach,” he admitted. They still have their annual growth forecasts, but those forecasts are updated frequently throughout the year, based on communication with the software team and anticipated lead times (“how long it takes to buy capacity, and when we can actually land capacity and get it available”).
“It’s continuing to work with our software group, and simplify everything within Dropbox internally: saying, ‘What are we expecting this year?’ and do a capacity model forecast that is always continuing to be updated — monthly, weekly, quarterly. The data center team is always integrated into that process.”
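The rolling forecast Garetson describes — a capacity model re-checked against procurement lead times — can be sketched in a few lines. This is an illustrative model only; the class, field names, and figures below are assumptions for the sake of example, not Dropbox’s actual tooling.

```python
from dataclasses import dataclass

@dataclass
class CapacityForecast:
    """Hypothetical rolling capacity model; names and numbers are illustrative."""
    current_pb: float           # storage capacity currently deployed, in petabytes
    used_pb: float              # capacity currently consumed by customers
    growth_pb_per_month: float  # latest growth estimate from the software team
    lead_time_months: int       # how long it takes to buy and land new capacity

    def months_until_full(self) -> float:
        # Headroom divided by growth rate gives the runway remaining.
        headroom = self.current_pb - self.used_pb
        return headroom / self.growth_pb_per_month

    def must_order_now(self) -> bool:
        # If the runway is shorter than one procurement lead time,
        # a new build has to start immediately.
        return self.months_until_full() <= self.lead_time_months

forecast = CapacityForecast(current_pb=600, used_pb=540,
                            growth_pb_per_month=15, lead_time_months=3)
print(forecast.months_until_full())  # 4.0 months of runway
print(forecast.must_order_now())     # False: runway still exceeds the lead time
```

The point of re-running such a check monthly or weekly, as the quote above suggests, is that `growth_pb_per_month` is a living number: when the software team revises it upward, the order date moves up with it.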
For many IT shops, rollout starts with acquiring or building hardware, staging it, deploying it, and then integrating it into service clusters and loading it with software. Dropbox’s approach works in the other direction, Garetson told us. The software team analyzes the active capacity forecasts first and devises a template for the servers its data centers in various metropolitan areas will require. These developers do the “paperwork” engineering, including configuration of services and server buildouts.
Like customers placing orders at a drive-thru window, these configurations are then delivered to the hardware teams in sequence, and the hardware is produced to the software team’s specifications. Since the software team has already produced the configuration plan, new servers are effectively self-configuring, requiring no configuration work from hardware engineers.
“The levers that we consider are our build times,” said Garetson. “We started out in 2015 with a six-month build time. We’ve moved it to three months now… We don’t take the yearly forecast and build off of that. We take the yearly and have a plan, but also have these critical milestones where, if the forecast changes, we can adjust it, because we only have a three-month build time.”
Shorter build times are easier for Dropbox to control. Rather than delaying a build for some unexpected reason, it can be cancelled and a new plan launched in its place, without disruption. Taking a cue from how microservices orchestrators schedule distributed software components — cutting them off when they don’t respond — Dropbox has discovered it can be more flexible about its capacity goals when it sticks to a plan that is always on schedule rather than partially or entirely delayed.
The team closely monitors changes in demand for physical data center capacity at each of its sites. They model inventory on hand at any given time within a current window and when new capacity can be made available. Using this method, Dropbox never plans for capacity goals in the following year. Like building a massive wall one brick at a time, it focuses only on the near-term, planning for no longer than six months at a time.
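The windowing discipline described above — only builds landing inside the current window are actionable; anything later remains a revisable forecast — can be sketched as a simple filter. The six-month window comes from the article; the data shape, site codes, and dates are hypothetical.

```python
from datetime import date, timedelta

WINDOW_MONTHS = 6  # "no longer than six months at a time", per the article

def actionable_builds(planned, today, window_months=WINDOW_MONTHS):
    """Filter a build plan down to what lands inside the planning window.

    `planned` is a list of (site, land_date, racks) tuples -- an assumed
    shape for illustration, not an actual Dropbox data structure.
    """
    horizon = today + timedelta(days=30 * window_months)
    return [b for b in planned if b[1] <= horizon]

plan = [
    ("SJC", date(2021, 3, 1), 10),
    ("DFW", date(2021, 5, 15), 8),
    ("IAD", date(2021, 11, 1), 12),  # beyond the window: forecast only
]
print(actionable_builds(plan, today=date(2021, 1, 4)))
```

Under this scheme the far-out build is never committed to; it stays a plan entry that can be revised or cancelled as the window rolls forward, which is what makes cancel-and-replace cheap.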
As in most storage-focused data centers, rack density in Dropbox’s facilities has increased in the last five years. That has enabled its total rack footprint to shrink, while growing customer storage capacity at petabyte scale. “Now we’re able to have a little more physical space,” Garetson said. “Where we would have to plan for 100 cabinets, now we can plan for 30.”
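Garetson’s footprint figures imply the size of the density gain: holding planned capacity constant, shrinking from 100 cabinets to 30 means each cabinet now carries a bit over three times the storage. A quick check of that arithmetic:

```python
# Footprint figures from the article; the per-cabinet gain is derived.
old_cabinets = 100
new_cabinets = 30
density_gain = old_cabinets / new_cabinets
print(f"{density_gain:.1f}x storage per cabinet for the same planned capacity")
# roughly a 3.3x per-cabinet density increase
```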
There is nothing so peculiar about this method of dynamic growth and capacity modeling that it should only fit Dropbox or other cloud storage firms. Up to now, most enterprise data centers have been designed in terms of the space and power allocated to critical services and applications. Dropbox instead calculates how much critical service it will provide as a growth vector, only the first six months of which are actionable. If an organization became capable of quantifying its IT service output, as Dropbox does, it could devise a similar planning model for itself. In so doing, it may find that it can control costs while keeping its services on-premises, avoiding, as Dropbox did, ever reaching the breaking point of cloud-service affordability.