Once a cloud services company reaches a certain size, web-scale data center economics start to make a lot of sense, and it looks like Salesforce is the latest major service provider to have crossed that threshold.
The company is going through a “massive transformation” of the way it runs infrastructure, according to TJ Kniveton, VP of infrastructure engineering at Salesforce. It is moving from lots of specialized custom server specs and manual configuration work to the approach used by web-scale data center operators like Google and Facebook: standardizing on bare-bones servers and implementing sophisticated data center automation tools, many of them built in-house.
Kniveton described the transition while sitting on a panel with infrastructure technologists from Google and Joyent at the recent DCD Internet conference in San Francisco.
His team is looking at all the infrastructure innovation web-scale data centers have done over the past 10 years or so and adapting a lot of it for its own needs. The shift is a wholesale redefinition of the relationships between data center providers (Salesforce uses data center providers as opposed to building its own), hardware and software vendors, and the developers who build software that runs on top of that infrastructure, Kniveton said.
One big change is transitioning to a single server spec as opposed to having a different server configuration for every type of application. When Microsoft announced it was joining Facebook’s Open Compute Project, the open source data center and hardware design initiative, early last year, it kicked off a similar change in strategy, standardizing on a single server design across its infrastructure to leverage economies of scale.
Facebook’s approach is slightly different. While it has standardized servers to a high degree, it uses several different configurations based on the type of workload each server processes.
Kniveton wants to take standardization further at Salesforce by settling on a single spec. He did not provide details about the design, but said there were lots of benefits to cutting down to one configuration.
Another big change is relying much more on software for things like reliability and general server management. Like web-scale operators, Salesforce is going to rely on software to make its applications resilient rather than ensuring each individual piece of hardware runs around the clock without incident.
“I’m doing a lot more software development now than we’ve ever done before,” Kniveton said.
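As a rough illustration of what making applications resilient in software can mean, the sketch below retries a request on another replica when one fails, instead of depending on any single machine staying up. It is not Salesforce’s code; the names `call_with_failover`, `AllReplicasFailed`, `broken_replica` and `healthy_replica` are hypothetical and exist only for this example.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when every replica has been tried and none answered."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # a failed server is expected, not fatal
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed for request {request!r}")


# Hypothetical stand-ins for two servers running the same service.
def broken_replica(request: str) -> str:
    raise ConnectionError("server unreachable")


def healthy_replica(request: str) -> str:
    return f"ok:{request}"


if __name__ == "__main__":
    # The first server is down, yet the caller still gets an answer.
    print(call_with_failover([broken_replica, healthy_replica], "ping"))
```

The design choice here is that hardware failure is treated as a normal event the software routes around, which is what allows the servers themselves to be cheap and uniform.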
Automation: the Glue Between Applications and Infrastructure
Much of that software work goes into data center automation, so that computers can take over work that human system administrators do manually. Kniveton’s ultimate goal is not just to automate simple tasks but to create autonomous systems that configure themselves to provide the best possible infrastructure for the application at hand.
The efforts at Salesforce rely to some extent on open source technologies, but his team found that not everything they need is open source. “There are building blocks out there,” he said, but the team still has to create a lot of technology in-house, and he hopes to open source some of it in the future.
Data center automation is crucial to the web-scale approach. As Geng Li, Google’s CTO of enterprise infrastructure, put it, automation is the glue that keeps everything together. “It’s not just a set of technologies you buy from a vendor or a set of vendors,” Li said. It’s about having a software-oriented operations team to “glue” the workload the data center is supporting to the infrastructure.
Automation enables a single admin to manage thousands of servers, which is the only way to manage infrastructure at such scale. There are no sys admins at Google data centers, Li said. Instead, there is a role called Site Reliability Engineer. “Those guys are software developers,” he said. Such an engineer receives a service to support, and it’s her or his responsibility to automate the infrastructure to properly support that service.
Automation also helps increase utilization rates of the infrastructure. It enables virtualization or abstraction of the physical pieces and creation of virtual pools of resources that can be used by applications. All flash memory capacity available in a cluster of servers, for example, can be treated as a single flash resource and carved up as such, as opposed to individual applications using some flash resources on certain servers, leaving a lot of free capacity idling.
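The pooling idea can be sketched in a few lines. The example below is an illustration under stated assumptions, not a description of any panelist’s system: the `Server` and `FlashPool` classes, the capacities, and the "largest free space first" placement rule are all hypothetical. It shows how a request too large for any one server’s free flash can still be satisfied from the cluster’s combined free capacity.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Server:
    name: str
    flash_gb: int      # total flash installed on this server
    used_gb: int = 0   # flash already claimed by applications

    @property
    def free_gb(self) -> int:
        return self.flash_gb - self.used_gb


class FlashPool:
    """Treats the flash on many servers as one allocatable resource."""

    def __init__(self, servers: Iterable[Server]) -> None:
        self.servers = list(servers)

    @property
    def free_gb(self) -> int:
        return sum(s.free_gb for s in self.servers)

    def allocate(self, size_gb: int) -> Dict[str, int]:
        """Carve size_gb out of the pool; return how much came from each server."""
        if size_gb > self.free_gb:
            raise ValueError("not enough free flash in the pool")
        placement: Dict[str, int] = {}
        remaining = size_gb
        # Assumed policy: draw from the servers with the most free space first.
        for server in sorted(self.servers, key=lambda s: s.free_gb, reverse=True):
            if remaining == 0:
                break
            take = min(server.free_gb, remaining)
            if take > 0:
                server.used_gb += take
                placement[server.name] = take
                remaining -= take
        return placement


if __name__ == "__main__":
    pool = FlashPool([
        Server("a", 100, used_gb=60),
        Server("b", 100, used_gb=70),
        Server("c", 100, used_gb=80),
    ])
    # No single server has 80 GB free, but the pool as a whole does.
    print(pool.allocate(80))
```

Without pooling, the leftover space on each of the three servers would sit idle for any application needing more than 40 GB; with pooling, that scattered capacity becomes usable.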
The Risk of Ripple Effect
The bigger and more automated the system, the greater the impact when something goes wrong. In a highly automated system where everything is interconnected, a single software bug can cascade in ways its developers did not anticipate, causing widespread service outages.
Bryan Cantrill, CTO of the cloud provider Joyent and the third panelist, warned about the dangers of too much automation, where a single mistake can have a disastrous ripple effect across the entire infrastructure. “You are replacing the fat finger of the data center operator with the fat finger of the software developer,” he said.
Kniveton acknowledged the risk, saying the automated approach means more thought needs to go into avoiding scenarios where the effects of a single mistake can be greatly magnified. “With great power comes great responsibility,” he said.