If you’re a Google software developer working on the company’s products, the company not only wants you to know how its global-scale infrastructure operates, it wants you to run that infrastructure yourself.
In an unusual practice, Google runs a program in which product development engineers spend six months on the team that operates its infrastructure, a global network of company-owned and leased data centers.
The program is called Mission Control, a name borrowed from NASA. Its goal is to have more Google engineers understand what it’s like to build and operate a high-reliability service at Google’s massive scale, according to a Monday post on the Google Cloud Platform Blog by an engineer who is about to begin his six-month Mission Control stint in Seattle.
When they are embedded on the Google infrastructure team, the engineers find that the people working there speak the same language. As Google has been explaining at conferences in recent years, it doesn’t have sysadmins. Instead, the company uses software engineers to design and run the software that operates the infrastructure inside its data centers, because it believes they are better at it.
“It turns out services run better when people who understand software also run it,” Melissa Binde, Google’s director of Site Reliability Engineering, said during a presentation at the company’s GCP Next conference in San Francisco in March. “They have a deep understanding of what makes it tick; they have a deep understanding of the interactions.”
These engineers hold the title Site Reliability Engineer (SRE), a role Google created to describe a software engineer who designs and runs infrastructure. In a way, SREs are Google’s answer to DevOps, which also seeks to resolve the conflicting goals of sysadmins and developers in companies with a traditional organizational structure.
The problem with DevOps, Binde said, is that it means different things to different people. Site Reliability Engineering, by contrast, is defined precisely enough that Google published a book on it earlier this year, aptly titled Site Reliability Engineering.
When developers and sysadmins are divided into separate groups, each with its own culture, they are not incentivized to help one another; they’re incentivized to do the opposite.
As Binde put it, developers get “cookies” for releasing new features, while sysadmins get cookies for maintaining uptime. The more frequently new features come out, the harder it is to maintain uptime.
“The sysadmins will get more cookies if they can prevent features from going out, and the developers will get more cookies if they can figure a way around the sysadmin’s rules,” she said.
Each camp devises its own tactics. Developers may call a new feature “beta,” which often means it can be released faster, without going through the rigorous sysadmin process for testing features before they launch in production. Sysadmins, meanwhile, demand launch reviews and stretch them out as long as possible to delay deployment. Cumulatively, these efforts result in stalled progress, or, as Binde put it, long periods of cookie-less sadness for everyone.