Enterprises that sign up for Google’s cloud services will now have the choice to submit their software development and IT operations teams to the same level of operational rigor Google submits its own engineers to.
The company on Monday revealed more details about a new approach to cloud customer support it announced last week, created to help alleviate customers' anxiety about giving up control of their infrastructure to a cloud provider. It will embed its own experts on cloud customers’ teams to help them deploy and run applications in Google’s cloud data centers in the most reliable way possible.
The services will include shared paging (when things go wrong), auto-creation and escalation of priority-one tickets, participation in customer “war rooms,” and Google-reviewed design and production systems.
The company will not charge a penny for what amounts to extremely hands-on professional services, but it doesn’t expect every customer to opt for them, given the level of commitment required on the customer’s part.
Dave Rensin, Google’s director of Customer Reliability Engineering:
"This program won’t be for everyone. In fact, we expect that the overwhelming majority of customers won’t participate because of the effort involved. We think big enterprises betting multi-billion dollar businesses on the cloud, however, would be foolish to pass this up. Think of it as a de-risking exercise with a price tag any CFO will love."
Google has formed a new team to support this capability, called Customer Reliability Engineering. The title is a variation on Site Reliability Engineering, a concept created at Google years ago to describe software engineers responsible for building and operating Google’s global infrastructure. The company doesn’t differentiate between software development and IT and in fact prefers to have developers run infrastructure, assuming it’s a job handled better by people with deep understanding of software.
CREs will work with customers’ dev teams the same way SREs work with developers at Google. There is a set of ground rules both sides agree to commit to. SREs accept the responsibility for maintaining uptime and healthy operation of a system if:
- The system (as developed) can pass a strict inspection process — known as a Production Readiness Review (PRR)
- The development team who built the system agrees to maintain critical support systems (like monitoring) and be active participants in key events like periodic reviews and postmortems
- The system does not routinely blow its error budget
In a way, “error budget” is a different name for availability requirements, or SLAs, such as 99.9 or 99.999 percent uptime: the budget is the amount of unavailability the target still permits. To make Google developers on the product side responsible for reliability, SREs give them an error budget. Once the developers blow it, they have to spend all of their engineering time writing code that fixes the uptime problem they caused and makes the system more stable overall.
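To make the arithmetic behind an error budget concrete, here is a minimal Python sketch that converts an availability target into allowed downtime. The 30-day window and the function names are assumptions made for illustration; the article does not say how Google measures its budgets.

```python
def error_budget_minutes(availability_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target permits over the window.

    The 30-day default window is an illustrative assumption.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


def budget_remaining(availability_percent: float,
                     downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after observed downtime; a negative value means it is blown."""
    return error_budget_minutes(availability_percent, window_days) - downtime_minutes


if __name__ == "__main__":
    print(round(error_budget_minutes(99.9), 1))     # 43.2 minutes per 30 days
    print(round(error_budget_minutes(99.999), 3))   # 0.432 minutes per 30 days
    print(budget_remaining(99.9, 50) < 0)           # True: 50 minutes down blows the budget
```

Under these assumptions, a 99.9 percent target leaves roughly 43 minutes of downtime a month, while 99.999 percent leaves about 26 seconds.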
If the developers don’t hold up their side of the bargain, the SREs are free to “hand back the pagers,” meaning they are no longer committed to, say, coming to the rescue when something goes down at 3 a.m.
Google cloud customers that opt for working with the CRE team will have to agree to the same “social contract” in exchange for its services. Rensin:
"When a customer fails to keep up their end of the work with timely bug fixes, participation in joint postmortems, good operational hygiene etc., we'll 'hand back the pagers' too."