How Google Routes Around Outages
Making changes to Google’s search infrastructure is akin to “changing the tires on a car while you’re going at 60 down the freeway,” according to Urs Holzle, who oversees the company’s massive data center operations. Google updates its software and systems on an ongoing basis, usually without incident. But not always. On Feb. 24 a bug in the software that manages the location of Google’s data triggered an outage in Gmail, the widely used webmail component of Google Apps.
Just a few days earlier, Google’s services remained online during a power outage at a third-party data center near Atlanta where Google hosts some of its many servers. Google doesn’t discuss operations of specific data centers. But Holzle, the company’s Senior Vice President of Operations and a Google Fellow, provided an overview of how Google has engineered its system to manage hardware failures and software bugs. Here’s our Q-and-A:
Data Center Knowledge: Google has many data centers and distributed operations. How do Google’s systems detect problems in a specific data center or portion of its network?
Urs Holzle: We have a number of best practices that we suggest to teams for detecting outages. One way is cross monitoring between different instances. Similarly, black-box monitoring can determine if the site is down, while white-box monitoring can help diagnose smaller problems (e.g. a 2-4% loss over several hours). Of course, it’s also important to learn from your mistakes, and after an outage we always run a full postmortem to determine if existing monitoring was able to catch it, and if not, figure out how to catch it next time.
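To make the black-box versus white-box distinction concrete, here is a minimal Python sketch. It is an illustration only, not Google's tooling: the names `black_box_up` and `LossRateMonitor`, the window size, and the 2% threshold (borrowed from the "2-4% loss" example above) are all assumptions.

```python
from collections import deque


def black_box_up(probe_results):
    """Black-box view: the site counts as up if any recent external probe succeeded."""
    return any(probe_results)


class LossRateMonitor:
    """White-box view: watch internal failure counters over a sliding window."""

    def __init__(self, window, threshold):
        # Each sample is a (failed_requests, total_requests) pair for one interval.
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def record(self, failed, total):
        self.samples.append((failed, total))

    def loss_rate(self):
        failed = sum(f for f, _ in self.samples)
        total = sum(t for _, t in self.samples)
        return failed / total if total else 0.0

    def should_alert(self):
        # Require a full window so a single bad interval does not page anyone.
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and self.loss_rate() >= self.threshold
```

A black-box probe would still report the site as up during a sustained 3% loss, while the white-box monitor would flag it once the window fills.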
DCK: Is there a central Google network operations center (NOC) that tracks events and coordinates a response?
Urs Holzle: No, we use a distributed model with engineers in multiple time zones. Our various infrastructure teams serve as “problem coordinators” during outages, but this is slightly different from a traditional NOC, as the point of contact may vary based on the nature of the outage. On-call engineers are empowered to pull in additional resources as needed. We also have numerous automated monitoring systems, built by various teams for their products, that directly alert an on-call engineer if anomalous issues are detected.
DCK: How much of Google’s ability to “route around” problems is automated, and what are the limits of automation?
Urs Holzle: There are several different layers of “routing around” problems – a failing Google File System (GFS) chunkserver can be routed around by the GFS client automatically, whereas a datacenter power loss may require some manual intervention. In general, we try to develop scalable solutions and build the “route around” behavior into our software for problems with a clear solution. When the interactions are more complex and require sequenced steps or repeated feedback loops, we often prefer to put a human hand on the wheel.
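The client-side "route around" behavior described for a failing chunkserver can be sketched roughly as follows. This is a generic failover loop in Python, not the GFS client API: `read_with_failover`, `ReplicaUnavailable`, and `FakeServer` are hypothetical names introduced only for illustration.

```python
class ReplicaUnavailable(Exception):
    """Raised by a (hypothetical) storage server that cannot serve a read."""


def read_with_failover(replicas, chunk_id):
    """Try each replica in order and return the first successful read."""
    for server in replicas:
        try:
            return server.read(chunk_id)
        except ReplicaUnavailable:
            continue  # this replica is failing, so route around it
    raise ReplicaUnavailable(f"all {len(replicas)} replicas failed for {chunk_id!r}")


class FakeServer:
    """Stand-in for a storage server, used only to exercise the sketch."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def read(self, chunk_id):
        if not self.healthy:
            raise ReplicaUnavailable(self.name)
        return f"{chunk_id}@{self.name}"
```

The caller never sees the failed replica; it only sees an error if every copy is unreachable, which is the point at which a human might need to step in.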
DCK: How might a facility-level data center power outage present different challenges than more localized types of reliability problems? How does Google’s architecture address this?
Urs Holzle: The Google within-datacenter infrastructure (GFS, machine scheduling, etc.) is generally designed to manage machine-specific outages transparently, and rack/machine group outages as long as the mortality is a fraction of the total pool of machines. For example, GFS prefers to store replicated copies of data on machines on different racks so that the loss of a rack may create a performance degradation but won’t lose data.
Datacenter level and multi-region unplanned outages are infrequent enough that we use manual tools to handle them. Sometimes we need to build new tools when new classes of problems happen. Also, teams regularly practice failing out of or routing around specific datacenters as part of scheduled maintenance.
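The idea of spreading replicas across racks can be illustrated with a small Python sketch. It is a simplified assumption, not GFS's actual placement policy: the interview says GFS "prefers" different racks, whereas this version strictly requires them, and the function names and the default of three copies are hypothetical.

```python
def place_replicas(machines_by_rack, copies=3):
    """Pick one machine from each of `copies` different racks.

    machines_by_rack maps a rack name to the list of machines in that rack.
    """
    # Only racks that currently have machines are candidates.
    racks = [rack for rack, machines in sorted(machines_by_rack.items()) if machines]
    if len(racks) < copies:
        raise ValueError("not enough distinct racks for the requested copies")
    return [(rack, machines_by_rack[rack][0]) for rack in racks[:copies]]


def survives_rack_loss(placement, lost_rack):
    """True if at least one replica lives outside the lost rack."""
    return any(rack != lost_rack for rack, _ in placement)
```

With every copy on a different rack, losing any single rack removes at most one replica, so reads may slow down but the data remains available.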
DCK: A “Murphy” question: Given all the measures Google has taken to prevent downtime in its many services, what are some of the types of problems that have actually caused service outages?
Urs Holzle: Configuration issues and rate of change play a pretty significant role in many outages at Google. We’re constantly building and re-building systems, so a trivial design decision six months or a year ago may combine with two or three new features to put unexpected load on a previously reliable component. Growth is also a major issue – someone once likened the process of upgrading our core websearch infrastructure to “changing the tires on a car while you’re going at 60 down the freeway.” Very rarely, the systems designed to route around outages actually cause outages themselves; fortunately, the only recent example is the February Gmail outage (Here’s the postmortem in PDF format).
DCK: How does Google respond to outages and integrate the “lessons learned” into its operations?
Urs Holzle: In general, teams follow a postmortem process when an outage occurs, and produce action items such as “monitor timeouts to X” or “document failover procedure and train on-call engineers”. Engineers from affected teams are also quite happy to ask for and supplement a post-mortem as needed. Human beings tend to be quite fallible, so if possible we like to write either a specific or a general automated monitoring rule to notice problems. This is true of both software/configuration problems and hardware/datacenter problems.
David
Posted March 25th, 2009
This sort of outage should not be tolerated – decently equipped IT people do not put up with this sort of downtime or unexpected spike in capacity load.
While this sort of thing is allowed at Google, those of us who work at major financial institutions do not allow the same fate to befall us, as it would often cost us our job or promotion.
Real geeks work at banks. They are essential to the US economy.
John
Posted March 26th, 2009
@David
>Real geeks work at banks. They are essential to the US economy.
Probably not the best example given the current climate.
Peter
Posted March 26th, 2009
> Real geeks work at banks.
No, real geeks don’t fit into that corporate culture.
David,
Your bank must be amazing, and use abacuses. I suggest you read some Tanenbaum to understand how computers work.
You cannot predict the unexpected, but you can have a plan for when unexpected things happen – which is what Google, and many others, do.
Banks are among the worst offenders for awful IT – I worked at one with £100bn in the payment systems, but no offsite DR, and another where budgets were so tight that untrained staff regularly caused outages.
I think the key thing with Google is that they have smart architects and quality ops people – so it’s designed well, and the niggles get sorted quickly and fixed so they shouldn’t recur.
Dom
Bob
Posted March 26th, 2009
>Real geeks work at banks. They are essential to the US economy
Haha, that’s funny. That one actually made me laugh a little.
Dillon
Posted March 26th, 2009
>Real geeks work at banks. They are essential to the US economy.
Most banks don’t build an IT infrastructure from the filesystem up. I’ll cut them a little slack; they are pushing the limits. When I start seeing the same issue time and time again, then it’s time to get all huffy.
BTW, how many times a day do you or your staff google for info to help maintain your company’s IT system?
Rick
Posted March 26th, 2009
@John
Yes, the financial crisis finds its roots in the IT departments of the banks. Perfect.
>Real geeks work at banks. They are essential to the US economy.
Is this because they were so busy being geeky that they failed to achieve a decent level of intelligence? I.e., those who can, do… those who can’t, work at banks?
March 25th, 2009