We’ve written many times about the breadth of Google’s data center infrastructure and its focus on reliability. So how does a widely used app like Gmail go down, as it has today? There have been a number of Gmail outages over the years, usually involving software updates or networking issues or, in some cases, a software update that caused a networking issue.
Google is acknowledging reports of issues, which appear to be global. “We’re investigating reports of an issue with GMail,” the company said on its status dashboard. “We will provide more information shortly.”
UPDATE: Google says today’s outage, which lasted from 20 to 50 minutes for users, was caused by a software bug.
“At 10:55 a.m. PST this morning, an internal system that generates configurations—essentially, information that tells other systems how to behave—encountered a software bug and generated an incorrect configuration,” VP of Engineering Ben Treynor explained. “The incorrect configuration was sent to live services over the next 15 minutes, caused users’ requests for their data to be ignored, and those services, in turn, generated errors. Users began seeing these errors on affected services at 11:02 a.m., and at that time our internal monitoring alerted Google’s Site Reliability Team. Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it; errors subsided rapidly starting at this time. By 11:30 a.m. the correct configuration was live everywhere and almost all users’ service was restored.”
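The intervals implied by the timestamps in that statement can be worked out with a few lines of Python. This is only an illustrative sketch: the event names and the placeholder calendar date are assumptions made for the example, and only the four clock times come from Google's statement.

```python
from datetime import datetime

# Placeholder calendar date (an assumption for illustration only).
# Only the clock times below are taken from Google's statement.
PLACEHOLDER_DAY = "2000-01-01"


def at(clock):
    """Parse an HH:MM clock time on the placeholder day."""
    return datetime.strptime(f"{PLACEHOLDER_DAY} {clock}", "%Y-%m-%d %H:%M")


# Hypothetical event names; times as quoted (PST).
events = {
    "bad_config_generated": at("10:55"),
    "errors_visible": at("11:02"),
    "good_config_generated": at("11:14"),
    "fully_restored": at("11:30"),
}


def minutes_between(start, end):
    """Whole minutes elapsed between two named events."""
    delta = events[end] - events[start]
    return int(delta.total_seconds() // 60)


# Time from the bad configuration being generated to users seeing errors.
detection_lag = minutes_between("bad_config_generated", "errors_visible")
# Time engineers spent debugging before the system corrected itself.
debugging = minutes_between("errors_visible", "good_config_generated")
# Time for the corrected configuration to reach every service.
rollout = minutes_between("good_config_generated", "fully_restored")
# Longest user-visible window implied by the quoted times.
user_visible = minutes_between("errors_visible", "fully_restored")

print(detection_lag, debugging, rollout, user_visible)
```

By these figures, the 12 minutes of debugging matches the statement, and the window from the first visible errors to full restoration is 28 minutes, which falls inside the 20-to-50-minute range Google reported.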
On at least four previous occasions, Gmail downtime has been traced back to software updates in which bugs triggered unexpected consequences. A pair of outages in 2009 involved routine maintenance in which bugs caused imbalances in traffic patterns between data centers, clogging some of the company’s legendary large pipes with traffic. That was the case in February 2009, when a software update overloaded some of Google’s European network infrastructure, causing cascading outages at its data centers in the region that took about an hour to get under control.
In Sept. 2009, Google underestimated the impact of a software update on traffic flow between network equipment, overloading key routers. The company addressed the problem by throwing more hardware at it, adding routers until the situation stabilized.
In a December 2012 outage, the culprit was once again a software update causing a networking issue, this time in Google’s load balancers. “A bug in the software update caused it to incorrectly interpret a portion of Google data centers as being unavailable,” Google reported.
Despite the sophistication of Google’s networks, updates sometimes bring surprises.
“Configuration issues and rate of change play a pretty significant role in many outages at Google,” Google data center exec Urs Hölzle told DCK in a 2009 interview. “We’re constantly building and re-building systems, so a trivial design decision six months or a year ago may combine with two or three new features to put unexpected load on a previously-reliable component. Growth is also a major issue – someone once likened the process of upgrading our core websearch infrastructure to ‘changing the tires on a car while you’re going at 60 down the freeway.’ Very rarely, the systems designed to route outages actually cause outages themselves.”
But don’t worry that Gmail might lose your data. In addition to storing multiple copies of customer data on disk-based storage, Google also backs up your data to huge tape libraries within its data centers. The company restored some customer data from tape in a 2011 outage, also caused by a software bug.