We've written many times about the breadth of Google's data center infrastructure and its focus on reliability. So how does a widely used app like Gmail go down, as it has today? There have been a number of Gmail outages over the years, usually involving software updates or networking issues, or in some cases a software update causing a networking issue.
The reports of a Gmail outage are widespread, but don't appear to be uniform. Some users are able to access their Gmail boxes (as I just did). Google is acknowledging reports of issues, but not really confirming them yet. "We're investigating reports of an issue with Google Mail," the company said on its status dashboard. "We will provide more information shortly."
UPDATE: As of 1:10 p.m. Eastern, Google says Gmail is back up. "The problem with Google Mail should be resolved. We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better."
On at least three occasions, Gmail downtime has been traced back to software updates in which bugs triggered unexpected consequences. A pair of outages in 2009 involved routine maintenance in which bugs caused imbalances in traffic patterns between data centers, causing some of the company's legendary large pipes to become clogged with traffic. That was the case in February 2009, when a software update overloaded some of Google's European network infrastructure, causing cascading outages at its data centers in the region that took about an hour to get under control.
In Sept. 2009, Google underestimated the impact of a software update on traffic flow between network equipment, overloading key routers. One element of that outage may offer clues to today's issues. In that event, the Gmail web interface was unavailable, even as access to IMAP and POP continued to work, which is also being reported with today's issues. It turns out that web traffic and IMAP/POP traffic use different routers. In the Sept. 2009 outage, Google addressed the problem by throwing more hardware at it, adding routers until the situation stabilized.
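For readers curious how an overload can cascade, and why adding routers can stop it, here is a minimal toy sketch in Python. It is not a model of Google's network. The function names, the even split of traffic across routers, and the rule that an overloaded router simply drops out are all assumptions made for illustration.

```python
def surviving_routers(total_load, capacities):
    """Return the capacities of routers still up after overloads cascade.

    Toy model (an assumption, not Google's design): traffic is split evenly
    across live routers, and any router whose share exceeds its capacity
    drops out, pushing its share onto the routers that remain.
    """
    live = sorted(capacities)
    while live:
        share = total_load / len(live)
        still_up = [c for c in live if c >= share]
        if len(still_up) == len(live):
            break  # every live router can carry its share, so the pool is stable
        live = still_up
    return live


def routers_needed(total_load, capacities, new_capacity):
    """Count the routers of new_capacity to add before no router drops out.

    Assumes every capacity, including new_capacity, is greater than zero.
    """
    added = 0
    while True:
        pool = capacities + [new_capacity] * added
        if len(surviving_routers(total_load, pool)) == len(pool):
            return added
        added += 1
```

In this toy model, a load of 100 spread over routers with capacities 30, 30, 30 and 20 takes down the whole pool: the smallest router fails first, and the three that remain cannot absorb its share. Calling `routers_needed(100, [30, 30, 30, 20], 30)` returns 1, meaning a single extra router is enough to keep every router within its capacity.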
Despite the sophistication of Google's networks, updates sometimes bring surprises.
"Configuration issues and rate of change play a pretty significant role in many outages at Google," Google data center exec Urs Holzle told DCK in a 2009 interview. "We’re constantly building and re-building systems, so a trivial design decision six months or a year ago may combine with two or three new features to put unexpected load on a previously-reliable component. Growth is also a major issue – someone once likened the process of upgrading our core websearch infrastructure to “changing the tires on a car while you’re going at 60 down the freeway.” Very rarely, the systems designed to route outages actually cause outages themselves."
But don't worry that Gmail might lose your data. In addition to storing multiple copies of customer data on disk-based storage, Google also backs up your data to huge tape libraries within its data centers. The company restored some customer data from tape in a 2011 outage, also caused by a software bug.