Google has published an update on this afternoon's Gmail downtime. "Today's outage was a Big Deal, and we're treating it as such," writes Ben Treynor, Google's VP of Engineering and Site Reliability Czar. "We're currently compiling a list of things we intend to fix or improve as a result of the investigation."
The problem? Treynor says Google underestimated the load that routine maintenance on "a small fraction" of Gmail servers would place on the routers supporting the application. "At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system 'stop sending us traffic, we're too slow!'," Treynor wrote. "This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded."
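The chain reaction Treynor describes can be illustrated with a toy model. This is a minimal sketch, not Google's routing logic: the function name `simulate_cascade`, the even split of traffic across routers, and the capacity figures are all assumptions made for illustration. Each router that receives more than it can handle drops out, and its share is spread across those that remain.

```python
def simulate_cascade(total_load, capacities):
    """Toy model of a cascading overload (hypothetical, for illustration only).

    Traffic is split evenly across the active routers. Any router whose
    share exceeds its capacity stops accepting traffic, which raises the
    share for the rest. Returns the number of active routers per round.
    """
    active = list(capacities)
    history = [len(active)]
    while active:
        share = total_load / len(active)
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            break  # every remaining router can carry its share
        active = survivors
        history.append(len(active))
    return history


# Assumed capacities in arbitrary request units.
capacities = [95, 105, 115, 125, 135, 145, 155, 165, 175, 185]

print(simulate_cascade(900, capacities))   # all ten routers cope
print(simulate_cascade(1000, capacities))  # one weak router starts a collapse
```

In this model the ten routers have far more combined capacity than the traffic requires, yet a modest increase in load pushes the weakest one over its limit and the shifted traffic takes the others down round by round.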
The overloaded routers meant that users couldn't access Gmail via the web interface because their requests couldn't be routed to a Gmail server. But IMAP/POP access and mail processing continued to work normally because these requests don't use the same routers.
Google fixed the problem by bringing additional routers online, and says it has since increased Gmail's router capacity "well beyond peak demand to provide headroom." But more refinements are in the works to avert a repeat.
"We have concluded that request routers don't have sufficient failure isolation (i.e. if there's a problem in one datacenter, it shouldn't affect servers in another datacenter) and do not degrade gracefully (e.g. if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load)," Treynor reported.
Gmail also had an extended outage in February, when a code change triggered “cascading problems” that overloaded a data center in Europe. For more background on that outage, along with a broader overview of Google’s incident management, see our interview with Google senior VP of operations Urs Hoelzle (How Google Routes Around Outages).