Router Ripples Cited in Gmail Outage
September 1st, 2009 By: Rich Miller
Google has published an update on this afternoon’s Gmail downtime. “Today’s outage was a Big Deal, and we’re treating it as such,” writes Ben Treynor, Google’s VP of Engineering and Site Reliability Czar. “We’re currently compiling a list of things we intend to fix or improve as a result of the investigation.”
The problem? Treynor says Google underestimated the load that routine maintenance on “a small fraction” of Gmail servers would place on the routers supporting the application. “At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system ‘stop sending us traffic, we’re too slow!’,” Treynor wrote. “This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded.”
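The chain reaction Treynor describes can be illustrated with a toy model. The sketch below is not Google's routing logic: the function name, the even split of load, and all capacity and load figures are assumptions invented for illustration. It only shows how a small rise in load can take out every router once overloaded routers start refusing traffic.

```python
# Toy model of a cascading overload (all numbers are hypothetical).
# Assumption: load is split evenly across active routers, and a router
# whose share exceeds its capacity stops accepting traffic entirely.

CAPACITIES = list(range(100, 150, 5))  # ten routers: 100, 105, ..., 145


def simulate_cascade(total_load, capacities):
    """Return (routers still serving, number of failure rounds)."""
    active = list(capacities)
    rounds = 0
    while active:
        share = total_load / len(active)
        # Routers that cannot carry their share drop out of rotation.
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            break  # stable: every active router can carry its share
        active = survivors
        rounds += 1
    return len(active), rounds


if __name__ == "__main__":
    print(simulate_cascade(950, CAPACITIES))   # load within capacity
    print(simulate_cascade(1020, CAPACITIES))  # slightly higher load
```

In this model a load of 950 leaves all ten routers serving, while a load of 1020 pushes only the weakest router over its limit at first, yet the redistributed traffic then knocks out the rest within three rounds.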
The overloaded routers meant that users couldn’t access Gmail via the web interface because their requests couldn’t be routed to a Gmail server. But IMAP/POP access and mail processing continued to work normally because these requests don’t use the same routers.
Google fixed the problem by bringing additional routers online, and says it has since increased Gmail’s router capacity “well beyond peak demand to provide headroom.” But more refinements are in the works to avert a repeat.
“We have concluded that request routers don’t have sufficient failure isolation (i.e. if there’s a problem in one datacenter, it shouldn’t affect servers in another datacenter) and do not degrade gracefully (e.g. if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load),” Treynor reported.
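The graceful degradation Treynor mentions can be sketched the same way. Again, this is a hypothetical model rather than anything Google has published: the function name, the even split of load, and the numbers are assumptions. Here an overloaded router keeps its share of traffic and simply responds more slowly, instead of refusing requests and shifting load to its peers.

```python
# Toy model of graceful degradation (all numbers are hypothetical).
# Assumption: every router keeps an even share of the load; a router
# pushed past its capacity slows down in proportion to the overload.

CAPACITIES = list(range(100, 150, 5))  # ten routers: 100, 105, ..., 145


def slowdown_factors(total_load, capacities):
    """Return a per-router slowdown factor (1.0 means full speed)."""
    share = total_load / len(capacities)
    return [max(1.0, share / c) for c in capacities]


if __name__ == "__main__":
    factors = slowdown_factors(1020, CAPACITIES)
    print(max(factors))  # worst-case slowdown across all routers
```

With a load of 1020 spread over these ten routers, only the weakest router exceeds its capacity, and it runs about 2 percent slower while the rest are unaffected.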
Gmail also had an extended outage in February when a code change triggered “cascading problems” that overloaded a data center in Europe. For more background on that outage, along with a broader overview of Google’s incident management, see our interview with Google senior VP of operations Urs Hoelzle (How Google Routes Around Outages).
This is the difference between Google and so many other providers. Most troubleshooters will find the problem and restore service. Google looks to the root cause and how to prevent future outages. Personally, I went out for lunch when I couldn’t access Gmail, knowing that it would likely be restored by the time I got back. Indeed it was.
[...] Feel free to also read Data Center Knowledge’s article about Gmail’s problems. [...]
Does “google.ca” use the same request routers? It was also down, with the same 502 error that people were seeing for gmail.com.
[...] has acknowledged developing software with this goal in mind, and several recent Gmail service outages have reinforced the value of rapid load-shifting across data [...]