Router Ripples Cited in GMail Outage

Today's Gmail outage resulted from underestimating the load that routine maintenance on "a small fraction" of Gmail servers would place on the routers supporting the application, Google said.

Rich Miller

September 2, 2009

Google has published an update on this afternoon's Gmail downtime. "Today's outage was a Big Deal, and we're treating it as such," writes Ben Treynor, Google's VP of Engineering and Site Reliability Czar. "We're currently compiling a list of things we intend to fix or improve as a result of the investigation."

The problem? Treynor says Google underestimated the load that routine maintenance on "a small fraction" of Gmail servers would place on the routers supporting the application. "At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system 'stop sending us traffic, we're too slow!'," Treynor wrote. "This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded."
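The chain reaction Treynor describes can be illustrated with a toy model. The sketch below is not Google's routing logic: the function name `simulate_cascade`, the capacity figures, and the assumption that load is split evenly across whichever routers still accept traffic are all invented for illustration. It only shows how a small overload can knock out the weakest routers first and then spread.

```python
def simulate_cascade(total_load, capacities):
    """Toy model: return how many routers accept traffic after each round.

    Assumptions (not from Google's report): load is split evenly across
    active routers, and a router whose share exceeds its capacity stops
    accepting traffic entirely.
    """
    active = list(capacities)
    history = [len(active)]
    while active:
        share = total_load / len(active)
        survivors = [c for c in active if c >= share]
        if len(survivors) == len(active):
            break  # every remaining router can carry its share
        active = survivors
        history.append(len(active))
    return history


# Hypothetical capacities for ten routers (arbitrary units).
capacities = [100, 100, 105, 110, 115, 120, 125, 130, 135, 140]

print(simulate_cascade(1000, capacities))  # stable: no router drops out
print(simulate_cascade(1010, capacities))  # a 1% increase triggers a collapse
```

In this made-up example the routers' combined capacity (1,180 units) exceeds the load in both runs, yet the second run still ends with no routers accepting traffic, because each dropout raises the share carried by the rest.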

The overloaded routers meant that users couldn't access Gmail via the web interface because their requests couldn't be routed to a Gmail server. But IMAP/POP access and mail processing continued to work normally because these requests don't use the same routers.

Google fixed the problem by bringing additional routers online, and says it has since increased Gmail's router capacity "well beyond peak demand to provide headroom." But more refinements are in the works to avert a repeat.

"We have concluded that request routers don't have sufficient failure isolation (i.e. if there's a problem in one datacenter, it shouldn't affect servers in another datacenter) and do not degrade gracefully (e.g. if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load)," Treynor reported.
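The difference between refusing traffic and degrading gracefully can be sketched for a single router. This is a simplified illustration, not a description of Gmail's request routers: the `Outcome` type, both function names, and the rule that latency grows in proportion to overload are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    accepted: float  # load this router keeps serving
    shifted: float   # load pushed onto other routers
    slowdown: float  # latency multiplier relative to normal


def refuse_when_overloaded(offered, capacity):
    """Hypothetical policy: an overloaded router rejects all of its traffic."""
    if offered > capacity:
        return Outcome(accepted=0.0, shifted=offered, slowdown=1.0)
    return Outcome(accepted=offered, shifted=0.0, slowdown=1.0)


def degrade_gracefully(offered, capacity):
    """Hypothetical policy: keep all traffic and respond more slowly.

    Assumes latency scales linearly with the overload ratio, which is a
    simplification chosen for this sketch.
    """
    slowdown = max(1.0, offered / capacity)
    return Outcome(accepted=offered, shifted=0.0, slowdown=slowdown)


# A router with capacity 100 is offered 110 units of load.
print(refuse_when_overloaded(110, 100))
print(degrade_gracefully(110, 100))
```

Under the first policy the router's entire load lands on its neighbors; under the second, users see slower responses but no load is shifted elsewhere.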

Gmail also had an extended outage in February when a code change triggered “cascading problems” that overloaded a data center in Europe. For more background on that outage, along with a broader overview of Google’s incident management, see our interview with Google senior VP of operations Urs Hoelzle (How Google Routes Around Outages).
