Google Explains What Went Wrong to Cause PaaS Outage

Blames the incident on a software update for its traffic routers

Chris Burt

August 25, 2016

An overhead view of the server infrastructure in Google's data center in Council Bluffs, Iowa. (Photo: Connie Zhou for Google)


Brought to You by The WHIR

Google has released more details this week on what caused its Google App Engine outage earlier this month. The Aug. 11 outage affected 37 percent of applications hosted in its US-Central region, according to the incident report.

Google said that the outage lasted just under two hours in the afternoon (local time). Almost half of those affected, 18 percent of apps hosted in the region overall, experienced error rates between 10 and 50 percent, while 14 percent experienced error rates between one and 10 percent. Three percent had error rates higher than 50 percent.

Latency also increased for the apps impacted during the disruption. Other Google App Engine regions were not affected by the incident.

“We apologize for this incident,” Google said in the report. “We know that you choose to run your applications on Google App Engine to obtain flexible, reliable, high-performance service, and in this incident we have not delivered the level of reliability for which we strive.”

Google blamed the incident on a software update for its traffic routers, which triggered a rolling restart during standard periodic maintenance. During such maintenance, Google shifts applications between data centers, and its engineers "gracefully drain traffic from an equivalent proportion of servers in the downsized datacenter in order to reclaim resources," according to the report.

At this point, the reduced router capacity led to rescheduled instance start-ups, slow start-ups, and retried start requests, which ultimately overloaded even the extra system capacity. Manual traffic redirection by the company was not enough to resolve the problem until a configuration error causing a traffic imbalance in the new data centers was identified and fixed.
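The failure mode described here, where retried requests pile extra load onto capacity that is already saturated, is a classic retry storm. The incident report does not describe Google's internal scheduler logic, but a common mitigation is exponential backoff with jitter, sketched minimally below (the function name and parameters are illustrative assumptions, not anything from the report):

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter exponential backoff (illustrative sketch, not
    Google's actual retry logic).

    Each retry waits a random interval between 0 and
    min(cap, base * 2**attempt) seconds, so clients that failed at
    the same moment spread their retries out over time instead of
    hammering an overloaded service in synchronized waves.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Without a scheme like this, every failed start request retries almost immediately, multiplying the load on routers that are already short on capacity; the jittered, growing delays give the system room to recover.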

Traffic routing capacity has been upgraded, and application rescheduling and system retry procedures will be changed to prevent a repeat of the incident.

A configuration error was also part of the cause of the brief Google Compute Engine outage in April.
