Google Explains What Went Wrong to Cause PaaS Outage

Blames the incident on a software update for its traffic routers

Chris Burt

August 25, 2016

An overhead view of the server infrastructure in Google’s data center in Council Bluffs, Iowa. (Photo: Connie Zhou for Google)

Brought to You by The WHIR

Google has released more details this week on what caused its Google App Engine outage earlier this month. The Aug. 11 outage affected 37 percent of applications hosted in its US-Central region, according to the incident report.

Google said the outage lasted just under two hours in the afternoon (local time). Almost half of the affected apps, or 18 percent of all apps hosted in the region, experienced error rates between 10 and 50 percent, while 14 percent experienced error rates between one and 10 percent. Three percent had error rates higher than 50 percent.

Latency also increased for the apps impacted during the disruption. Other Google App Engine regions were not affected by the incident.

“We apologize for this incident,” Google said in the report. “We know that you choose to run your applications on Google App Engine to obtain flexible, reliable, high-performance service, and in this incident we have not delivered the level of reliability for which we strive.”

Google blamed the incident on a software update for its traffic routers, which triggered a rolling restart during standard periodic maintenance. The maintenance involved Google engineers shifting applications between data centers. The engineers “gracefully drain traffic from an equivalent proportion of servers in the downsized datacenter in order to reclaim resources.”

During the restart, the reduced router capacity led to rescheduled instance start-ups, slow start-ups and retried start requests, ultimately overloading even the extra system capacity. The company’s manual traffic redirection was not enough to resolve the problem until engineers identified and fixed a configuration error that was causing a traffic imbalance in the new data centers.

Traffic routing capacity has been upgraded, and application rescheduling and system retry procedures will be changed to prevent a repeat of the incident.
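
Retried requests piling onto an already strained system is a common failure pattern, and one widely used mitigation is to space retries out with capped exponential backoff and random jitter. The Python sketch below illustrates that general technique only; it is not Google's actual fix, which the report does not describe at this level of detail. The names `backoff_delays` and `call_with_retries`, along with the base delay, cap and attempt limit, are hypothetical values chosen for illustration.

```python
import random
import time


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=None):
    """Yield one wait time (in seconds) per retry.

    The upper bound doubles with each retry, up to `cap`, and the actual
    wait is drawn uniformly below that bound ("full jitter") so that many
    clients do not retry at the same instant.
    """
    rng = rng or random.Random()
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng.uniform(0.0, ceiling)


def call_with_retries(request, max_attempts=5, sleep=time.sleep, rng=None):
    """Call `request` until it succeeds or the attempt budget runs out.

    `request` is any zero-argument callable (hypothetical here, e.g. a
    function that asks a scheduler to start an instance).
    """
    delays = backoff_delays(max_attempts - 1, rng=rng)
    while True:
        try:
            return request()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # budget exhausted: surface the last error
            sleep(delay)
```

Bounding the number of attempts and randomizing the waits keeps a burst of failures from turning into a synchronized wave of retries, which is the kind of amplification the incident report describes.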

A configuration error was also part of the cause of the brief Google Compute Engine outage in April.
