Load Balancer Misbehavior Cited in Google Outage
What happened during Monday’s Gmail outage? At the time, we observed that these widespread Google outages “usually involve software updates or networking issues. Or in some cases, a software update causing a networking issue.” According to an incident report, the cause was indeed a software update causing a networking issue, specifically in Google’s load balancers.
“Between 8:45 AM PT and 9:13 AM PT, a routine update to Google’s load balancing software was rolled out to production,” the report says. “A bug in the software update caused it to incorrectly interpret a portion of Google data centers as being unavailable. The Google load balancers have a failsafe mechanism to prevent this type of failure from causing Googlewide service degradation, and they continued to route user traffic. As a result, most Google services, such as Google Search, Maps, and AdWords, were unaffected. However, some services, including Gmail, that require specific data center information to efficiently route users’ requests, experienced a partial outage.”
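The incident report does not say how Google’s failsafe works. As a rough illustration only, one common pattern is for a load balancer to stop trusting its health data once an implausibly large share of backends appears to be down, and to keep routing to all of them. The sketch below assumes that pattern; the names `DataCenter` and `eligible_targets` and the 50 percent threshold are hypothetical and are not taken from Google’s systems.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class DataCenter:
    name: str
    healthy: bool  # health as seen by the load balancer, which may be wrong


def eligible_targets(
    data_centers: Sequence[DataCenter],
    min_healthy_fraction: float = 0.5,  # hypothetical threshold
) -> List[DataCenter]:
    """Return the data centers that traffic may be routed to."""
    if not data_centers:
        return []
    healthy = [dc for dc in data_centers if dc.healthy]
    if len(healthy) / len(data_centers) < min_healthy_fraction:
        # Failsafe: too many backends look down at once, so assume the
        # health data itself is suspect and keep routing to every one.
        return list(data_centers)
    return healthy
```

Under this assumption, a service that can be served from any data center keeps working when the failsafe engages, while a service that needs accurate per-data-center information (as the report says Gmail does) could still see degraded routing.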
There was an interesting wrinkle to this outage that extended the impact more broadly. Wired Enterprise noted that the load balancer problems affected Google’s Sync web service, which allows Google users to share their Chrome browser settings across multiple devices. “It’s due to a backend service that sync servers depend on becoming overwhelmed, and sync servers responding to that by telling all clients to throttle all data types,” Google engineer Tim Steele said.
As a result, at the same time users were having trouble accessing Gmail, many Chrome users were experiencing mysterious browser crashes.
Google says it has fixed the load balancer bug and is changing its release process for load balancer software updates. Google’s incident report said it is “reviewing a multistep release process to push load balancer changes in one location before proceeding with a general rollout. The unique nature of load balancing systems makes this more difficult than with other software components.”
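The report gives no detail on what that multistep process would look like. A minimal sketch of the general idea, updating one location first and proceeding only if it stays healthy, might read as follows. The function `staged_rollout` and the `deploy` and `is_healthy` callbacks are hypothetical placeholders, not Google tooling.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    locations: Sequence[str],
    deploy: Callable[[str], None],      # hypothetical: pushes the update
    is_healthy: Callable[[str], bool],  # hypothetical: post-deploy check
) -> Tuple[List[str], bool]:
    """Update one location, verify it, then update the rest.

    Returns the locations that were updated and whether the rollout
    ran to completion.
    """
    if not locations:
        return [], True
    first, rest = locations[0], list(locations[1:])
    deploy(first)
    if not is_healthy(first):
        # Halt before the general rollout; only one location is affected.
        return [first], False
    for location in rest:
        deploy(location)
    return [first, *rest], True
```

In this sketch, a failed check on the first location stops the rollout before any other location is touched, which is the property the report describes wanting.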