Urs Hölzle, Senior Vice President for Technical Infrastructure at Google, speaks during the Google I/O 2014 conference in San Francisco (Stephen Lam/Getty Images)

How Google Avoids Cloud Downtime With VM Migration

Instead of shutting down clients’ VMs for maintenance, provider moves them to different hosts

Heartbleed, the security vulnerability that affected 17 percent of all web servers on the internet when it was disclosed last April, sent ripples of downtime across users’ infrastructure deployed with various public cloud providers as the providers rebooted cloud VMs to patch against the bug.

More widespread cloud reboots came in October of last year when major providers like Amazon Web Services, Rackspace, and IBM SoftLayer had to apply a patch to address a Xen hypervisor vulnerability. Another Xen update is driving cloud reboots this month.

Verizon took its entire cloud offline in January to apply an infrastructure upgrade. The company later said the upgrade included a change that would enable it to upgrade the infrastructure without taking it down in the future.

Google, however, has not had to bring customer VMs running in its Compute Engine cloud down since late 2013, when it introduced “transparent maintenance,” or a way to do live VM migration from one host to another to tinker with the infrastructure.

Miche Baker-Harvey, a tech lead for VM migration at Google, explained how Google does this in a blog post published earlier this week. Live VM migration helps Google address a multitude of issues, from regular server, network, or data center electrical infrastructure maintenance to security updates, system configuration changes, or host OS and BIOS updates.

Those were issues Google engineers expected to address with migration. Once the practice was implemented, however, they found that there were also other situations where live migration was useful. In one case, some servers had overheating batteries, which affected neighboring servers as well. Before bringing an offending server down to replace its battery, they moved the VMs it was hosting to a different machine.

At a high level, the process they use is simple: copy as much state data as you can to the target VM while keeping the source VM running, and then move the remaining data to the target, causing a blackout so brief that the customer does not notice it. The move is registered in the customer’s log.
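The two phases described above can be illustrated with a toy simulation. This is a minimal sketch and not Google's implementation: the function name `live_migrate`, the page-dictionary model of VM memory, and the parameters `dirty_percent` and `blackout_threshold` are all assumptions made for illustration. It assumes that the number of pages the running VM rewrites during a copy round is proportional to the number of pages copied in that round.

```python
import random


def live_migrate(source_pages, dirty_percent=10, blackout_threshold=8,
                 max_rounds=30, seed=0):
    """Toy pre-copy migration (hypothetical model, not Google's code).

    source_pages maps page number -> contents and is mutated to simulate
    the guest writing to memory while it keeps running.
    Returns (target_pages, copy_rounds, pages_copied_during_blackout).
    """
    rng = random.Random(seed)
    target = {}
    dirty = set(source_pages)  # nothing has been copied yet
    rounds = 0

    # Phase 1: the source VM keeps running while dirty pages are copied.
    while len(dirty) > blackout_threshold and rounds < max_rounds:
        for page in dirty:
            target[page] = source_pages[page]
        # Assumption: a shorter copy round leaves less time for the guest
        # to write, so fewer pages are dirtied again.
        n_writes = len(dirty) * dirty_percent // 100
        rewritten = rng.sample(sorted(source_pages), n_writes)
        for page in rewritten:
            source_pages[page] += 1  # the guest modifies this page
        dirty = set(rewritten)
        rounds += 1

    # Phase 2: brief blackout. The source is paused, so no new writes occur
    # while the few remaining dirty pages are copied.
    blackout_pages = len(dirty)
    for page in dirty:
        target[page] = source_pages[page]
    return target, rounds, blackout_pages


memory = {page: 0 for page in range(1000)}
target, rounds, blackout_pages = live_migrate(memory)
print(target == memory, rounds, blackout_pages)
```

In this model, a 1,000-page VM needs three copy rounds while running, and only one page is left to copy during the blackout. That shrinking remainder is why the pause can be kept short.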

[Infographic: Google VM migration, courtesy of Google]

Google has done hundreds of thousands of VM migrations since transparent maintenance was rolled out. “Many VMs have been up since migration was introduced, and all of them have been migrated multiple times,” Baker-Harvey wrote.
