The Xen-based cloud reboot post-mortems are up. Last week, Amazon Web Services and Rackspace both had to reboot parts of their clouds to fix a known security vulnerability affecting certain versions of Xen, a popular open source hypervisor.
There were no reports of compromised data, although some reboots didn’t go as smoothly as others. The maintenance affected less than 10 percent of AWS’ EC2 fleet and nearly a quarter of Rackspace’s 200,000-plus customers.
The Xen project publishes a detailed security policy that includes the protocols and processes for dealing with these kinds of issues.
It's important to note that these issues affect both open source and proprietary technology. This patch was not limited to AWS and Rackspace. They are just two big examples of cloud providers faced with a challenge that was quickly overcome.
The issue can finally be revealed without potential security repercussions. The vulnerability could have allowed those with malicious intent to read snippets of data belonging to others or to crash the host server by issuing a certain series of memory commands.
After learning of the security issue, Rackspace worked with Xen partners to develop a test patch and organize a reboot plan. The patch was ready the night of September 26. With the technical details scheduled to be publicly released today, the company had to work quickly.
“Whenever we at Rackspace become aware of a security vulnerability, whether in our systems or (as in this case) in third-party software, we face a balancing act,” wrote Rackspace CEO Taylor Rhodes. “We want to be as transparent as possible with you, our customers, so you can join us in taking actions to secure your data. But we don’t want to advertise the vulnerability before it’s fixed — lest we, in effect, ring a dinner bell for the world’s cyber criminals.”
“The zone-by-zone reboots were completed as planned and we worked very closely with our customers to ensure that the reboots went smoothly for them,” wrote AWS chief evangelist Jeff Barr.
AWS advised customers to re-examine infrastructure for possible ways to make it even more fault tolerant, including the use of Chaos Monkey, pioneered by Netflix to induce various kinds of failures in a controlled environment.
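For readers unfamiliar with the idea, the kind of controlled failure injection described above can be illustrated with a small simulation. This is only a rough sketch and not Netflix's Chaos Monkey itself: the fleet is a plain in-memory dictionary, and the names `terminate_random_instance`, `service_is_healthy` and `min_running` are hypothetical, chosen for illustration.

```python
import random


def terminate_random_instance(fleet, rng):
    """Mark one randomly chosen running instance as down and return its name.

    `fleet` is a hypothetical in-memory stand-in for real servers:
    a dict mapping instance name -> True (running) or False (down).
    """
    running = [name for name, up in fleet.items() if up]
    if not running:
        return None  # nothing left to terminate
    victim = rng.choice(running)
    fleet[victim] = False
    return victim


def service_is_healthy(fleet, min_running):
    """Treat the service as healthy if at least `min_running` instances are up."""
    return sum(fleet.values()) >= min_running


if __name__ == "__main__":
    rng = random.Random()
    fleet = {f"web-{i}": True for i in range(4)}
    victim = terminate_random_instance(fleet, rng)
    healthy = service_is_healthy(fleet, min_running=3)
    print(f"terminated {victim}; service healthy: {healthy}")
```

In a real deployment the "terminate" step would act on actual instances and the health check would probe the live service; the point of the exercise is simply to confirm, ahead of an unplanned reboot, that losing a machine does not take the application down.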