Windows Azure Cloud Hit By Downtime
Microsoft’s Windows Azure cloud service has been hit with a series of performance problems today, leaving customers unable to manage their applications for about 8 hours and knocking Azure-based services offline for some North American users.
Microsoft said the Azure service management problems were caused by a “a cert issue triggered on 2/29/2012″ – presumably a date-related glitch with a security certificate triggered by the onset of the Feb. 29th “Leap Day” which occurs once every four years. UPDATE: Microsoft has now confirmed this. “While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year,” Microsoft’s Bill Laing writes on the Windows Azure blog.
The Azure team deployed a software update to fix the problem, which was rolled out gradually. Microsoft said management functions were “restored for the majority of customers” by 1:30 pm GMT (8:30 am Eastern).
The Windows Azure Compute service began experiencing problems early this morning, several hours after the service management issues were seen.
“Incoming traffic may not go through for a subset of hosted services,” Microsoft said. “Deployed applications will continue to run … We are executing restoration steps to mitigate the issue.” Microsoft apologized for the inconvenience to users.
The outage is the latest in a series of cloud outages that are shaping how users approach resiliency of their cloud applications. In April 2011 Amazon Web Services experienced an extended outage that caused downtime or performance problems for many social media services that rely on the company’s cloud computing services. In August the European cloud operations of both Microsoft and Amazon were knocked offline by a power outage in Dublin.
Perhaps the biggest impact of the outage has been seen in how existing users approach cloud architectures, according to Fellows. “End users now want to mandate that they have multi-cloud strategies,” said William Fellows, co-founder and Principal Analyst at The 451 Group, in a panel last fall discussing the outages.
[...] looks like the problem might be due to a leap year issue, according to an article on Data Center Knowledge, Microsoft said the Azure service management problems were caused by a “a cert issue triggered on [...]
[...] that post: “Microsoft said the Azure service management problems were caused by a ‘a cert issue triggered on 2/29/2012′ – presumably a date-related glitch with a securi…‘ which occurs once every four [...]
[...] noted by Data Center Knowledge, it’s the latest in a series of cloud computing outages, with the Amazon Web Services [...]
As with most public cloud systems, Azure also experienced its day in the dark. Having spent the last 3+ years in this ecosystem, I can say that most of Azure users are in the long tail. Startups ,upstart efforts from large enterprises etc and this outage went under the radar for the most part due to the long tail nature of the user base.
As with most apps deploying to public clouds, a good monitoring and auto-healing system is critical to ensure timely notification and recovery.
We have outlined that in our blog here http://www.opstera.com/blog/