Microsoft’s Windows Azure cloud service has been hit with a series of performance problems today, leaving customers unable to manage their applications for about 8 hours and knocking Azure-based services offline for some North American users.
Microsoft said the Azure service management problems were caused by “a cert issue triggered on 2/29/2012”, presumably a date-related glitch with a security certificate triggered by the onset of Feb. 29, the “Leap Day” that occurs once every four years. UPDATE: Microsoft has now confirmed this. “While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year,” Microsoft’s Bill Laing writes on the Windows Azure blog.
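Microsoft has not published the faulty code, so the following is only a hypothetical illustration of how a leap-year time calculation can go wrong. It assumes a validity period computed by adding one to the year of the start date. The function names are invented for this sketch and do not come from Azure.

```python
from datetime import date


def naive_one_year_later(start: date) -> date:
    # Hypothetical buggy approach: bump the year and keep month and day.
    # For Feb. 29 this asks for a date that does not exist the next year,
    # so Python raises ValueError.
    return start.replace(year=start.year + 1)


def safe_one_year_later(start: date) -> date:
    # One possible fix (an assumption, not Microsoft's): fall back to
    # Feb. 28 when the start date is Feb. 29 and the next year is not
    # a leap year.
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        return start.replace(year=start.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_one_year_later(leap_day)
    except ValueError as err:
        print("naive calculation failed:", err)
    print("safe calculation:", safe_one_year_later(leap_day))
```

Whether to fall back to Feb. 28 or roll forward to March 1 is a policy choice. Either way, the point is that the calculation has to handle a start date of Feb. 29 explicitly.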
The Azure team fixed the problem with a software update that was rolled out gradually. Microsoft said management functions were “restored for the majority of customers” by 1:30 pm GMT (8:30 am Eastern).
The Windows Azure Compute service began experiencing problems early this morning, several hours after the service management issues were seen.
“Incoming traffic may not go through for a subset of hosted services,” Microsoft said. “Deployed applications will continue to run … We are executing restoration steps to mitigate the issue.” Microsoft apologized for the inconvenience to users.
The outage is the latest in a series of cloud outages that are shaping how users approach resiliency of their cloud applications. In April 2011 Amazon Web Services experienced an extended outage that caused downtime or performance problems for many social media services that rely on the company’s cloud computing services. In August the European cloud operations of both Microsoft and Amazon were knocked offline by a power outage in Dublin.
Perhaps the biggest impact of such outages has been on how existing users approach cloud architectures, according to William Fellows, co-founder and Principal Analyst at The 451 Group. “End users now want to mandate that they have multi-cloud strategies,” Fellows said in a panel last fall discussing the outages.