Microsoft issued an apology regarding the service outage that left many without email access for most of last Tuesday.
The problem affected the Lync Online and Exchange Online services, with the brunt of the outage coming on Tuesday. Rajesh Jha, corporate vice president, Office 365 Engineering, apologized on behalf of the Office 365 team and detailed the two issues that led to the outage. Microsoft will also issue a post-incident report analyzing what happened, how the team responded, and what it will do to prevent similar issues in the future.
It’s a good step towards restoring faith. While things can be frantic during an outage, a service provider needs to keep customers actively informed while it is happening. Adding to the problem, the Service Health Dashboard (SHD) also experienced problems, which meant not all affected customers were notified in a timely way. Jha said that the problem with the SHD has been addressed.
To prevent such problems, service providers often host status dashboards on infrastructure that's separate from their main services (for Salesforce.com, for example, it was a lesson learned early on).
Lync Online (instant messaging and voice) saw a brief loss of client connectivity in North American data centers due to external network failures. Connectivity was restored, but the ensuing traffic spike overloaded several network elements.
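A reconnect surge of this kind is often softened on the client side by retrying with exponential backoff and random jitter, so that clients do not all return at the same instant. The sketch below illustrates that general technique only; it is not a description of Microsoft's fix, and the function name `reconnect_delays` and its parameters are hypothetical.

```python
import random


def reconnect_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized wait time (in seconds) per retry attempt.

    Hypothetical helper: `base`, `cap`, and `rng` are illustrative
    parameters, not part of any Lync or Office 365 API.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The upper bound doubles with each attempt, up to `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # "Full jitter": pick uniformly in [0, ceiling] so that clients
        # disconnected together spread their reconnects over time.
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

A client would sleep for each delay in turn before its next connection attempt, stopping as soon as one succeeds.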
The Exchange Online issue was triggered by an intermittent failure in a directory role that caused a directory partition to stop responding to authentication requests. This caused “a small set” of customers to lose email access. That failure then exposed a previously unknown code flaw, which caused an unexpected problem in the broader mail delivery system, and this is when the outage went wide.
Microsoft fought it by partitioning the mail delivery system away from the failed directory partition, then attacked the root cause. Jha says the team is working on further layers of hardening for this pattern.
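Isolating a healthy system from a failed dependency is commonly done with a circuit breaker: after repeated failures, callers stop sending requests to the unhealthy component for a cool-down period instead of piling up behind it. The following is a minimal sketch of that general pattern, not Microsoft's implementation; `CircuitBreaker`, `authenticate`, and the `lookup` callback are hypothetical names.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker guarding calls to one directory partition."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed (healthy)

    def allow(self):
        if self.opened_at is None:
            return True
        # Once the cool-down has elapsed, let a trial request through.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open) the breaker and restart the cool-down.
            self.opened_at = self.clock()


def authenticate(partition, breaker, lookup):
    """Call `lookup(partition)` unless the breaker has isolated the partition."""
    if not breaker.allow():
        raise RuntimeError(f"partition {partition} is isolated")
    try:
        result = lookup(partition)
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

With one breaker per partition, a partition that stops answering is fenced off quickly while requests to the other partitions continue unaffected.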
“While we have fixed the root causes of the issues, we will learn from this experience and continue improving our proactive monitoring, prevention, recovery and defense in depth systems,” wrote Jha. “I appreciate the trust you have placed in our service. My team and I are committed to continuously earning and maintaining your trust every day. Once again, I apologize for the recent service issues.”
Read Jha's post in full here.