Is a network uptime dashboard useful if it goes down along with the service it reports on? That’s the question raised by Tuesday’s 40-minute outage at Salesforce.com, which left 900,000 subscribers without access to their applications. The company attributed the downtime to a network device that failed because of memory allocation errors.
“The failure caused it to stop passing data but did not properly trigger a graceful fail over to the redundant system as the memory allocation errors were present on the failover system as well,” Salesforce.com reported on its status dashboard. “This resulted in a full service failure for all instances. Salesforce.com had to initiate manual recovery steps to bring the service back up.”
The Register said the outage “exposed the dark side of cloud computing,” demonstrating the vulnerability of the cloud. Others took a more practical view of the issues raised by the downtime.
“This event clearly shows us why hosting your own public health dashboard is a problem,” writes Lenny Rachitsky at Transparent Uptime. “The dashboard was down along with the site itself.”
Rachitsky has a stake in the topic: he’s a senior engineer at WebMetrics, which provides third-party performance monitoring. But his blog offers useful analysis of web site performance, including a list of public health dashboards and status sites for SaaS providers.
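Rachitsky's point about self-hosted dashboards can be illustrated with a small sketch. The code below is a toy model, not anything Salesforce.com or WebMetrics runs: the `Endpoint` class, the `network` labels, and both function names are hypothetical. It only shows that a status page sharing infrastructure with the service becomes unreachable in the same failure, while one hosted independently stays up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """A service or status page, tagged with the infrastructure it runs on."""
    name: str
    network: str  # hypothetical label, e.g. "provider-dc" or "independent-host"


def shares_fate(service: Endpoint, dashboard: Endpoint) -> bool:
    """True when the dashboard runs on the same infrastructure as the service."""
    return service.network == dashboard.network


def dashboard_reachable(dashboard: Endpoint, failed_networks: set) -> bool:
    """A dashboard is reachable only if its own infrastructure has not failed."""
    return dashboard.network not in failed_networks


# Illustrative setup: names and networks are assumptions for this example.
service = Endpoint("app", "provider-dc")
self_hosted = Endpoint("status-self-hosted", "provider-dc")
third_party = Endpoint("status-third-party", "independent-host")

# Simulate a full failure of the provider's own infrastructure.
failed = {service.network}

for dash in (self_hosted, third_party):
    print(dash.name, "reachable during outage:", dashboard_reachable(dash, failed))
```

In this model the self-hosted dashboard fails together with the service, which is the situation Rachitsky describes; the independently hosted one remains available to report the outage.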
Salesforce.com, for its part, says it will “continue to work with hardware vendors to fully detail the root cause and identify if further patching or fixes will be needed.”