-
Do Brief Outages Count? Google’s SLA Says No
December 4th, 2008 : Rich Miller
Is an outage of less than 10 minutes still an outage? Not according to the Service Level Agreement (SLA) for Google Apps, which includes a 99.9 percent uptime guarantee. Pingdom read the fine print in the SLA and found the following definition:
“Downtime Period” means, for a domain, a period of ten consecutive minutes of Downtime. Intermittent Downtime for a period of less than ten minutes will not be counted towards any Downtime Periods.
This loophole has been used to posit unlikely worst-case scenarios in which Google could have repeated short outages and still honor its SLA guarantees (which in turn has prompted discussion over at TechCrunch). The larger issue is whether Google is defining outages in a way that waters down the uptime guarantee, serving to provide additional protection to the provider rather than the customer.
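To make the worst-case scenario concrete, here is a minimal sketch of the arithmetic. It assumes a 30-day month and assumes that any stretch of ten or more consecutive minutes counts in full; the SLA's actual formula may differ, and the function name and sample outage list are hypothetical.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month (43,200 minutes)
THRESHOLD_MINUTES = 10            # from the quoted "Downtime Period" definition


def uptime_percent(outages, apply_threshold):
    """Return monthly uptime as a percentage.

    outages: durations, in minutes, of each stretch of consecutive downtime.
    apply_threshold: if True, ignore stretches shorter than ten minutes.
    Assumption: a qualifying stretch counts in full; the real SLA formula
    may differ.
    """
    counted = sum(
        d for d in outages if not apply_threshold or d >= THRESHOLD_MINUTES
    )
    return 100.0 * (MINUTES_PER_MONTH - counted) / MINUTES_PER_MONTH


# Hypothetical month: one hundred 9-minute outages and nothing longer.
outages = [9] * 100
raw = uptime_percent(outages, apply_threshold=False)
sla = uptime_percent(outages, apply_threshold=True)
print(f"raw uptime: {raw:.2f}%")
print(f"SLA uptime: {sla:.2f}%")
```

Under these assumptions, 900 minutes of real downtime leaves measured uptime near 97.9 percent, yet none of it counts against the 99.9 percent guarantee, which by itself allows only about 43 minutes of downtime in a 30-day month.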
How does your SLA define an outage?
-
Error Pages: Because Downtime Happens
December 3rd, 2008 : Rich Miller
Nobody likes downtime. But a growing number of web sites are offering custom pages for maintenance and/or 404 errors, hoping to defuse user frustration when sites or pages are unavailable. Pingdom has compiled a gallery of 24 Fun and Inspiring Web 2.0 Error Pages. Who has the most uninspired pages? Microsoft and Google, according to Pingdom.
-
Sears Web Site Hit By Black Friday Outage
November 28th, 2008 : Rich Miller
The huge crowds for Black Friday and Cyber Monday almost always result in outages for high-profile web sites. The early casualty this year appears to be Sears.com, which is experiencing performance problems Friday morning.
-
UPS Failure Triggered Friendster Outage
November 17th, 2008 : Rich Miller
A “catastrophic” UPS failure caused a power outage Thursday at a Santa Clara data center operated by Quality Technology Services, triggering days of performance problems for the social network Friendster. Quality Tech said the outage occurred during planned maintenance when the facility was switched from utility power to backup diesel generators.
“Regrettably, the maintenance did not go as planned and we suffered a catastrophic UPS failure at 8:22 am Pacific Standard Time,” said Mark Waddington, President of Quality Technology Services, in an incident report for customers. “The UPS failed to stand in and smoothly transfer power from the utility to the temporary generators due to a voltage regulator problem with the temporary generators. The failure resulted in the triggering of the FM200 (fire suppression) system in the enclosed battery room and the subsequent EPO as part of our life safety system.”
FM200 is a popular fire suppression system that uses a chemical “clean agent” rather than water. The EPO (Emergency Power Off) button instantly cuts power in the data center when a situation presents a risk to worker safety or equipment.
The Santa Clara facility was back on generator power within two hours, but Friendster remained offline for more than 23 hours over three days. While it has been eclipsed in the U.S. by MySpace and Facebook, Friendster has seen strong growth in international markets (particularly the Philippines) and says it has 85 million users.
-
Site Outages for Friendster, Twitter
November 13th, 2008 : Rich Miller
Both Friendster and Twitter have experienced downtime this afternoon:
- Watch those DNS settings! Twitter said its brief downtime was due to a DNS configuration error that impacted the entire site. The outage for Twitter is notable because it hasn’t had any in a while and survived Election Night traffic with no major problems. “We’ve not had an outage of this length since mid-July and will be carefully reviewing what went wrong,” the Twitter team said on the site’s status page.
- The outage at Friendster appears to be more significant, and there are reports that it is related to power problems at its data center. The site remains offline, and a traceroute to Friendster ends at Quality Technology Services/Globix.
-
Early Struggles for FiveThirtyEight, 270toWin
November 4th, 2008 : Rich Miller
An update on Scaling for Election Night: We’re already seeing signs of significant performance problems at two popular sites tracking the Electoral College tally, FiveThirtyEight and 270toWin. Here’s a status report from FiveThirtyEight:
Apologies in advance — blogger.com looks likely to about to pull an epic fail tonight on our most important night. I’ve been clicking publish since before 6pm central on one post, and the rest of the internet is lightning speed. Just looking at results and fruitlessly clicking “publish.” We’re here and trying to publish; just can’t. They can’t handle the traffic. Sorry everybody.
Parts of the site are loading better than others, with blog posts seeming to fare better than maps. UPDATE: FiveThirtyEight appears to be loading faster now (8:35 pm Eastern). Perhaps someone at Google has noticed and intervened.
Over at 270toWin, site operators are also reporting difficulties handling traffic loads:
The site may be slow or unreachable this evening due to extreme traffic levels. At 6PM, the ‘2008 Swing States’ view on the home page interactive map was set to all swing and we will attempt to update through the evening. … Other features on the site will not be updated and some have been disabled.
FiveThirtyEight is hosted on Google’s Blogger platform, while 270toWin hosts with 1&1 Internet.
Here’s the good news: Twitter appears to be doing fine so far.
-
Outage for Wordpress.com Blog Platform
October 27th, 2008 : Rich Miller
Wordpress.com, which hosts more than 4.5 million blogs, experienced performance problems this morning during a denial of service attack. Many site owners said their blogs were completely unreachable, while others reported that elements of their site were not functioning properly.
“At approximately 9:40AM EST this morning, we suffered a Distributed Denial of Service attack which caused some blogs to become unavailable for a short period of time,” a rep for Automattic wrote on the Wordpress.com customer forums. “About 20 minutes after the attack started, the sources and target were identified and legitimate traffic was routed to our other 3 data centers, which were unaffected by the attack.”
-
Ping.fm Back Online After Domain Snafu
October 22nd, 2008 : Rich Miller
Want another good reason to stick with old-school .com and .net domain names instead of one of those quirky new top-level domains? Apply data center engineering logic: what happens with these newer domain extensions when something goes badly awry? It looks like Ping.fm just learned this the hard way, as a problem managing its domain name knocked the service offline for more than a day.
Ping.fm, which provides social media updating for multiple services (e.g. posting simultaneously to Twitter, FriendFeed and Facebook), realized at about 8 a.m. Monday that its domain was displaying a GoDaddy parking ad page. The Ping.fm team first blamed GoDaddy (“they seemingly ‘screwed us’”) as it used its Twitter feed to combat rumors that the service had closed. By Monday night GoDaddy was clearly trying to get the domain back online. “Actually, the office of the President of GoDaddy called me personally ensuring that they are working on this with a high priority,” wrote Ping.fm CEO Sean McCullough. So how does the domain remain offline once Bob Parsons gets involved?

