-
UPS Failure Triggered Friendster Outage
November 17th, 2008 : Rich Miller
A “catastrophic” UPS failure caused a power outage Thursday at a Santa Clara data center operated by Quality Technology Services, triggering days of performance problems for the social network Friendster. Quality Tech said the outage occurred during planned maintenance when the facility was switched from utility power to backup diesel generators.
“Regrettably, the maintenance did not go as planned and we suffered a catastrophic UPS failure at 8:22 am Pacific Standard Time,” said Mark Waddington, President of Quality Technology Services, in an incident report for customers. “The UPS failed to stand in and smoothly transfer power from the utility to the temporary generators due to a voltage regulator problem with the temporary generators. The failure resulted in the triggering of the FM200 (fire suppression) system in the enclosed battery room and the subsequent EPO as part of our life safety system.”
FM200 is a popular fire suppression system that uses a chemical “clean agent” rather than water. The EPO (Emergency Power Off) button instantly cuts power in the data center when a situation presents a risk to worker safety or equipment.
The Santa Clara facility was back on generator power within two hours, but Friendster remained offline for more than 23 hours over three days. While it has been eclipsed in the U.S. by MySpace and Facebook, Friendster has seen strong growth in international markets (particularly the Philippines) and says it has 85 million users.
-
Site Outages for Friendster, Twitter
November 13th, 2008 : Rich Miller
Both Friendster and Twitter have experienced downtime this afternoon:
- Watch those DNS settings! Twitter said its brief downtime was due to a DNS configuration error that impacted the entire site. The outage for Twitter is notable because it hasn’t had any in a while and survived Election Night traffic with no major problems. “We’ve not had an outage of this length since mid-July and will be carefully reviewing what went wrong,” the Twitter team said on the site’s status page.
- The outage at Friendster appears to be more significant, and there are reports that it is related to power problems at its data center. The site remains offline, and a traceroute to Friendster ends at Quality Technology Services/Globix.
-
Early Struggles for FiveThirtyEight, 270toWin
November 4th, 2008 : Rich Miller
An update on Scaling for Election Night: We’re already seeing signs of significant performance problems at two popular sites tracking the Electoral College tally, FiveThirtyEight and 270toWin. Here’s a status report from FiveThirtyEight:
Apologies in advance — blogger.com looks likely to about to pull an epic fail tonight on our most important night. I’ve been clicking publish since before 6pm central on one post, and the rest of the internet is lightning speed. Just looking at results and fruitlessly clicking “publish.” We’re here and trying to publish; just can’t. They can’t handle the traffic. Sorry everybody.
Parts of the site are loading better than others, with blog posts seeming to fare better than maps. UPDATE: FiveThirtyEight appears to be loading faster now (8:35 pm Eastern). Perhaps someone at Google has noticed and intervened.
Over at 270toWin, site operators are also reporting difficulties handling traffic loads:
The site may be slow or unreachable this evening due to extreme traffic levels. At 6PM, the ‘2008 Swing States’ view on the home page interactive map was set to all swing and we will attempt to update through the evening. … Other features on the site will not be updated and some have been disabled.
FiveThirtyEight is hosted on Google’s Blogger platform, while 270toWin hosts with 1&1 Internet.
Here’s the good news: Twitter appears to be doing fine so far.
-
Outage for WordPress.com Blog Platform
October 27th, 2008 : Rich Miller
WordPress.com, which hosts more than 4.5 million blogs, experienced performance problems this morning during a denial of service attack. Many site owners said their blogs were completely unreachable, while others reported that elements of their site were not functioning properly.
“At approximately 9:40AM EST this morning, we suffered a Distributed Denial of Service attack which caused some blogs to become unavailable for a short period of time,” a rep for Automattic wrote on the WordPress.com customer forums. “About 20 minutes after the attack started, the sources and target were identified and legitimate traffic was routed to our other 3 data centers, which were unaffected by the attack.”
-
Ping.fm Back Online After Domain Snafu
October 22nd, 2008 : Rich Miller
Want another good reason to stick with old-school .com and .net domain names instead of one of those quirky new top-level domains? Apply data center engineering logic: what happens with these newer domain extensions when something goes badly awry? It looks like Ping.fm just learned this the hard way, as a problem managing its domain name knocked the service offline for more than a day.
Ping.fm, which provides social media updating for multiple services (e.g., posting simultaneously to Twitter, FriendFeed and Facebook), realized at about 8 a.m. Monday that its domain was displaying a GoDaddy parking ad page. The Ping.fm team first blamed GoDaddy (“they seemingly ‘screwed us’”) as it used its Twitter feed to combat rumors that the service had closed. By Monday night GoDaddy was clearly trying to get the domain back online. “Actually, the office of the President of GoDaddy called me personally ensuring that they are working on this with a high priority,” wrote Ping.fm CEO Sean McCullough. So how does the domain remain offline once Bob Parsons gets involved?
-
‘Router Failure’ at TBS Cited in ALCS Outage
October 18th, 2008 : Rich Miller
A power outage at TBS network’s broadcast facility in Atlanta blacked out TV coverage of almost the entire first inning of Game 6 of the American League Championship Series (ALCS) tonight. When the broadcast resumed, the Rays led 1-0 on a solo home run by outfielder B.J. Upton against Red Sox starter Josh Beckett.
“Two circuit breakers in our Atlanta transmission operations tripped causing the master router and its backup - which are necessary to transmit any incoming feed outbound - to shut down,” TBS said in a statement. “This impacted our live feed from being distributed to any of the other networks in the Turner portfolio and caused the delay in our coverage. Both our primary and backup routers were impacted by this problem. We apologize to baseball fans for this mishap that caused a delay in our coverage.”
-
Nature vs. The Power Grid
October 17th, 2008 : Rich Miller
When it comes to data center connectivity outages, backhoes are usually Public Enemy Number One. What about power outages? These often have natural causes, as documented by Pingdom in a post that examines Mother Nature’s assault on electricity and the Internet. A review of power outages thus far in 2008 reveals that the grid has been brought low by rats, squirrels, birds, snakes, raccoons and at least one opossum. Storms, floods and earthquakes also figure prominently.
-
Lengthy Outage for Some Gmail Users
October 17th, 2008 : Rich Miller
Some users of Google’s Gmail were unable to use the popular webmail service Wednesday and yesterday, with the downtime reaching 24 hours for some frustrated Google Apps customers, ComputerWorld reports.
“We’re aware of a problem with Gmail access affecting a small number of users,” a Google admin said in the support forum for Google Apps. “We expect to resolve the problem on October 16th at 6:00pm Pacific Time, although some users may regain access sooner.” But it wasn’t until about 7 am today (Friday) that Google confirmed that the issue had been resolved. No details were offered about the nature of the outage, which produced 502 server errors when users tried to log in, or why it took so long to address.
-
Downtime on the Rise at LinkedIn?
October 10th, 2008 : Rich Miller
LinkedIn, the social network for professionals, has been hit with a spate of outages, including more than an hour of downtime last night. Pingdom has a report on the site’s recent issues, and wonders whether LinkedIn is experiencing scaling challenges, noting its one-year growth from 14 million users to 25 million.
Periodic outages are nothing new for LinkedIn, which placed in the middle of the pack in Pingdom’s ranking of social network downtime earlier this year, with 4 hours of downtime in the first two months of 2008. But in the last five weeks LinkedIn has been offline for more than 9 hours, including a 5-hour outage on Sept. 6. In August LinkedIn announced that it had expanded its infrastructure in a Chicago data center operated by Equinix (EQIX).
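To put those figures in perspective, the downtime totals can be converted to rough availability percentages. The sketch below is illustrative arithmetic only: the function name `availability_pct` is hypothetical, the periods are assumed to be 60 days (January and February of leap-year 2008) and 35 days (five weeks), and 9 hours is treated as a lower bound on the recent downtime.

```python
def availability_pct(downtime_hours: float, period_days: float) -> float:
    """Return uptime as a percentage of the whole period."""
    period_hours = period_days * 24
    return 100.0 * (1 - downtime_hours / period_hours)


# First two months of 2008 (assumed 31 + 29 = 60 days), 4 hours of downtime.
early_2008 = availability_pct(4, 60)

# Last five weeks (assumed 35 days), at least 9 hours of downtime.
recent = availability_pct(9, 35)

print(f"Jan-Feb 2008: {early_2008:.2f}% uptime")
print(f"Last five weeks: at most {recent:.2f}% uptime")
```

Under these assumptions, uptime slipped from about 99.72% in the first two months of the year to no better than 98.93% over the last five weeks.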
I’m a casual LinkedIn user (here’s my profile). But it’s clear that many folks in our industry use LinkedIn, and the site features a growing number of discussion groups for data center professionals, including the new Data Center Pulse group founded by Dean Nelson from Sun and Mark Thiele from VMware.
-
TypePad Tweets Its Downtime
October 3rd, 2008 : Rich Miller
Twitter is becoming an important communications tool for hosting companies experiencing outages, especially as it’s become more stable. An example: Last night Six Apart used Twitter to update users on an outage for its TypePad blog hosting service. “A bad power supply in one of our core routers has some TypePad blogs not displaying,” the company reported last night. “Updates shortly and of course on status.typepad.com.” Pingdom reports that the outage lasted for about an hour.
An hour may not seem like much. But news spreads quickly for companies that host a lot of blogs, and Twitter is increasingly where outage reports are turning up first. Sometimes these tweets are from frustrated users, and sometimes from a company complaining about its data center provider.
Six Apart is just one of many companies in the hosting and data center sector that are monitoring Twitter and using it to communicate with users during outages (see Ogilvy for more about Six Apart’s approach to “Twustomer Service”). Are you using Twitter to track what people are saying about your company, or to provide updates during outages? Leave a comment and tell us about it.
