Outages at PayPal, Mozilla, DreamHost, Media Temple
There were a number of outages this weekend at prominent sites. Most were brief, and the longest of them actually contained good news for customers.
- PayPal, one of the world’s largest payment processing sites, had an outage Friday night. “The PayPal website was partially unavailable for approximately an hour starting at 10:30 PM,” the company said in an advisory. “During this time PayPal users may have received Page Cannot be Displayed errors and may not have been able to complete PayPal payments.” An hour doesn’t seem like very long, but PayPal processed $14.04 billion in payments in 4Q07 – a rate of about $6.3 million per hour.
- The Mozilla Corporation had about 90 minutes of downtime on March 18. “From initial investigation, it appears that one of the switches in a blade server chassis had a software issue, causing a network-wide broadcast storm,” Mozilla’s Justin Fitzhugh writes in an outage report. “Overall effect was that the switching fabric for our San Jose datacenter was unusable.” Mozilla is also looking for more mirrors.
- It was bad news, good news for DreamHost customers. The bad news: Many DreamHost services and sites were offline for 12 hours Friday and Saturday as the company migrated a major server cluster between data centers. The good news? The move went smoothly and the sites and services were back online as scheduled. We’ve often noted problems at DreamHost, and wanted to give equal time to them getting this one right. Not all data center migrations go smoothly, as customers of ValueWeb and Alabanza could tell you.
- Media Temple had some unexpected downtime Sunday afternoon for one of its Grid-Service clusters, which followed a brief outage Saturday.
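The hourly figure quoted for PayPal can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: it assumes the $14.04 billion was spread evenly across every hour of the fourth quarter (October through December), which real payment traffic certainly is not.

```python
# Rough check of the hourly payment rate implied by the quarterly total.
# Assumption: volume is spread evenly over every hour of the quarter.
QUARTERLY_VOLUME_USD = 14.04e9   # PayPal 4Q07 payment volume, as quoted in the post
DAYS_IN_Q4 = 31 + 30 + 31        # October + November + December

hours_in_quarter = DAYS_IN_Q4 * 24
hourly_rate_usd = QUARTERLY_VOLUME_USD / hours_in_quarter

print(f"Hours in quarter: {hours_in_quarter}")
print(f"Average volume per hour: ${hourly_rate_usd / 1e6:.2f} million")
```

Under that even-spread assumption the average lands between $6.3 million and $6.4 million per hour, in line with the estimate above.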
The blade-switch error is not uncommon. One of our best engineers was at a customer site where the customer was complaining that their network was performing slowly. A detailed look showed that one of the Blade Switches (vendor to remain nameless, but not from us) had decided it wanted to be the root bridge for the entire L2 domain.
A few ways to solve this:
1) hard-code the bridge priority
2) turn on ‘Root Guard’ in the aggregation switches
3) design Spanning Tree out of the network, since it is a topology-bound protocol (this is getting harder because of server virtualization)
Not an uncommon event though, so worth checking into in many infrastructures where blade switches may be rather ‘haphazardly’ deployed.
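To illustrate why a blade switch can end up as root bridge, and why hard-coding the priority helps, here is a minimal sketch of the Spanning Tree root election rule: the bridge with the lowest bridge ID wins, where the ID is the priority followed by the MAC address as tiebreaker. The switch names and MAC addresses are hypothetical, and the model is heavily simplified (no BPDU exchange, no per-VLAN instances, no Root Guard).

```python
from dataclasses import dataclass, replace

DEFAULT_PRIORITY = 32768  # default bridge priority in IEEE 802.1D


@dataclass(frozen=True)
class Bridge:
    name: str                         # hypothetical label for the switch
    mac: str                          # hypothetical MAC address
    priority: int = DEFAULT_PRIORITY  # lower value is preferred as root

    def bridge_id(self):
        # Priority is compared first; the MAC address only breaks ties.
        return (self.priority, int(self.mac.replace(":", ""), 16))


def elect_root(bridges):
    """Return the bridge with the numerically lowest bridge ID."""
    return min(bridges, key=Bridge.bridge_id)


# Everything left at the default priority: the lowest MAC address wins,
# which here happens to belong to the blade switch.
agg1 = Bridge("agg-1", "00:1b:aa:00:00:01")
agg2 = Bridge("agg-2", "00:1b:aa:00:00:02")
blade = Bridge("blade-sw", "00:0a:00:00:00:01")
root_by_default = elect_root([agg1, agg2, blade])
print("Root with defaults:", root_by_default.name)

# Fix 1 from the list above: hard-code a lower priority on the intended root.
agg1_fixed = replace(agg1, priority=4096)
root_after_fix = elect_root([agg1_fixed, agg2, blade])
print("Root after hard-coding priority:", root_after_fix.name)
```

With every switch at the default priority, the election comes down to whichever device has the lowest MAC address, so a newly added blade switch can take over as root by accident. Setting an explicit lower priority on the aggregation switch makes the outcome deterministic.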
Mike Sullivan Posted March 24th, 2008
Netflix.com has been down for at least a few hours on Monday morning. First the site sent back “Service Unavailable” errors, and now it is resetting connections without loading any pages.