  • When The Power Goes Out at Google

    March 8th, 2010 : Rich Miller

    What happens when the power goes out at a Google data center? We found out on Feb. 24, when a power outage at a Google facility caused more than two hours of downtime for Google App Engine, the company’s cloud computing platform for developers. Last week the company released a detailed incident report on the outage, which underscored the critical importance of good documentation, even in huge data center networks with failover capacity.

    Most of Google’s recent high-profile outages have been caused by routing or network capacity problems, including outages in May and September of last year (see How Google Routes Around Outages for more). But not so with the Feb. 24 event.

    “The underlying cause of the outage was a power failure in our primary datacenter,” Google reported. “While the Google App Engine infrastructure is designed to quickly recover from these sort of failures, this type of rare problem, combined with internal procedural issues extended the time required to restore the service.”

    Power Down for 30 Minutes
    Data center power outages typically fall into two categories: those in which the entire data center loses power for an extended period, and those in which power is restored relatively quickly but hardware within the data center has trouble restarting properly. The Google App Engine downtime appears to fall into the latter category. Power to the primary data center was restored within a half hour, but a key group of servers failed to restart properly. The somewhat unusual pattern of the recovery presented the first challenge.

  • Wordpress.com Blogs Back After Outage

    February 18th, 2010 : Rich Miller

    The Wordpress.com blog hosting service has been offline this afternoon, causing downtime for more than 10 million blogs. “We’re investigating the source & most expedient fix,” Wordpress founder Matt Mullenweg posted on the service’s Twitter account. “We hope to have everyone’s blogs back & running as soon as possible.”

    UPDATE: As of 6:20 p.m. Eastern time, prominent sites hosted on Wordpress.com were beginning to come back online after about two hours of downtime. “We are back running at full capacity now,” Mullenweg reported at 6:30 p.m. “Closely monitoring services for any aftershocks.”

  • Transfer Switch Cited in NaviSite Outage

    January 28th, 2010 : Rich Miller

    NaviSite says it will overhaul the surge suppression system at its San Jose, Calif. data center in the wake of last week’s power outage at the facility. In an incident report to customers, NaviSite (NAVI) said the facility’s surge suppression system didn’t adequately protect relay fuses within an automated transfer switch (ATS) from a power surge at the onset of a utility outage on Jan. 19.

    The damage to the relay fuses left the ATS unable to start the facility’s diesel backup generators, as it normally would in a utility outage. With the generators offline, the data center switched over to battery power from the uninterruptible power supply (UPS).

    “During this time, the UPS batteries were drained, and once the batteries were drained, power was lost to the data center floor at approximately 4:56 am PST,” NaviSite reported. “Power was restored to the data center by manually starting the generators and transferring load to the generators. Power was restored to the data center at 5:35 am PST.”
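    The timeline in the incident report can be worked through with a little arithmetic. The sketch below is only an illustration: it takes the 4:56 a.m. and 5:35 a.m. times from the report above and assumes the 4:45 a.m. utility loss cited in the Jan. 19 post further down. The helper name `minutes_between` is hypothetical, and the reported times are approximate.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day clock times given as 'H:MM'."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Times as reported by NaviSite (Pacific time, approximate).
utility_lost = "4:45"          # assumption: taken from the Jan. 19 post
floor_power_lost = "4:56"      # UPS batteries drained
floor_power_restored = "5:35"  # generators started manually

ups_runtime = minutes_between(utility_lost, floor_power_lost)
floor_outage = minutes_between(floor_power_lost, floor_power_restored)

print(f"UPS carried the load for about {ups_runtime} minutes")
print(f"Data center floor was dark for about {floor_outage} minutes")
```

    On these figures the batteries held for roughly 11 minutes and the floor was without power for roughly 39 minutes, somewhat shorter than the early estimate of 45 to 60 minutes.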

  • U. of Penn Data Center Overheats

    January 20th, 2010 : Rich Miller

    As data centers get warmer, the environment gets less forgiving. That lesson was learned the hard way at the University of Pennsylvania, which on Tuesday had to shut off the IT systems supporting the school’s financial, research and student services.

    The University Data Center experienced “an excessive heat condition” on Tuesday afternoon, according to the Daily Pennsylvanian. The incident occurred when one of the two glycol pumps supporting the Data Center was accidentally switched from automatic to manual during an equipment replacement, resulting in overheating.

  • Twitter Down, Overwhelmed by Whales

    January 20th, 2010 : Rich Miller

    Twitter was offline this morning, experiencing its longest sustained downtime since an Aug. 6 outage from a denial of service attack. Reliability has been an ongoing project for Twitter as the service has scaled up to handle growing traffic. The popular microblogging service was offline for about an hour this morning, according to the Pingdom uptime monitoring service. UPDATE: Looks like the Twitter.com site is available again as of about 7:55 a.m. Eastern time.

    “We are experiencing an outage due to an extremely high number of whales,” reports the Twitter status page. “A sudden failure coupled with problems in switching to a backup system produced a high number of errors for around 90 minutes. This made the site largely inaccessible. No data was lost or compromised during this outage.”

    The “whales” comment refers to the “Fail Whale” – the downtime mascot that appears whenever Twitter is unavailable. The appearance of the Fail Whale indicates a server error known as a 503, which then triggers a “Whale Watcher” script that prompts a review of the last 100,000 lines of server logs to sort out what has happened.
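    The article does not describe how the “Whale Watcher” script works internally. As a rough sketch of the general idea (reviewing only the most recent lines of a log after an error), the Python below keeps the last N lines and tallies HTTP status codes. The function names, the sample data and the “<path> <status>” log format are all assumptions made for illustration, not Twitter’s actual tooling.

```python
from collections import Counter, deque
from typing import Iterable


def tail_lines(lines: Iterable[str], limit: int = 100_000) -> list[str]:
    """Keep only the last `limit` lines, holding at most `limit` in memory."""
    return list(deque(lines, maxlen=limit))


def count_statuses(lines: Iterable[str]) -> Counter:
    """Count status codes, assuming the status is the last field on each line."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if fields and fields[-1].isdigit():
            counts[fields[-1]] += 1
    return counts


# Hypothetical log lines in the assumed "<path> <status>" format.
sample = [f"/status/{i} 200" for i in range(5)] + ["/home 503", "/search 503"]

recent = tail_lines(sample, limit=4)  # only the 4 newest lines survive
summary = count_statuses(recent)
print(summary.most_common())
```

    In practice the lines would come from an open log file rather than a list; a bounded `deque` lets the script read a large file once without holding all of it in memory.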

    When at all possible, Twitter tries to adapt by slowing the site performance as an alternative to a 503. In some cases, this means disabling features like custom searches. In recent weeks Twitter.com users have periodically encountered messages that the service was over capacity, but the condition was usually temporary, occurring at times of heavy load. For more on how Twitter manages its capacity challenges, see Using Metrics to Vanquish the Fail Whale.
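    Degrading gracefully rather than returning a 503 can be pictured as a simple policy: switch off optional features as load climbs, and refuse requests only as a last resort. The sketch below is a minimal illustration under assumed numbers. The class name `LoadShedder`, the feature name and both thresholds are hypothetical and do not come from Twitter.

```python
from dataclasses import dataclass, field


@dataclass
class LoadShedder:
    """Turns off optional features as load rises, before refusing requests."""

    # Hypothetical thresholds: the fraction of capacity at which each
    # optional feature is disabled.
    feature_limits: dict = field(
        default_factory=lambda: {"custom_search": 0.80}
    )
    # Hypothetical point at which the site gives up and answers 503.
    reject_at: float = 0.95

    def enabled_features(self, load: float) -> set:
        """Features still switched on at the given load (0.0 to 1.0)."""
        return {
            name for name, limit in self.feature_limits.items() if load < limit
        }

    def status_for(self, load: float) -> int:
        """200 while the site can still serve (possibly degraded), else 503."""
        return 503 if load >= self.reject_at else 200


shedder = LoadShedder()
for load in (0.50, 0.85, 0.97):
    print(load, shedder.status_for(load), sorted(shedder.enabled_features(load)))
```

    At 85 percent load in this sketch the site still answers with a 200 but without custom search; only at 95 percent and above does it fall back to a 503.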

    Twitter’s last major downtime event was a 3 hour, 40 minute outage on August 6, when Twitter was among the social networking sites targeted by an electronic attack, which prompted the service to beef up its network defenses.

    While some Twitter-watchers continue to debate whether its growth is continuing, co-founder Ev Williams posted Jan. 12 that “across all metrics that matter, yesterday was Twitter’s highest-usage day ever. (And today will be bigger.)”

  • Storms KO NaviSite San Jose Data Center

    January 19th, 2010 : Rich Miller

    A NaviSite data center in Silicon Valley was without power for an hour this morning after severe storms knocked out the facility’s utility power from PG&E. NaviSite’s San Jose data center lost utility power from PG&E at 4:45 a.m. Pacific time, and backup power systems failed to operate as designed.   

    “Generator power has been restored to the data center in San Jose, but the site was without power for approximately 45 to 60 minutes,” NaviSite reported on the company blog. “The data center has been and continues to run on generator power.  We are still waiting for street power to become available, but will not switch back over until we have an understanding of what caused the original issue.”

  • Power Problems at Rackspace London Facility

    January 18th, 2010 : Rich Miller

    A UPS failure caused a power outage today at a Rackspace data center in London, leaving several hundred servers offline for hours as technicians were needed to help restart equipment. The incident occurred at the LON03 data center, one of several facilities in the company’s growing London operation.

    The power interruption started at 9:19 a.m. local time when a module failed on an uninterruptible power supply (UPS) and the unit failed to transfer the load properly, Rackspace said in its status update. Power was restored for most customers by 11:30 a.m., but a subset of servers failed to restart properly.

  • Performance Problems for Rackspace Cloud

    January 14th, 2010 : Rich Miller

    Rackspace reports that its cloud computing service is “degraded,” with many customers reporting their sites are unreachable. The company attributed the problem to an unusual load spike in the storage system supporting its cloud platform. The outage came several hours after the Rackspace Cloud disabled CRON, a command commonly used to automate tasks on Unix and Linux systems. By early evening, the company said performance had improved.

    “Starting yesterday we began experiencing very high loads on our storage devices for cluster WC1 in DFW,” Rackspace said on its status page. “In order to reduce load we have shut down processes like CRON to ensure core site content continue to load. While load spikes are common in our cloud infrastructure, we have not been able to fully identify the root cause of these unusual issues.”

  • Salesforce.com Hit by One Hour Outage

    January 4th, 2010 : Rich Miller

    Enterprise cloud computing provider Salesforce.com says it has resolved an outage that knocked its services offline for about an hour and 15 minutes this afternoon. Salesforce.com has nearly 68,000 customers using its online applications, including Dell, Dow Jones Newswires and SunTrust Banks. The company says the incident “affected all instances.”

    “The Salesforce.com Technology Team has resolved the service disruption issues on all instances from 12:10PM PST to 1:25PM PST,” the company reported on its status dashboard. “All services are restored at this time. We are performing a review of the incident and will take any corrective action needed. We apologize for any inconvenience this may have caused you and appreciate your patience.”

    UPDATE: Users on Twitter report continuing problems trying to log on to their apps. Salesforce.com reported a second, briefer outage affecting its NA2 instance. “The problem began at 1:49PM PST and was resolved as of 1:56 PM PST,” the company said.


All Content on Data Center Knowledge
© 2009 Miller Webworks LLC
All Rights Reserved