  • Transfer Switch Cited in NaviSite Outage

    January 28th, 2010 : Rich Miller

    NaviSite says it will overhaul the surge suppression system at its San Jose, Calif. data center in the wake of last week’s power outage at the facility. In an incident report to customers, NaviSite (NAVI) said the facility’s surge suppression system didn’t adequately protect relay fuses within an automated transfer switch (ATS) from a power surge at the onset of a utility outage on Jan. 19.

    The damage to the relay fuses left the ATS unable to start the facility’s diesel backup generators, as it normally would in a utility outage. With the generators offline, the data center switched over to battery power from the uninterruptible power supply (UPS).

    “During this time, the UPS batteries were drained, and once the batteries were drained, power was lost to the data center floor at approximately 4:56 am PST,” NaviSite reported. “Power was restored to the data center by manually starting the generators and transferring load to the generators. Power was restored to the data center at 5:35 am PST.”

  • U. of Penn Data Center Overheats

    January 20th, 2010 : Rich Miller

    As data centers get warmer, the environment gets less forgiving. That lesson was learned the hard way at the University of Pennsylvania, which on Tuesday had to shut off the IT systems supporting the school’s financial, research and student services.

    The University Data Center experienced “an excessive heat condition” on Tuesday afternoon, according to the Daily Pennsylvanian. The incident occurred when one of the two glycol pumps supporting the Data Center was accidentally switched from automatic to manual during an equipment replacement, resulting in overheating.

  • Twitter Down, Overwhelmed by Whales

    January 20th, 2010 : Rich Miller

    Twitter was offline this morning, experiencing its longest sustained downtime since an Aug. 6 outage from a denial of service attack. Reliability has been an ongoing project for Twitter as the service has scaled up to handle growing traffic. The popular microblogging service was offline for about an hour this morning, according to the Pingdom uptime monitoring service. UPDATE: Looks like the Twitter.com site is available again as of about 7:55 a.m. Eastern time.

    “We are experiencing an outage due to an extremely high number of whales,” reports the Twitter status page. “A sudden failure coupled with problems in switching to a backup system produced a high number of errors for around 90 minutes. This made the site largely inaccessible. No data was lost or compromised during this outage.”

    The “whales” comment refers to the “Fail Whale” – the downtime mascot that appears whenever Twitter is unavailable. The appearance of the Fail Whale indicates a server error known as a 503, which then triggers a “Whale Watcher” script that prompts a review of the last 100,000 lines of server logs to sort out what has happened.
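
    As a rough illustration of the kind of log review described above, the following Python sketch keeps only the last 100,000 lines of a log and counts HTTP status codes. It is not Twitter’s “Whale Watcher” script: the log format (status code as the last whitespace-separated field) and the names `tail`, `status_counts` and `needs_review` are assumptions made for this example.

```python
from collections import Counter, deque
from typing import Iterable, List

TAIL_LINES = 100_000  # review window mentioned in the article


def tail(lines: Iterable[str], n: int = TAIL_LINES) -> List[str]:
    """Return only the last n lines of an iterable of log lines."""
    return list(deque(lines, maxlen=n))


def status_counts(lines: Iterable[str]) -> Counter:
    """Count status codes, assuming the code is the last field on each line."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if fields and fields[-1].isdigit():
            counts[fields[-1]] += 1
    return counts


def needs_review(lines: Iterable[str], n: int = TAIL_LINES) -> bool:
    """Flag the log for review if any 503 appears in the last n lines."""
    return status_counts(tail(lines, n))["503"] > 0
```

    Passing an open log file to `needs_review` would flag it whenever a 503 appears within the review window.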

    When at all possible, Twitter tries to adapt at times of heavy load by slowing site performance as an alternative to a 503. In some cases, this means disabling features like custom searches. In recent weeks Twitter.com users have periodically encountered messages that the service was over capacity, but the condition was usually temporary. For more on how Twitter manages its capacity challenges, see Using Metrics to Vanquish the Fail Whale.
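
    The idea of degrading service before failing outright can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Twitter’s implementation: the load thresholds, the feature names and the `plan_response` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical feature names; only "custom searches" is mentioned in the article.
CORE_FEATURES: Tuple[str, ...] = ("timeline", "post")
OPTIONAL_FEATURES: Tuple[str, ...] = ("custom_search",)


@dataclass(frozen=True)
class Response:
    status: int
    features: Tuple[str, ...]


def plan_response(load: float, degrade_at: float = 0.8, fail_at: float = 1.0) -> Response:
    """Choose a response for a load ratio (1.0 means the service is at capacity)."""
    if load >= fail_at:
        # Over capacity: nothing left to shed, so return a 503.
        return Response(503, ())
    if load >= degrade_at:
        # Heavy load: keep core features, drop optional ones.
        return Response(200, CORE_FEATURES)
    return Response(200, CORE_FEATURES + OPTIONAL_FEATURES)
```

    Under these assumed thresholds, a load of 0.9 still returns a 200 but without custom search, while anything at or above 1.0 falls back to a 503.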

    Twitter’s last major downtime event was a 3-hour, 40-minute outage on August 6, when Twitter was among the social networking sites targeted by an electronic attack, which prompted the service to beef up its network defenses.

    While some Twitter-watchers continue to debate whether its growth is continuing, co-founder Ev Williams posted Jan. 12 that “across all metrics that matter, yesterday was Twitter’s highest-usage day ever. (And today will be bigger.)”

  • Storms KO NaviSite San Jose Data Center

    January 19th, 2010 : Rich Miller

    A NaviSite data center in Silicon Valley was without power for an hour this morning after severe storms knocked out the facility’s utility power from PG&E. The San Jose data center lost utility power at 4:45 a.m. Pacific time, and backup power systems failed to operate as designed.

    “Generator power has been restored to the data center in San Jose, but the site was without power for approximately 45 to 60 minutes,” NaviSite reported on the company blog. “The data center has been and continues to run on generator power.  We are still waiting for street power to become available, but will not switch back over until we have an understanding of what caused the original issue.”

  • Power Problems at Rackspace London Facility

    January 18th, 2010 : Rich Miller

    A UPS failure caused a power outage today at a Rackspace data center in London, leaving several hundred servers offline for hours as technicians were needed to help restart equipment. The incident occurred at the LON03 data center, one of several facilities in the company’s growing London operation.

    The power interruption started at 9:19 a.m. local time when a module failed on an uninterruptible power supply (UPS) and the unit failed to transfer the load properly, Rackspace said in its status update. Power was restored for most customers by 11:30 a.m., but a subset of servers failed to restart properly.

  • Performance Problems for Rackspace Cloud

    January 14th, 2010 : Rich Miller

    Rackspace reports that its cloud computing service is “degraded,” with many customers reporting their sites are unreachable. The company attributed the problem to an unusual load spike in the storage system supporting its cloud platform. The outage came several hours after the Rackspace Cloud disabled CRON, a scheduling utility commonly used to automate tasks on Unix and Linux systems. By early evening, the company said performance had improved.

    “Starting yesterday we began experiencing very high loads on our storage devices for cluster WC1 in DFW,” Rackspace said on its status page. “In order to reduce load we have shut down processes like CRON to ensure core site content continue to load. While load spikes are common in our cloud infrastructure, we have not been able to fully identify the root cause of these unusual issues.”

  • Salesforce.com Hit by One Hour Outage

    January 4th, 2010 : Rich Miller

    Enterprise cloud computing provider Salesforce.com says it has resolved an outage that knocked its services offline for about an hour and 15 minutes this afternoon. Salesforce.com has nearly 68,000 customers using its online applications, including Dell, Dow Jones Newswires and SunTrust Banks. The company says the incident “affected all instances.”

    “The Salesforce.com Technology Team has resolved the service disruption issues on all instances from 12:10PM PST to 1:25PM PST,” the company reported on its status dashboard. “All services are restored at this time. We are performing a review of the incident and will take any corrective action needed. We apologize for any inconvenience this may have caused you and appreciate your patience.”

    UPDATE: Users on Twitter report continuing problems trying to log on to their apps. Salesforce.com reported a second, briefer outage affecting its NA2 instance. “The problem began at 1:49PM PST and was resolved as of 1:56 PM PST,” the company said.

  • Technician Injured in Peer 1 Power Outage

    December 30th, 2009 : Rich Miller

    Colocation provider Peer 1 said a technician was seriously injured during an incident Wednesday night that knocked out power at its data center at 151 Front Street in Toronto. The company said the injuries “appear to be non-life threatening.”

    The service technician from Eaton Corp. was injured during scheduled maintenance on the uninterruptible power supply (UPS) system in Peer 1’s fourth-floor data center at 151 Front, the largest carrier hotel and data center hub in the Toronto market. The technician was replacing a failed fan in a UPS unit.

    “During the maintenance at approximately 8:46PM EST there was a visible arc flash (no confirmed cause yet) from the UPS causing 2nd and 3rd degree burns to the Eaton service technician and witnessed by our data center manager who also went to the hospital to have his eyes checked for retinal burns,” said Ryan Murphey, Vice President of Facilities & Data Center Operations for Peer 1, in an e-mail update. “The Eaton service tech was transferred to another hospital with a burn unit and was in serious/critical condition and our data center manager was treated and released.”

    All staff were cleared from the fourth-floor suite while fire and police officials investigated. Staff were later allowed to return, and power was restored about three hours after the incident. Peer 1 said it had additional staff on hand to provide support for customer site restoration.

    “Wrap around bypass was inspected for damage by our electrical service company and engaged after given approval by the building, fire department, Toronto Hydro and the MOL (Ministry of Labor) restoring power to the data center suite at approximately 11:31PM EST,” Murphey wrote.

    Peer 1 provided customer updates on the incident on its support forum.

  • DNS Issues Cause Downtime for Major Sites

    December 23rd, 2009 : Rich Miller

    Some of the web’s most prominent sites experienced downtime and sluggishness Wednesday night due to problems with DNS services. The issues were most pronounced at UltraDNS, which reported that its performance problems were caused by an electronic attack.

    The list of sites experiencing problems included Salesforce.com, Amazon Web Services and Walmart.com. DNS is short for the domain name system, which serves as a roadmap allowing users to find web sites. Domain registries like VeriSign provide centralized web lookups through a network of data centers, while commercial DNS service providers like UltraDNS offer additional tools to manage traffic.

    UltraDNS told CNet that it was hit by a distributed denial of service attack (DDoS) targeting its west coast infrastructure at data centers in San Jose and Palo Alto. A DDoS attack targets a site or provider with large volumes of traffic in an attempt to overwhelm its ability to serve content.

    That logjam at UltraDNS caused ripples across the Internet, creating uptime problems for several major service providers.


All Content on Data Center Knowledge
© 2009 Miller Webworks LLC
All Rights Reserved