IBM Generator Failure Causes Airline Chaos
October 12th, 2009 By: Rich Miller
A generator failure Sunday at an IBM data center in Auckland, New Zealand, crippled key services for Air New Zealand, prompting the airline’s CEO to publicly chastise Big Blue for the failure. The outage crashed airport check-in systems, as well as online booking and call center systems, affecting more than 10,000 passengers and throwing airports into disarray.
The problem occurred during planned maintenance at IBM’s Newton data center in Auckland. A generator failed during the maintenance window, dropping power to parts of the data center, including the mainframe operations supporting Air New Zealand’s ticketing. IBM says service was restored to most clients within an hour, but local media reports say Air New Zealand’s ticketing kiosks were offline for up to six hours.
Air New Zealand chief executive Rob Fyfe is not happy.
“In my 30-year working career, I am struggling to recall a time where I have seen a supplier so slow to react to a catastrophic system failure such as this and so unwilling to accept responsibility and apologise to its client and its client’s customers,” Fyfe wrote in an email to IBM, which then became public.
“We were left high and dry and this is simply unacceptable,” Fyfe added. “My expectations of IBM were far higher than the amateur results that were delivered yesterday, and I have been left with no option but to ask the IT team to review the full range of options available to us to ensure we have an IT supplier whom we have confidence in and one who understands and is fully committed to our business and the needs of our customers.”
Daniel Matthis Posted October 12th, 2009
That one is going to leave a mark.
TD Posted October 12th, 2009
Ouch! I guess the lesson is if you’re going to rely on a generator, don’t neglect its maintenance or any other part of the system, no matter how trivial it may seem. Of course even well-maintained equipment can fail. So have backups for the backup equipment? If the maintenance involved shutting down the main power and the generator became the primary power source, then yes, they should have had a backup generator.
Some questions for the system designers: how many simultaneous equipment failures are acceptable, and what are customers’ expectations?
Clearly the CEO of the airline expected that by hiring IBM nothing would fail ever. Is that realistic?
[...] pathetic performance, blowing Air New Zealand’s computer systems right out of the water (and here). The inability of data center providers to keep the lights on is getting [...]
“Clearly the CEO of the airline expected that by hiring IBM nothing would fail ever. Is that realistic?”
Based on their marketing speak and the amount they charge you for their services, it would certainly appear that they’re trying to project just such an image.
WF Posted October 13th, 2009
Can you get more details about this outage, e.g. what redundancies were built into the power delivery systems, such as N+1 gensets and the amount of UPS capacity? If this was truly a concurrently maintainable facility, then a single generator failure shouldn’t have caused such an outage. Thanks.
Here’s some additional information from an interview with Air New Zealand’s CIO, Julia Raue:
The incident occurred while running the main datacentre deliberately on generator power, in order to conduct maintenance on the uninterruptible power supply (UPS) system, Raue says.
“The intention was for the IBM team to bring down the UPS for maintenance, and run all systems on generator power deliberately bypassing the UPS during this maintenance window.
One hour into the window, the generator failed leaving all systems with no power,” says Raue.
The quickest expedient was to shift the systems back to mains power and this was done within “a matter of minutes”.
Unfortunately there had been a “crude and unclean shutdown of all systems”, she says. “On restart, some data corruption and reboot issues were experienced across various platforms.” Some key systems were then brought up at the secondary site.
The account isn’t explicit, but the references are all to a single generator.
AC Posted October 13th, 2009
Lashing out at a service provider never helps the business relationship. Why would a major airline not have redundancy within the databases and applications? Relying on one data center goes against even a two-page BCDR policy statement…
I can tell you from my years of experience that both parties share blame in this outage; however, I would put more of it on the airline than the data center.
Data centers need to maintain their systems properly, and to do so equipment needs to be completely shut down for full maintenance. However, it is never a good idea to rely on backup systems to run your site if there is a reliable utility present. IBM must have a standard MOP that requires generators to run during PMs, which should be looked at.
The customer, on the other hand, has put their critical systems into an N facility (unless they were told a story about the site’s redundancies), which by all accounts is a very bad idea. They not only had a single generator, they also had only a single UPS feeding their equipment. Seems to me that the CEO of the airline should be looking within his org to find out who made the decision to colo at this site, lash out there, and then begin the search for a higher-Tier, concurrently maintainable site.
Chris Posted October 17th, 2009
I thought no one ever got fired for buying IBM…
Grant Posted October 20th, 2009
As a NZer living overseas this caught my eye… I’ve seen many stories and there does seem to be a lot of detail missing… what was the resilience available? As suggested, running this sort of facility without N+1 (at least) seems like a BAD idea… but what are the restrictions imposed on IBM in terms of the facility (managed or owned by IBM?), costs, etc. And it seems poor practice not to expect (demand?) this. All fingers (and articles) point at the mainframe, but what else was involved in the SERVICE outage? At the end of the day a power failure will take out midrange, storage and whatever else is connected, so it’s hardly the mainframe’s fault. “On restart, some data corruption and reboot issues were experienced across various platforms.” – maybe it wasn’t isolated to one platform…
[...] Generator failure at IBM data center; aviation chaos – Auckland, New Zealand. [...]
DS Posted June 1st, 2010
Should have had a backup plan in place when the primary generator failed. Go to an alternate generator. Rent a backup generator next time, dummy.
Michelle Posted August 26th, 2012
Air NZ spent years deferring spend on the most basic requirements. Furthermore, they then let a huge number of highly skilled staff go, which resulted in very poor relationship and vendor management.
The blame game is no use to anyone, and the Air NZ approach to vendor management to this day is famously archaic.
Where else in the world but Kiwi land would you have the CEO and CIO sledging a supplier who they don’t listen to?