NEW YORK - The amazing thing isn't that systems like the HealthCare.gov government web site can fail in spectacular fashion, says Dr. Richard Cook. It's that it doesn't happen more often.
"The systems we build are so expensive and so important that we always seem to run at the edge of failure," said Cook, an expert on engineering failures from the Royal Institute of Technology in Stockholm. "Every system always operates at its capacity. As soon as there is some improvement or some new technology, we stretch it."
Cook's presentation on reliability in complex systems was one of the highlights of last week's Velocity 2013 conference, which focused on web performance and how to avoid the kind of headaches being experienced by HealthCare.gov, the online insurance marketplace created by the Affordable Care Act. The site has been plagued by problems, with many users unable to access it and others stymied by enrollment problems.
Rush to Fix a Broken Web Site
"The experience on HealthCare.gov has been frustrating for many Americans," said the Department of Health & Human Services in a blog post. "Some have had trouble creating accounts and logging in to the site, while others have received confusing error messages, or had to wait for slow page loads or forms that failed to respond in a timely fashion."
Today President Barack Obama will announce steps to address the problems with HealthCare.gov, including additional phone support for enrollees and initiatives to fix the broken elements of the web application.
"Our team is bringing in some of the best and brightest from both inside and outside government to scrub in with the team and help improve HealthCare.gov," the department said.
It's a familiar refrain for Cook. "When you have a healthcare.gov experience, everybody says 'I don't care what it costs, get it back up!'" he said. "I don't care how many people you have to put there, get it up! You don't care (cost) anymore, because you've got a big problem."
But Cook says even accidents and downtime rarely have a permanent effect on the tendency to push systems to "the hairy edge of failure."
Pushing the Operational Boundaries
Cook brings a unique perspective to systems failure. He's an anaesthesiologist and expert on healthcare safety who has also worked in engineering and supercomputing system design. His research has been used in improving systems ranging from semiconductor manufacturing to military software systems. He says it is the nature of complex systems to establish an operating comfort zone and then gradually push the boundaries.
"We make an imaginary line within the accident boundary that is our margin of safety," said Cook. "We don't have a lot of accidents, so we don't have a good idea of exactly where that boundary exists.
"So we're always flirting with the margin," he continued. "What is surprising about this world is not that there are so many accidents. It is that there are so few. The thing that amazes you is not that your system goes down sometimes. It is that it's up at all."
The Front Lines of Downtime
Many of the attendees at Velocity are working inside that margin, seeking to coax every ounce of efficiency and performance out of web sites and applications. Some are actively engaged in defining the boundaries of failure, such as Netflix with its use of the "Chaos Monkey" and other tools that introduce random failures to test the resiliency of its systems.
Cook says Internet infrastructure will only become more important, raising the bar for reliability testing.
"The future of all your systems, although you do not realize it right now, is safety," said Cook. "Your web applications systems are becoming business-critical systems. The future of your systems is to be involved intimately with some level of safety. All of your systems will become safety critical."
Here's a video of Cook's talk at Velocity, which includes a look at tools for understanding the factors in this "drift" toward the margin of safety. This video runs 19 minutes.