Skype, Scalability and 'Boot Risk'

Why didn't Skype anticipate the problems that emerged during the recent two-day outage of its peer-to-peer IP telephony service, which is used by 220 million users? For some, the outage raised questions about the scalability of peer-to-peer technology. But as Todd Hoff notes at High Scalability, the growth of huge networks introduces variables that are difficult to predict and assess. "How could Skype possibly test booting 220 million servers over a random configuration of resources?" Todd asks. "Answer: they can't. Yes, it's Skype's responsibility, but they are in a bit of a pickle on this one." He continues:

"The boot scenario is one of the most basic and one of the most difficult scalability scenarios to plan for and test. You can't simulate the viciousness of real-life conditions in a lab because only real-life has the variety of configurations and the massive resources needed to simulate itself. It's like simulating the universe. How do you simulate the universe if the computational matrix you need is the universe itself? You can't. You end up building smaller models and those models sometimes fail."

Todd shares his own experiences with the "big boot scenario" and describes how these scenarios play out in centralized and peer-to-peer networks.