Managing Huge, Fast-Moving Traffic Spikes

Startups aren't the only web properties that struggle to scale when large traffic spikes arrive. Established sites with years of traffic-management experience are also having trouble handling huge, fast-moving spikes driven by major news sites and social media hubs. Todd Hoff at High Scalability discusses an interesting post on this subject by Theo Schlossnagle, the author of Scalable Internet Architectures. Here's a snippet in which Theo outlines the challenge of managing a site when the Slashdot Effect can come from anywhere:

These spikes happen inside 60 seconds. The idea of provisioning more servers (virtual or not) is unrealistic. Even in a cloud computing system, getting new system images up and integrated in 60 seconds is pushing the envelope and that would assume a zero second response time. This means it is about time to adjust what our systems architecture should support. The old rule of 70% utilization accommodating an unexpected 40% increase in traffic is unraveling. At least eight times in the past month, we've experienced from 100% to 1000% sudden increases in traffic across many of our clients. ... We see these spikes happen inside 60 seconds and they occasionally induce a ten-fold increase over trended peaks.
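To put rough numbers on the headroom rule Theo mentions, here is a minimal Python sketch. It assumes, purely for illustration, that utilization scales linearly with traffic, which real systems rarely do near saturation. The function names are hypothetical and do not come from either post.

```python
def utilization_after_spike(baseline: float, multiplier: float) -> float:
    """Utilization after traffic is multiplied by `multiplier`.

    Assumes load grows linearly with traffic (an illustrative
    simplification). A 40% increase is a multiplier of 1.4.
    """
    return baseline * multiplier


def max_baseline_for_spike(multiplier: float) -> float:
    """Highest steady-state utilization that still fits a spike of the
    given multiplier within 100% of capacity, under the same assumption."""
    return 1.0 / multiplier


if __name__ == "__main__":
    # The old rule: 70% utilization absorbing a 40% increase.
    print(f"70% baseline, 1.4x traffic: {utilization_after_spike(0.70, 1.4):.0%}")
    # A tenfold spike against the same baseline.
    print(f"70% baseline, 10x traffic:  {utilization_after_spike(0.70, 10):.0%}")
    # How low the baseline would need to be to absorb a tenfold spike.
    print(f"Max baseline for 10x spike: {max_baseline_for_spike(10):.0%}")
```

Under that assumption, the old rule leaves a system at about 98% of capacity after a 40% increase, while absorbing a tenfold spike on spare capacity alone would require running at roughly 10% utilization in normal conditions. This is why the quoted passage argues that the rule is unraveling.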

So what to do? Both Theo and Todd share strategies and priorities for these scenarios.

Todd writes that "VMs don't spin up in less than 60 seconds so your ability to respond to such massive quick spikes is limited." As a result, traffic surges are likely to be managed best by sites where performance is already a priority:

Interestingly, Theo ties handling sudden unexpected spikes back to performance. We are always told performance and scalability are separate issues. And while I accept this notionally, in my heart of hearts I think they have more in common than not and I think Theo nails why. A well performing system acts as a kind of reservoir for handling spikes before you can ever notice there's a spike. That gives you some time to add more resources to your site if a spike continues. Without that reservoir you are just crushed.