Breakthrough Failures That Help Sites Scale
Long before he was ending Roger Federer’s record run at Wimbledon, Rafael Nadal was burying the web servers at SmugMug with traffic. It all started in June 2005, when Nadal tore through the field en route to his first French Open championship. The Nadal fan site Vamos Brigade hosts thousands of images at SmugMug, which was hit with 150 gigabits of traffic when Nadal won the final.
“We basically crumbled,” said Don MacAskill, the founder and CEO of SmugMug. “There was some silver lining. Our images showed up fine, but we learned that there was something stupid we were doing on comments.”
For growing web sites, a sudden burst of enormous web traffic is almost always accompanied by learning experiences. MacAskill was among a group of web executives who shared insights about scaling for huge traffic during the recent Velocity and Structure 08 conferences. Here are some of the cases where sudden success led to scalability challenges:
- For DealNews, which tracks discounts on popular retail items, bargains on LCD televisions were the traffic magnet. Last December 18, the site’s coverage of LCD sales wound up on the front page of Yahoo. By Dec. 21, the site was seeing 40,000 unique visitors an hour, according to DealNews sysadmin Brian Moon.
- Last Christmas also provided the breakthrough traffic moment for Zoosk, a social dating network which launched its Facebook application on Dec. 19. CEO Shayan Zadeh says the Zoosk team thought the holiday break would allow them to launch in a relatively quiet time on Facebook. Instead, their app had 300,000 users by Christmas Day. “It just becomes this juggernaut, and you have to throw equipment at it,” said Zadeh.
- When photo-sharing pioneer Flickr was acquired by Yahoo, it graduated to a beefier architecture – in the nick of time, as it turns out. In the early hours of July 7, 2005, Flickr moved onto Yahoo’s servers. “We flipped from Peer1 to Yahoo an hour before the London bombings, and people in London were taking photos and uploading them to Flickr,” said John Allspaw, who runs the operations engineering group for Flickr. “We saw three times the traffic we had ever seen.”
- Salesforce.com found that its ability to scale was constrained by its data center neighbors. “A few years ago we did go through some serious issues, and actually it was because of eBay,” said Salesforce co-founder Parker Harris. “We needed more power and space from our data center, but they said they could give us more power but not more space, because eBay was in that data center. So we picked up and moved data centers. The scale-up was so complex, and we had the top people in the industry and it was too complex at that point. Changing everything and pushing the scale created a layer of complexity. We eventually worked it out and scaled it up.”
- eBay has had some scaling issues of its own over the years. “We had a few catastrophic issues in the 1999 to 2000 time frame,” said James Barrese, eBay’s VP of systems and architecture. “We knew we were broken because the site was down. In that day people were very forgiving, but what we had was a major outage. Now that doesn’t happen. A lot of things are automated now.” That includes a logging system to provide detailed information on performance problems. “We can flag and identify problems very quickly,” said Barrese. “It’s not very sexy, but if you don’t have it, you’re shooting in the dark.”