Do you have enough servers to handle the growth of your web site? And if so, do you have enough data center space to house all your servers - including the ones you'll need to add between now and the time that new data center space is ready?
Capacity planning is one of the most critical and challenging tasks for managers of growing web sites. Today's social networks and cloud builders are finding that much of their capacity planning is taking place in uncharted territory. At the Structure 2010 conference Thursday, infrastructure managers for some of the web's largest sites compared notes on how they scale to deliver content to hundreds of millions of users.
"We really look at the only data you have - the past predicting the future," said Mark Williams, VP of Operations for the Zynga Gaming Network, which has more than 240 million people playing it social games every day.
Most of Zynga's users are on Facebook, which has surged past 400 million users. Like Williams, Facebook's Jay Parikh emphasized the importance of data, and the right tools and technology to gather the data you need.
"You have to make sure you have the right instrumentation and the right metrics," said Parikh, the director of engineering at Facebook. "Understanding what's happening to the system is critical to detecting problems."
Load Testing for New Apps
Facebook is constantly adding new features, even as its partners are launching new applications and games. An important challenge is calculating the impact a new feature may have on Facebook's infrastructure. Parikh said the Facebook team conducts load testing by introducing new applications to small segments of its user base to gather data on how they will perform once they are deployed across the Facebook platform. This testing allows Facebook to make any adjustments or fine-tuning prior to launch.
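To make the idea concrete, here is a minimal Python sketch of exposing a feature to a small segment of users and extrapolating the load observed there. It is not Facebook's actual method: the function names, the hash-based bucketing, and the linear extrapolation are all assumptions made for illustration.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the rollout segment.

    Hypothetical helper: hashes the feature name and user ID into one of
    10,000 buckets so the same user always gets the same answer.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # bucket in 0..9999
    return bucket < percent * 100  # percent=5 selects buckets 0..499


def projected_peak_load(sample_peak_rps: float, percent: float) -> float:
    """Naively scale the peak load seen on the sample to all users.

    Assumes load grows linearly with audience size, which real systems
    often violate (caches, viral effects, shared back ends).
    """
    return sample_peak_rps * 100.0 / percent
```

For instance, if a feature shown to 1 percent of users peaks at 50 requests per second, `projected_peak_load(50, 1)` gives 5,000 requests per second as a rough first estimate for the full launch.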
"At Facebook we try to make sure the infrastructure is a couple steps ahead of where the software engineers want to take it," said Parikh. "We strive to ensure that the infrastructure never gets in the way."
The viral growth of games on Facebook and other social networks is a scalability challenge that extends beyond Zynga's infrastructure. "There have been times that new features have begun to cause problems for partners we depend upon," said Williams, who noted that Zynga relies heavily upon its ties with Facebook, PayPal and iTunes.
Combining Data Centers and the Cloud
Zynga manages its infrastructure using a combination of its own data centers and servers running on Amazon Web Services' EC2 service, which are managed by RightScale. "We will continue to be balanced across both data centers and the public cloud."
Tom Mornini, the CTO and founder of Engine Yard, said the challenging nature of capacity planning makes the argument for cloud computing. "We have some of the largest web companies in the world up here, and they're saying they have trouble forecasting capacity," said Mornini. "It's totally indicative of why, short of massive scale, people shouldn't think about racking their own servers."
The ability to scale quickly to manage traffic spikes is a key selling point for cloud services. But it's not simple, as noted by participants in a session Wednesday at Velocity 2010 that looked at auto-scaling web applications on cloud platforms.
"I'm always worried about the magical view of the cloud," said George Reese, Chief technology Officer at enStratus, which provides cloud management. "That leads to disappointment, and I've seen lots of disappointment with auto-scaling."
A Good Problem to Have
Panelists at Structure 2010 said the need for large-scale capacity planning is a derivative of success. "Scale is a good problem to have," said Todd Papaioannou, VP of Cloud Infrastructure at Yahoo, who said that data is important, but good planning is even more important.
"The only way to get around it is to separate out your architecture," he said. "You need to really think hard about your abstraction layers, because you're going to be living with them for a long time."
Matt Mengerink, the VP of Site Operations at PayPal, agreed that good planning is crucial - and that includes planning for challenges that you can't see coming. "Having the flexibility to rearchitect when we need to is crucial," he said. "You're never going to get it all right."
Here's a video of the Structure 10 panel, courtesy of GigaOm: