Michael Bushong is the vice president of marketing at Plexxi.
You might be surprised to hear that for most companies, Big Data is not really all that big.
While some Big Data environments can be massive, typical deployments are 150 nodes or fewer. The technology and applications are still fairly nascent, so outside of a relatively small number of players whose businesses are built on Big Data, most companies are just dipping their toes in the proverbial Big Data waters.
But just because the initial foray into Big Data is small doesn’t mean that architects building out infrastructure for a Big Data era can ignore scale altogether.
Scale, but not the way you would expect
In the networking world, the word ‘big’ evokes visions of massively scaled out applications spanning thousands of nodes. Based on the amount of data replication happening across these nodes, the interconnect that pulls together all of these resources will surely be enormous. Or will it?
While some web-scale properties are likely to develop and deploy Facebook-like applications, the vast majority of companies are unlikely to build their businesses around a small number of applications with this kind of scale. Instead, as initial experiments in Big Data yield fruit, companies will roll out additional small-to-medium sized applications that will add incremental value to the company. Perhaps an application that performs well for one line of business will be picked up by another line of business. Maybe a recommendation engine that yields additional retail revenue will be augmented with an application that looks at buyer history.
One by one, these applications add up. If a single small application has 150 nodes, six to eight such applications quickly climb into the 900-1,200 node range. If you consider that Big Data is really just one specific type of clustered application (alongside clustered compute, clustered storage and other distributed applications), the number of potential applications to consider grows even higher.
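As a back-of-the-envelope illustration of how these clusters accumulate, consider a hypothetical application mix anchored to the roughly-150-node figure above (the application names and node counts here are invented for the sketch, not taken from any real deployment):

```python
# Hypothetical per-application cluster sizes, each modest on its own.
# The ~150-node figure comes from the article; the mix is illustrative.
apps = {
    "recommendation_engine": 150,
    "buyer_history": 150,
    "fraud_detection": 120,
    "log_analytics": 180,
    "clustered_storage": 200,
    "ad_hoc_reporting": 150,
}

total = sum(apps.values())
print(f"{len(apps)} applications, {total} total nodes")
# Six small-to-medium applications already push the shared fabric
# toward the 1,000-node range discussed above.
```

No single application here is large, but the interconnect that serves all of them must be planned for the aggregate, not for any one cluster.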
Scaling for multiple applications is an entirely different beast from supporting a single massive application. With a single application, the requirements are uniform, allowing architects to design the network with a specific set of objectives in mind. But different applications have different requirements. Some are bandwidth-heavy. Others only run at certain times. Some are particularly sensitive to jitter and loss, while others carry strict compliance requirements (HIPAA, PCI). The challenge is that when you first design the network, you don’t know which of these requirements will ultimately land on it.
This problem is only exacerbated by how many of these smaller projects are initiated. For instance, a line of business spearheads the first move toward Big Data. It funds a couple of racks of equipment and connects them with fairly nondescript networking gear. It’s more an experiment than a core necessity for the business, so the work is done off to the side, which also shields the rest of the network from potential issues such as retransmission storms.
Isolating new deployments makes sense, but it encourages planning only for the short term. As each deployment is optimized for a specific application, architectural assumptions narrow. When these short-term experiments evolve into business-critical services, converging multiple applications that live in specialized or niche environments can be prohibitively painful. The choice then is either to maintain dueling infrastructures for each class of application, or to take on the burdensome task of converging on a common infrastructure.
Planning for convergence
Ultimately, the key to transitioning from the role of Big Data experimenter to that of Big Data practitioner is to plan from the outset for the eventual convergence of multiple applications with competing requirements onto a common, cost-effective infrastructure. The networking decisions at the outset need to account for a rich set of applications with different needs. This leads to a set of questions that architects should consider:
- If applications require different treatment, how are those requirements communicated to the network?
- Is the network capable of altering behavior to optimize for bandwidth requirements? Latency? Jitter? Loss?
- Can application traffic be isolated for compliance and auditing reasons?
- If growth over time leads to physical sprawl, how are issues with data locality handled? Do applications suffer if resources are 100 meters apart? 100 kilometers?
- When a cluster grows from 100 to 1000 nodes, what are the cabling implications? Do you have to change out cabling plants?
- What are the operational implications of implementing this diverse network?
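One way to make the questions above concrete is to write each application's requirements down in a structured form and check them against what the fabric can actually deliver before the application lands on it. The sketch below is purely hypothetical (the class, the fabric capabilities, and the threshold values are all invented for illustration), but it shows the shape of that exercise:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: record per-application network requirements
# up front, then compare them against the fabric's capabilities.
@dataclass
class AppRequirements:
    name: str
    peak_bandwidth_gbps: float
    max_latency_ms: float
    jitter_sensitive: bool
    compliance: tuple = field(default_factory=tuple)  # e.g. ("HIPAA",)
    max_node_distance_m: int = 100                    # data-locality budget

# Invented capabilities for an example fabric.
FABRIC = {
    "isolation_supported": True,   # can traffic be segregated for audit?
    "best_latency_ms": 2.0,        # lowest latency the fabric can offer
    "max_distance_m": 100_000,     # how far apart resources may sit
}

def gaps(app: AppRequirements, fabric: dict) -> list:
    """Return the requirements this fabric cannot meet for the app."""
    missed = []
    if app.compliance and not fabric["isolation_supported"]:
        missed.append("traffic isolation for compliance")
    if app.max_latency_ms < fabric["best_latency_ms"]:
        missed.append("latency budget")
    if app.max_node_distance_m > fabric["max_distance_m"]:
        missed.append("data locality / distance")
    return missed

analytics = AppRequirements("buyer_history", 40, 5.0, False, ("PCI",))
print(gaps(analytics, FABRIC))   # this app fits: no gaps reported
```

The value is not in the code itself but in forcing the conversation early: an application whose latency budget or compliance needs the fabric cannot meet is exactly the kind of mismatch that otherwise surfaces only during a painful convergence project.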
Network scaling for Big Data is more than just planning for the eventual number of nodes in a Hadoop cluster. Careful consideration for how to gracefully grow from one-off, bite-size deployments to a rich set of diverse applications is of paramount importance to avoid the types of unplanned architectural transformation projects that can wreak havoc on budgets and cripple IT departments for years. Plotting a deliberate architecture course from the outset is the surest way to guarantee success.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating. View previously published Industry Perspectives in our Knowledge Library.