Resource Sharing Unleashes Performance Storms on the Data Center

Jagan Jagannathan is the founder and chief technology officer of Xangati.

We have all experienced the good and the bad of sharing in the world of computing. We share files on a server, we share a network for sending and receiving email, and we share resources as a number of people try to establish and participate in a web conference.

Today’s data centers are also sharing more and more resources, leading to better return-on-investment as their capacities are better utilized. However, while high capacity utilization is generally good, it can also lead to contention, much like users standing by a shared printer waiting for their printout to emerge.

Caught in the storm

When critical resources are shared to their capacity limits, shared computing environments can suffer spontaneous contention “storms” that degrade application performance and create a drag on end-user productivity.

At Xangati, we talk about “performance storms,” likening them to stormy weather that comes up seemingly out of nowhere and can disappear just as quickly, leaving a path of destruction. A performance storm in the computing environment leaves broken service-level agreements in its wake.

Wreaking havoc on the varied cross-silo shared resources in the data center, these storms can entangle multiple objects: virtual machines, storage, hosts, servers of all kinds, and applications. For example, you can experience:

Storage storms that occur when applications unknowingly and excessively share a datastore, deteriorating the storage performance.
Memory storms that occur when multiple virtual machines (VMs) compete for an insufficient amount of memory, or when a single VM “hogs” the available memory. In either case, performance takes a hit.
CPU storms that occur when there aren’t enough CPU cycles or virtual CPUs for virtual machines, leaving some with more and others with less.
Network storms that occur when too many VMs are attempting to communicate at the same time on a specific interface or when a few VMs hog a specific interface.
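
The descriptions above distinguish two patterns: many objects crowding a shared resource, and one object hogging it. As a minimal sketch of that distinction, assuming per-VM usage figures are already available for a single sample, the function below labels the sample. The function name, the 90 percent saturation threshold and the 50 percent hog share are illustrative assumptions, not values from any product.

```python
def classify_contention(usage_by_vm, capacity, saturation=0.9, hog_share=0.5):
    """Label one sample of a shared resource as 'none', 'hog' or 'crowd'.

    usage_by_vm: mapping of VM name -> units consumed in this sample.
    capacity:    total units the shared resource can deliver.
    saturation and hog_share are hypothetical thresholds chosen for illustration.
    """
    total = sum(usage_by_vm.values())
    # No storm if nothing is running or the resource is not near its limit.
    if not usage_by_vm or total < saturation * capacity:
        return "none"
    # A single VM taking at least hog_share of all usage counts as a hog.
    top = max(usage_by_vm.values())
    return "hog" if top >= hog_share * total else "crowd"


print(classify_contention({"vm1": 80, "vm2": 5, "vm3": 10}, capacity=100))   # hog
print(classify_contention({"vm1": 30, "vm2": 32, "vm3": 31}, capacity=100))  # crowd
```

The same check could apply to memory, CPU, storage or a network interface, since each case reduces to usage per object against a shared limit.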

Time ticks away

One brutal reality of these storms is their extreme brevity; many contention storms surge and subside within a matter of seconds. This short window in which to capture information about a storm can severely hamper an IT organization’s ability to track down its root cause. Often, the IT folks shrug their shoulders, understanding that the only remediation is to wait and see if it happens again.

Many management solutions, at best, identify only the effects of storms. The more daunting challenge is to perform a root-cause analysis. Three challenges complicate the problem:

Real-time Insights at Scale – Providing real-time insights into interactions in the environment is critical, but doing so “at scale” becomes more challenging as the number of objects in the network multiplies. One approach to solving this problem is to use in-memory analytics to quickly identify, analyze and remediate performance storms, because access to data in memory is typically orders of magnitude faster than access to data on disk.
Understanding Consumptive and Interactional Behaviors – The cause of contention storms cannot be identified without knowledge of both consumptive and interactional object behaviors. Consumptive behaviors pertain to how objects consume resources, and interactional behaviors pertain to how objects interact with each other. Today, technology is available to visualize and analyze the cross-silo interactions that are causing a contention storm as they occur.
Proper Capacity Utilization – Performance storms are difficult to remedy if the environment is improperly provisioned, which is hard to identify in a dynamic, virtual environment. Nevertheless, by analyzing the links between performance and capacity, users can reallocate or otherwise provision infrastructure resources to mitigate or avoid future storms.
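
To make the first two challenges concrete, here is a minimal sketch of keeping recent per-second samples in memory so that a storm lasting only a few seconds can still be examined after it subsides. It is an illustration under stated assumptions, not a description of any vendor's implementation: the class name `StormRecorder`, the 60-second window and the 90 percent saturation threshold are all hypothetical.

```python
from collections import deque


class StormRecorder:
    """Hold the last `window` per-second samples of one shared resource in memory."""

    def __init__(self, capacity, window=60, saturation=0.9):
        self.capacity = capacity
        self.saturation = saturation          # hypothetical threshold
        self.samples = deque(maxlen=window)   # oldest samples drop off automatically

    def record(self, second, usage_by_object):
        """Store one sample: a timestamp and a mapping of object name -> usage."""
        self.samples.append((second, dict(usage_by_object)))

    def _saturated(self, usage):
        return sum(usage.values()) >= self.saturation * self.capacity

    def storm_seconds(self):
        """Return the seconds in the window when the resource was saturated."""
        return [s for s, usage in self.samples if self._saturated(usage)]

    def top_consumers(self, n=3):
        """Rank objects by total usage during saturated seconds only."""
        totals = {}
        for _, usage in self.samples:
            if self._saturated(usage):
                for name, value in usage.items():
                    totals[name] = totals.get(name, 0) + value
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


recorder = StormRecorder(capacity=100)
recorder.record(1, {"vm-a": 20, "vm-b": 30})
recorder.record(2, {"vm-a": 25, "vm-b": 70})
recorder.record(3, {"vm-a": 30, "vm-b": 65})
recorder.record(4, {"vm-a": 20, "vm-b": 25})
print(recorder.storm_seconds())    # [2, 3]
print(recorder.top_consumers(1))   # [('vm-b', 135)]
```

The bounded `deque` keeps memory use fixed per resource, which matters when the number of tracked objects is large. This sketch captures only consumptive behavior. Interactional behavior would require recording which objects communicate with each other, which is not shown here.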

The challenge of scaling

Understanding what is happening on the network at any precise moment is critical to uncovering and fully understanding the causality of behaviors and interactions between objects. That capability usually requires scalability to track up to hundreds of thousands of objects on a second-by-second basis. In that environment, you gain a decided edge by deploying agent-less technology. Why? Because technologies that build on a multitude of agents do not scale.

In the end, it’s still no small feat to penetrate the innards of a complex infrastructure and track down the source of a random, possibly seconds-long, anomaly that can wreak havoc on performance. Once the anomaly passes, after all, there’s nothing left to examine. Ultimately, proper remediation comes down to deploying scalable technology for performance assurance and applying split-second responsiveness to quickly identify and eliminate any issue that could otherwise lead to significant performance loss – and worst of all, a poor end-user experience.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating. View previously published Industry Perspectives in our Knowledge Library.
