James Sivis is VP, Sales & Marketing, Circonus, and has served as a senior executive in high-tech sales, marketing and business development. James has 25 years of experience with firms including Alcatel and Bellcore/Telcordia/SAIC.
Last month, I attended the Gartner Data Center Conference in Las Vegas and I wanted to share with you some of my impressions and insights from the event.
First, I have to say that I have seldom seen a more conscientious group of conference attendees. Networking breakfasts were busy, sessions were well attended, and both lunch and topic-specific networking gatherings had lively discussions. Each of the Solution Center hours, going well into the evening, was full of people voraciously soaking up information from the various exhibitors. Even in hallways during the day, there was a steady exchange of opinions and information. Attendees were very serious about learning from the speakers, the vendors, and their peers.
Significant Focus: Avoiding Outages, Speeding Recovery
Now let’s get to what was most frequently on attendees’ minds. I was somewhat surprised to find that it was a topic that’s not usually on the Top 10 lists of CIO/IT initiatives. What repeatedly came first among attendees’ pressing interests were the interrelated topics of avoiding IT outages and speeding up service recovery, along with monitoring to help with both of these goals.
Granted, this was a data center-specific conference, so it’s natural that avoiding and recovering from operational failures is of paramount importance. But there are lots of other data center initiatives we hear much more about, such as virtualization, cloud migration and data center consolidation. Many of these headline-grabbing topics are important. However, this issue of primary importance, which affects data center operations leaders’ daily lives and careers, has received little if any notice or press.
The Basics Continue to Be Important
Why is that? It’s pretty simple. Some of these other initiatives are new. Monitoring has been around seemingly forever. Plus, to an extent, outages are accepted as somewhat unavoidable. Yet while zero failures is indeed not possible, markedly increased reliability is certainly attainable. Look at the historical telecom service provider side, where five-nines (99.999 percent) reliability is the expected level of service. When expectations are high, and commensurate investment is made, higher levels of reliability are within reach.
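To put that telecom benchmark in perspective, here is a rough sketch of the arithmetic behind "nines" of availability. It assumes a 365-day year, and the function name is hypothetical, chosen only for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an availability of N nines.

    For example, nines=5 means 99.999 percent availability.
    """
    unavailability = 10 ** (-nines)  # e.g. 0.00001 for five nines
    return MINUTES_PER_YEAR * unavailability


for n in (2, 3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a budget of a little over five minutes of downtime per year, while two nines allows several days.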
As for monitoring solutions themselves, nowadays you don’t have to be limited to old-school systems. There are young companies, like Circonus, that take a fresh approach, breaking down the silos created by the stand-alone toolsets of the past.
Let’s take a step back now and visualize what outages look like from a data center ops team's perspective, i.e. what happens when things “blow up” in a data center. It’s not external constituents such as clients that directly impact the data center for the most part. External clients touch the business units, and it’s then the business units that put the heat on the data center leaders.
What Have You Done For Me Lately?
And what about Service Level Agreements (SLAs) for keeping business units apprised of the benefit IT delivers to them? As I heard loud and clear at the Gartner conference, internal SLAs are for the most part useless. Why? Because they don’t mean much to the business units, which are only interested in “When are you going to get my service back up?!” In other words, this is a variation on, “What have you done for me lately?”
So let’s look at an option for resolution. If the problem occurs on a virtual machine, you just spin up a new instance, right? Wrong, but that’s what usually happens. When a hammer dangling off a shelf hits you on the head, do you replace it with another dangling hammer and think you’ve solved the problem? Obviously, the thing to do in a data center is the work needed to keep the issue from repeating (we’re talking root-cause analysis), or you’ll be repeatedly putting out the same fires.
How Monitoring Assists
A good monitoring system is going to help in several ways. First, it’s going to assist in identifying the underlying issue, including its location: is it in the app, the database, the server, or somewhere else? You don’t want to find that out by testing blindly. You’ll want the capability to create graphs on the fly and to correlate your metrics quickly and easily.
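As a rough illustration of what correlating metrics can mean in practice, the sketch below ranks candidate metrics by how strongly they track a symptom such as response latency, using the Pearson correlation coefficient. The metric names and sample values are hypothetical, and this is not a description of how any particular monitoring product works.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-minute samples collected during an incident.
latency_ms = [120, 125, 130, 180, 260, 400]
candidates = {
    "db_connections": [40, 42, 45, 70, 95, 140],
    "app_cpu_percent": [35, 36, 34, 37, 35, 36],
}

# Rank candidate metrics by how strongly they move with latency.
ranked = sorted(
    candidates,
    key=lambda name: abs(pearson(candidates[name], latency_ms)),
    reverse=True,
)
for name in ranked:
    print(f"{name}: r = {pearson(candidates[name], latency_ms):.2f}")
```

A strong correlation only points to where to look first; it does not prove cause, so the root-cause work still has to follow.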
Okay, so that’s good for remediating a problem and reducing the chance of it recurring. But you’ll also want to take anticipatory actions, like capacity planning, to forestall avoidable bottlenecks. For this, you also want an easy-to-use tool so that you don’t have to muck around with spreadsheets. And you’ll want a “Play” function so that when you do things such as code pushes, you can see the effect of these changes in real time. This way, if the effect of the code push is negative, you can quickly reverse the action without impacting your internal or external clients.
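For a sense of the kind of calculation a capacity-planning tool automates, here is a minimal sketch that fits a straight-line trend to weekly storage usage and estimates how many weeks remain before capacity is reached. The data, the linear-growth assumption and the function names are all hypothetical; real workloads often grow in less tidy ways.

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def weeks_until_full(weekly_usage, capacity):
    """Weeks from the latest sample until the trend line reaches capacity.

    Returns None when usage is flat or shrinking, since the trend
    never reaches capacity in that case.
    """
    weeks = list(range(len(weekly_usage)))
    slope, intercept = linear_fit(weeks, weekly_usage)
    if slope <= 0:
        return None
    week_when_full = (capacity - intercept) / slope
    return week_when_full - weeks[-1]


# Hypothetical storage usage in terabytes, one sample per week.
usage_tb = [50, 52, 54, 56, 58]
print(weeks_until_full(usage_tb, capacity=80))
```

With these made-up samples, usage grows by two terabytes per week, so the estimate is 11 weeks of headroom from the most recent sample.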
The good news is that new solutions with all of this functionality are out there in the marketplace. Of course, before you buy one, be sure to insist on testing the solution in a trial to see how it performs in your current and anticipated (read: hybrid physical and virtual/cloud) environments. This includes seeing how the solution handles your scale, both on the back end and from a UI perspective. Such an evaluation will require an investment of your time, but the result will be well worth it in outages avoided and faster recovery times.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating. View previously published Industry Perspectives in our Knowledge Library.