How to Avoid the Outage War Room

Bernd Harzog, OpsDataStore

Bernd Harzog is CEO of OpsDataStore.

Most IT pros have experienced it: the dreaded war room meeting that starts immediately after an outage of a critical application or service. How do you avoid it? The only reliable way is to avoid the outage in the first place.

First, you need to build in redundancy. Most enterprises have already done much of this work. Building redundancy and disaster recovery into systems has been a best practice for decades. Avoiding single points of failure (SPOF) is simply mandatory in mission-critical, performance-sensitive, highly distributed and dynamic environments.

Next, you need to plan for spikes in load. Most organizations have put in place methods to “burst” capacity. This most often takes the form of a hybrid cloud, where the base system runs on premises and the extra capacity is rented as needed. It can also take the form of hosting the entire application on a public cloud such as Amazon, Google or Microsoft, but that carries many downsides, including the need to re-architect the applications to be stateless so they can run on an inherently unreliable infrastructure.

However, even organizations that have designed their infrastructures to account for all of the common outage scenarios regularly encounter trouble. How can this be? The primary reasons are:

  1. Most enterprises do not have the monitoring tools in place to know the current state of their systems and applications in anything close to real time. Most of the monitoring tools that enterprises rely on were designed for the environments that prevailed a decade ago, when systems were not distributed and dynamic and applications did not change daily.
  2. Enterprises do not accurately know the current state of the end-to-end infrastructure for a transaction (everything from the transaction to the spindle on the disk that supports that transaction), so they have no way to anticipate problems and deal with them before they become outages.
  3. Most outages are preceded by performance problems. In other words, in most cases, outages do not occur suddenly. Performance problems show up as increased response times and reduced rates of transaction throughput. So “blackouts” are most often preceded by “brownouts.”
  4. But most enterprises have extremely immature approaches to understanding both end-to-end transaction response time and throughput (across the application stack) and full stack transaction response time and throughput (again from the click on the browser to the write on the hard disk).
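
The “brownout before blackout” pattern in point 3 can be sketched in a few lines of Python. This is a hypothetical illustration, not a feature of any particular monitoring product: the function name `detect_brownout`, the response time target, and both thresholds are assumptions chosen for the example. It flags a brownout when too many recent transactions exceed a response time target, or when throughput falls well below the expected rate.

```python
def detect_brownout(recent_ms, recent_tps, target_ms, expected_tps,
                    max_slow_fraction=0.1, min_throughput_ratio=0.8):
    """Return True if recent measurements look like a brownout.

    recent_ms   -- individual response times (milliseconds) in the recent window
    recent_tps  -- transactions per second observed in the recent window
    target_ms   -- response time target (hypothetical, chosen by the operator)
    expected_tps -- throughput normally seen at this time (hypothetical)

    The two thresholds are illustrative assumptions, not recommended values.
    """
    if not recent_ms:
        raise ValueError("need at least one response time sample")

    # Share of individual transactions slower than the target.
    slow_fraction = sum(1 for t in recent_ms if t > target_ms) / len(recent_ms)

    # Brownout symptoms: rising response times or falling throughput.
    too_slow = slow_fraction > max_slow_fraction
    too_few = recent_tps < min_throughput_ratio * expected_tps
    return too_slow or too_few


healthy = [120, 130, 125, 118, 122]
degraded = [310, 295, 130, 305, 320]

print(detect_brownout(healthy, recent_tps=400, target_ms=250, expected_tps=400))
print(detect_brownout(degraded, recent_tps=400, target_ms=250, expected_tps=400))
```

The check works on individual response times rather than an average, so a few slow transactions are not diluted by many fast ones. In practice the target and thresholds would have to be tuned per application.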

To address the above issues, a new approach to monitoring is necessary. Monitoring has to be:

  1. Focused on the correct metrics. Too many monitoring products attempt to infer performance from resource utilization metrics. This no longer works. Response time, throughput, error rates and congestion are the crucial metrics that need to be collected at every layer of the stack including the transaction, the application, the operating system, the virtualization layer, the servers, the network and the storage arrays.
  2. Real time. Every monitoring vendor claims that their product operates in real time, but most really don’t. There are delays of 5 to 30 minutes between the time something happens and the time the resulting metric or event surfaces in the console of the monitoring product. Look for true real-time monitoring, measured in milliseconds rather than minutes.
  3. Comprehensive. Today each monitoring tool focuses upon one silo or one layer of the stack. This leads to the dreaded “Franken-Monitor,” where enterprises own many (between 30 and 300) different tools that still somehow have gaps between them. The other problem is that a plethora of tools leads to a plethora of disparate databases in which monitoring data is stored, none of them integrated with each other. Today’s enterprises need a monitoring tool that is comprehensive, with a view of the entire stack.
  4. Deterministic. Most monitoring tools rely upon either statistical estimates of a metric or rolled up averages of metrics that obscure the true nature of the problem. The focus needs to shift to actual values that measure the actual state of the transaction or the infrastructure, not an estimate or an average of these metrics.
  5. Pervasive. Many organizations only implement transaction and application monitoring for a small fraction of their transactions and applications, leaving themselves blind when something happens with an unmonitored application or transaction. The APM industry needs to undergo a big change in order to make pervasive monitoring possible for customers in an affordable manner.
  6. Built on big data. Most monitoring tools are built around SQL back ends that limit the amount and frequency of the data that can be collected, processed and stored. In particular, monitoring needs to embrace real-time big data at scale, allowing metrics to be collected from a hundred thousand servers, processed in real time and then immediately turned around for analysis and consumption.
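
The “deterministic” requirement in point 4 is easy to demonstrate with a small Python sketch. The sample values and the 1,000 ms target below are invented purely for illustration. The rolled-up average looks only mildly elevated, while the actual values show that some transactions took five seconds.

```python
from statistics import mean

# Hypothetical response times in milliseconds for 100 transactions:
# 95 fast ones and 5 that stalled for five seconds.
samples_ms = [100] * 95 + [5000] * 5

# Assumed response time target for this example.
target_ms = 1000

# The rolled-up view: a single average for the whole interval.
average_ms = mean(samples_ms)

# The actual-value view: every transaction that missed the target.
over_target = [t for t in samples_ms if t > target_ms]

print(f"average response time: {average_ms} ms")
print(f"slowest transaction:   {max(samples_ms)} ms")
print(f"transactions over {target_ms} ms: {len(over_target)} of {len(samples_ms)}")
```

Here the average is 345 ms, comfortably under the assumed target, yet 5 of the 100 transactions were 50 times slower than the rest. A tool that stores only the average cannot recover that fact later.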

In summary, both vendors and enterprises need to take a completely different approach to the problem of monitoring performance if they want to avoid the dreaded outage war room.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.
