Internet Titans Not Immune to Downtime
September 4th, 2008 By: Rich Miller
Web site monitoring service Pingdom takes a look at the major downtime incidents of 2008, some of which will be familiar to regular readers of our coverage of Internet outages. Pingdom has categorized the incidents by type: cloud glitches, problems with service launches, data center issues and even sabotage. It’s an interesting look at the variety of challenges encountered this year.
An interesting trend is that many of the high-profile outages hit companies and services with industrial-strength infrastructure and well-heeled owners. “One thing that the following examples clearly show is that no one is immune to downtime,” Pingdom writes. “Not Google, not Microsoft, and not Apple.”
It doesn’t matter how well-heeled a company is, nor how industrial-strength its infrastructure. These outages and brownouts are going to continue until these enterprises get their Operations act together and start taking a more proactive approach to performance management. I guarantee you that every one of these companies is using a variety of siloed monitoring solutions and relying on static thresholds on individual metrics to determine whether they have a problem. Of course, this is not very effective: if they set the thresholds too high, end users are already calling to complain by the time the alerts arrive. Set them lower and they get a constant flow of alerts that masks the real problem precursors, so they still find out about problems from end users. Most of these folks are probably not monitoring critical end-user experience data, and are not incorporating business performance metrics that would let them focus their efforts on the problems that are really impacting the business. Even the well-heeled, with their fancy BSM dashboards, Event Management systems, and complex processes and procedures, cannot prevent problems from affecting end users and the bottom line of the business. Let’s not even get into the effect on these companies’ reputations…
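To make the threshold trade-off concrete, here is a minimal Python sketch. The response-time numbers and the two threshold values are made up purely for illustration; they are not taken from any of the outages Pingdom lists.

```python
def static_alerts(samples, threshold):
    """Return the indices of samples that exceed a fixed threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


# Hypothetical response times in milliseconds: normal traffic bounces
# between roughly 120 and 190 ms, then the service degrades toward an outage.
response_ms = [120, 180, 125, 190, 130, 185, 240, 320, 450, 900]

# Threshold set high: only the very last sample fires, long after
# the degradation started.
high = static_alerts(response_ms, 800)

# Threshold set low: ordinary peaks fire too, so the real precursors
# are buried among routine alerts.
low = static_alerts(response_ms, 150)

print("high threshold alerts at:", high)
print("low threshold alerts at:", low)
```

With these assumed numbers the high threshold stays silent until the final sample, while the low one fires on three perfectly normal peaks before the degradation even begins.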
So what is missing? Well… an automated “brain” that can integrate with their existing monitoring infrastructure and understand the normal behavior of all the components that make up these complex, customer-facing business services. A solution that can add context and tell IT Operations when to pay attention and what to pay attention to. Let’s face it… their current tools aren’t giving them these two critical pieces of information. In fact, these tools are unintentionally confusing the issue.
Performance management analytics solutions exist that take metric data from siloed monitoring sources and analyze it holistically, learning the normal behavior of every metric collected and sending a heads-up when significant abnormal behaviors indicate a problem is imminent. These solutions often predict problems hours before they occur and include the most likely root-cause symptoms so that action can be taken to prevent them.
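As a rough illustration of what "learning the normal behavior" of a single metric can mean, the Python sketch below keeps a rolling window of recent samples and flags any value that strays too far from that window's mean. This is a deliberately simple stand-in (rolling mean and standard deviation), not the method any particular product uses; the function name, window size, sigma multiplier, and latency numbers are all assumptions made up for the example.

```python
from collections import deque
from statistics import mean, stdev


def baseline_alerts(samples, window=5, sigmas=3.0):
    """Flag samples that deviate from a rolling baseline of recent values.

    The baseline is the mean of the last `window` non-anomalous samples;
    a sample is flagged when it lies more than `sigmas` standard
    deviations away from that mean.
    """
    history = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sd = stdev(history)
            # A perfectly flat window (sd == 0) gives no scale to judge
            # deviations by, so this sketch simply does not alert on it.
            if sd > 0 and abs(value - mu) > sigmas * sd:
                alerts.append(i)
                continue  # keep anomalies out of the learned baseline
        history.append(value)
    return alerts


# Hypothetical latency in milliseconds: steady around 100 ms, then drifting up.
latency_ms = [100, 102, 98, 101, 99, 100, 103, 97, 130, 160]

print(baseline_alerts(latency_ms))
```

Note that a static threshold of, say, 200 ms would never fire on this series, while the baseline check flags the last two samples because they are far outside what the metric has been doing recently. Correlating many metrics across silos, which is the holistic part, is beyond what this single-metric sketch attempts.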
Until IT Operations teams embrace solutions such as these, Pingdom will have plenty to report on…