<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Dealing With Downtime</title>
	<atom:link href="http://www.datacenterknowledge.com/archives/2008/07/11/dealing-with-downtime/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.datacenterknowledge.com/archives/2008/07/11/dealing-with-downtime/</link>
	<description>News and analysis about data centers, cloud computing, managed hosting and disaster recovery</description>
	<lastBuildDate>Mon, 13 Feb 2012 12:57:21 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=</generator>
<xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
	<item>
		<title>By: Steve Henning</title>
		<link>http://www.datacenterknowledge.com/archives/2008/07/11/dealing-with-downtime/comment-page-1/#comment-389</link>
		<dc:creator>Steve Henning</dc:creator>
		<pubDate>Fri, 18 Jul 2008 23:09:16 +0000</pubDate>
		<guid isPermaLink="false">http://dev.datacenterknowledge.com/archives/2008/07/11/dealing-with-downtime/#comment-389</guid>
		<description>There is no doubt that the downfall of some online businesses will be poor handling of unplanned downtime. In my opinion, the more information provided about what went wrong and what is going to be done to solve the problem, the better.

Of course the reason many of these businesses prefer denial to disclosure is that they never actually get to the root cause of recurring brownouts and outages in the first place. How can you confidently address your users if you don&#039;t know what happened and how to avoid it in the future?

When you are managing performance and availability of your online service using static threshold-based monitoring and tribal knowledge, getting to the root cause is a massive, labor-intensive effort. You sift through thousands of alerts from each IT silo trying to find out which are relevant and which are not. You do human correlation based on experience. If you don&#039;t find the problem quickly, you end up rebooting servers and moving on to the next problem before a root cause is determined. And, of course, the problem keeps recurring and you keep rebooting.... With limited resources, post-mortem problem analysis is an afterthought, and it seems better to say nothing than to admit to your user community that you have no idea what caused the outage and have no plans for how you will eliminate recurrences.

These organizations need to be looking at real-time analytics-based solutions if they want to have a prayer of getting ahead of these problems and fixing them for good. These solutions eliminate the need to set static monitoring thresholds by learning the normal behavior of the entire infrastructure supporting the online service. They also do automated correlation of abnormal behaviors to pinpoint the root cause of performance degradations and outages in real time - something that is not humanly possible when you are dealing with thousands of devices and hundreds of thousands of metrics. These solutions also proactively alert to problem behaviors before they occur so that brownouts and outages can actually be prevented. IT Operations is then out of its traditionally reactive mode.

In my opinion, until online businesses adopt these types of solutions, we&#039;ll be hearing a lot of silence when unplanned downtime occurs...
</description>
		<content:encoded><![CDATA[<p>There is no doubt that the downfall of some online businesses will be poor handling of unplanned downtime. In my opinion, the more information provided about what went wrong and what is going to be done to solve the problem, the better.</p>
<p>Of course the reason many of these businesses prefer denial to disclosure is that they never actually get to the root cause of recurring brownouts and outages in the first place. How can you confidently address your users if you don&#8217;t know what happened and how to avoid it in the future?</p>
<p>When you are managing performance and availability of your online service using static threshold-based monitoring and tribal knowledge, getting to the root cause is a massive, labor-intensive effort. You sift through thousands of alerts from each IT silo trying to find out which are relevant and which are not. You do human correlation based on experience. If you don&#8217;t find the problem quickly, you end up rebooting servers and moving on to the next problem before a root cause is determined. And, of course, the problem keeps recurring and you keep rebooting&#8230;. With limited resources, post-mortem problem analysis is an afterthought, and it seems better to say nothing than to admit to your user community that you have no idea what caused the outage and have no plans for how you will eliminate recurrences.</p>
<p>These organizations need to be looking at real-time analytics-based solutions if they want to have a prayer of getting ahead of these problems and fixing them for good. These solutions eliminate the need to set static monitoring thresholds by learning the normal behavior of the entire infrastructure supporting the online service. They also do automated correlation of abnormal behaviors to pinpoint the root cause of performance degradations and outages in real time &#8211; something that is not humanly possible when you are dealing with thousands of devices and hundreds of thousands of metrics. These solutions also proactively alert to problem behaviors before they occur so that brownouts and outages can actually be prevented. IT Operations is then out of its traditionally reactive mode.</p>
<p>In my opinion, until online businesses adopt these types of solutions, we&#8217;ll be hearing a lot of silence when unplanned downtime occurs&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Michael Janke</title>
		<link>http://www.datacenterknowledge.com/archives/2008/07/11/dealing-with-downtime/comment-page-1/#comment-388</link>
		<dc:creator>Michael Janke</dc:creator>
		<pubDate>Fri, 11 Jul 2008 23:10:34 +0000</pubDate>
		<guid isPermaLink="false">http://dev.datacenterknowledge.com/archives/2008/07/11/dealing-with-downtime/#comment-388</guid>
		<description>Having had to explain unplanned outages to a hundred thousand or so customers a bit more than I&#039;d have preferred, I agree that straight talk when it comes to problems, performance or outages is the only path to take. I&#039;m surprised, though, at the number of executives/managers/directors/PR people who would rather ignore or deny the obvious.

Your customers already know that you&#039;ve had a service-affecting outage. You might as well own up to it.
</description>
		<content:encoded><![CDATA[<p>Having had to explain unplanned outages to a hundred thousand or so customers a bit more than I&#8217;d have preferred, I agree that straight talk when it comes to problems, performance or outages is the only path to take. I&#8217;m surprised, though, at the number of executives/managers/directors/PR people who would rather ignore or deny the obvious.</p>
<p>Your customers already know that you&#8217;ve had a service-affecting outage. You might as well own up to it.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

