
IT Availability: The Whole Truth

Some of the figures people quote for the reliability of their IT systems look ludicrously low, very often through misunderstood terminology or simple cheating.

Data Center Knowledge

February 1, 2022


"You can fool some of the people all of the time, and all of the people some of the time, but you cannot fool all of the people all of the time," said Abraham Lincoln.

Without casting aspersions on the veracity of the figures people quote for the reliability of their IT systems, the outage times many of them claim look, on examination, ludicrously low.

The first issue to look at is the mean time to repair (MTTR) of a particular failure. This may be a short or a long time, but the real problem is that it says nothing about the extra time needed to 'get the show back on the road.'

If you replace the word 'repair' with 'recover' in the above definition, you will be closer to the truth. It may take a minute to decide that you have run two supposedly sequential jobs in the wrong order and two minutes to restart them in the correct order. However, your database will almost certainly be out of kilter as far as consistency is concerned, and the 'repair' of that will take much longer.

In too many cases, financial bodies (banks, stock dealers) have repaired faults but have taken many hours to recover normal working conditions. "The system was repaired at 11am and trading commenced normally at 2.30pm" is a typical report.

This leaves us with an equation for recovery: mean time to recover = mean time to repair the error + mean time to return to full working mode.

The last part of the equation I have called 'ramp-up' time, representing the time needed to put the systems back into operational mode as viewed by the end business user and not the network specialist who took three minutes to repair a failing network module. A decent service-level agreement will include the ramp-up time in the recovery time specification.
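To see how the stages add up, here is a minimal sketch in Python. The figures are invented purely for illustration, and the function name is mine; the point is that the outage the business experiences is the sum of the stages, not the repair alone.

```python
# A minimal sketch (figures invented for illustration): the outage the
# business sees is the sum of the stages, not just the repair itself.

def user_perceived_outage(mtpd_min: float, repair_min: float,
                          ramp_up_min: float) -> float:
    """Total outage in minutes, as the end business user experiences it:
    problem determination + repair + ramp-up back to full working mode."""
    return mtpd_min + repair_min + ramp_up_min

# The network specialist's three-minute fix is dwarfed by finding the
# problem and restoring (say) database consistency afterwards.
total = user_perceived_outage(mtpd_min=15, repair_min=3, ramp_up_min=42)
print("Specialist's view of the outage: 3 minutes")
print(f"Business user's view of the outage: {total:.0f} minutes")
```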

The recovery time should emerge from a business impact analysis (BIA), which specifies how long a business service can be out of 'normal' action before the situation becomes critical or otherwise untenable.
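As a rough sketch of that check, again in Python: the parameter max_tolerable_outage_min is a hypothetical name standing in for whatever tolerance the BIA produces, not a standard term. A recovery specification only passes muster if repair plus ramp-up fits inside it.

```python
# Sketch of the BIA check: 'max_tolerable_outage_min' is a hypothetical
# name for the tolerance a BIA would produce, not a standard term.

def sla_recovery_is_adequate(repair_min: float, ramp_up_min: float,
                             max_tolerable_outage_min: float) -> bool:
    """True only if repair plus ramp-up fits within the BIA tolerance."""
    return repair_min + ramp_up_min <= max_tolerable_outage_min

# A three-minute repair looks comfortable until ramp-up is counted in.
print(sla_recovery_is_adequate(repair_min=3, ramp_up_min=42,
                               max_tolerable_outage_min=30))  # False
```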

It is possible for the repair action to take place while the system application is still running, for example, repairing a part while a parallel redundant part takes over its job. In such a case, there will be a repair time of X minutes but zero outage time to report, because the end user sees no interruption to his or her service.
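A back-of-an-envelope calculation shows why redundancy masks the repair. With two parts in parallel and independent failures, the service is down only when both parts are down, so the pair's availability is 1 - (1 - A)^2. The figures below are illustrative only:

```python
# Standard parallel-availability arithmetic, assuming independent failures
# and repair while the redundant part carries the load. Figures illustrative.

single = 0.99                         # availability of one part
pair = 1 - (1 - single) ** 2          # down only when BOTH parts are down

hours_per_year = 24 * 365
print(f"One part:       {(1 - single) * hours_per_year:6.2f} hours down/year")
print(f"Redundant pair: {(1 - pair) * hours_per_year:6.2f} hours down/year")
```

On those figures, a single part is down roughly 87 hours a year, the redundant pair well under one hour, even though each individual repair still takes its X minutes.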

This leads me to the penultimate point: only by understanding all the steps in a failure and its recovery can you plan to minimize the times involved in each stage.

The simple diagram below illustrates this:

[Figure: Fault Discovery and Recovery Elements]

Glossary

  • MTPD: mean time to problem determination

  • MTTR: mean time to repair or mean time to recover, depending on your viewpoint

Note: Repair and recover are not the same thing, though many people think so. The restoration or repair of a failing piece of hardware or software does not mean the affected service is available the moment the repair is complete. There are often other recoveries required, as described above, to get the whole show back on the road as it was before the failure occurred.

The final point to make is that there are several viewpoints of an outage or period of downtime, depending on your place in an organization.

The end user's view will be that the outage lasts as long as he or she is prevented from using IT to do the job they are supposed to do.

The server specialist's view might be that the outage of his hardware was a mere minute or two before it was fixed, whereas the network person will say: "What's all the fuss about? Everything is working fine."

I remember an incident at a chemical company where the users claimed their application was unavailable for nearly two days, but when challenged, the IT department said all their equipment – to the extent they could monitor it at the time – was working fine: a stand-off.

It transpired that the Planet ring to which they were attached had failed and nobody except the end users had noticed. The rest of the network was indeed still operating to spec. I was present at the meeting where the users' representative tackled the IT people. Quite a pantomime.

It all depends on your viewpoint and I know what viewpoint the company CEO and board will eventually take. Do you?

Dr. Terry Critchley is a retired IT consultant and author who's worked for IBM, Oracle and Sun Microsystems, with a short stint at a big UK bank.

Books by Terry Critchley:

High Availability IT Services

High Performance IT Services

Making It in IT
