Root Cause Analysis: An Alternative to Blamestorming

David Mavashev is CEO of Nastel Technologies, a provider of application performance monitoring (APM).

"Blamestorming" – to my surprise this term is actually in the dictionary or at least dictionary.com. The definition is as follows, “an intense discussion or meeting for the purpose of placing blame or assigning responsibility or failure.” How is this relevant to IT Operations or Healthcare IT?

After an extraordinarily rocky start, the federal healthcare exchange – the online marketplace consumers use to purchase health insurance under the Affordable Care Act, a.k.a. “Obamacare” – seems to be working more smoothly. But now problems are cropping up with the state healthcare exchanges. Media reports highlight state-level exchange system issues seemingly every week. This shouldn’t be a surprise, as we are dealing with a highly complex system.

When you alleviate a bottleneck in one location of a complex system, the result is often a newly visible series of bottlenecks in other locations. Transactions now flow past the prior bottleneck only to hit another logjam in a different area of the system. The rule of thumb in performance analysis is to analyze before you fix and to focus on the most significant bottleneck first. The state of affairs will change once that issue is relieved, and you then focus on the next most significant issue. However, this well-worn IT approach is not always followed.
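The rule of thumb above can be illustrated with a minimal sketch. The stage names and capacity figures are hypothetical, chosen only for illustration, and the model assumes that end-to-end throughput is limited by the slowest stage.

```python
# Hypothetical per-stage capacities in transactions per second (illustrative only).
capacities = {"web": 500, "app": 120, "database": 200, "identity_check": 150}


def bottleneck(caps):
    """Return (stage, capacity) for the stage that limits end-to-end throughput."""
    stage = min(caps, key=caps.get)
    return stage, caps[stage]


def relieve(caps, stage, new_capacity):
    """Return a copy of caps with one stage upgraded to new_capacity."""
    updated = dict(caps)
    updated[stage] = new_capacity
    return updated


# Analyze first: find the most significant bottleneck.
first = bottleneck(capacities)

# Relieve it, then analyze again: a different stage is now the limit.
after_fix = relieve(capacities, "app", 400)
second = bottleneck(after_fix)

print(first)   # ('app', 120)
print(second)  # ('identity_check', 150)
```

In this toy model, more than tripling the capacity of the worst stage raises overall throughput only from 120 to 150, because the next bottleneck takes over. That is why the analysis has to be repeated after every fix.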

Some states have singled out vendor software as the culprit. Others blame a lack of comprehensive testing or interoperability. Still others cite inconsistent project leadership and failures to address known issues in time to achieve a smooth rollout. Some or all of these glitches may sound familiar to CIOs and IT executives who have spearheaded the launch and maintenance of a complex system.

The Rush to Point Fingers

CIOs may also recognize a familiar tone from people quoted in the news reports – the rush to affix blame. When a complex system doesn’t work, groups that handle components of the larger system tend to focus on deflecting responsibility from their unit. It’s important to find out what went wrong, but a more fruitful discussion would focus on identifying root causes like scalability and infrastructure monitoring capacity.

There are a number of possible explanations for a troubled system rollout. Clearly, the system lacks the capacity to handle anticipated demand. Was the anticipated demand known? In this case, the answer is decidedly “yes.” Or worse, perhaps there were no requirements for load tests that simulated anticipated demand. Was there a clear set of “user stories” that illustrated what the system must do to be effective? User stories, as part of an agile development environment, often include performance expectations and should also cover the range of users expected to use the application.

A friend of mine told me about her endless troubles in registering for healthcare. She is a private instructor with irregular hours who fits the profile of the type of user this program was supposed to address. Previously, she was not able to get affordable healthcare and had hoped that this would address her needs. It might actually do that, if she could get registered. When she tried to register, the website told her that the income she entered did not match what the state had on file. It turned out the application wanted her projected income through the end of the current year. But since she is not an employee with a regular salary, there was no way to provide that with certainty. The application made assumptions that didn’t fit the target audience. Apparently, the user stories created were not appropriate or complete. At this point, she still has not made it through the application process.

Alternatively, there may have been flaws in the architecture, or perhaps coding bugs are responsible. Maybe there’s a database access issue. Any or all of these explanations may play a role, but here’s the fundamental problem: the technology professionals charged with resolving the issue typically work in silos, and the person in charge may feel overwhelmed by the sheer volume of analysis and speculation. This is especially true when past experience tells them that all this painstaking work has produced little to no result.


When companies experience this type of scenario, the impulse is to gather everyone together in one room to get to the bottom of the performance problem. And that’s when the “blamestorming” usually starts in earnest, as IT management often focuses more on pinning the responsibility on someone than on working together to resolve the problem. Maybe they just don’t have the necessary visibility into their application environment to do it any other way.

Does the Plan Fit the Requirements? What's Plan B?

One way to avoid this situation is to think through, up front, how to ensure the application meets user requirements. There also must be a plan for high availability, reliability and scalability for all of the system’s component applications before a problem occurs. This seems like common sense – and it is. A plan must also be put in place to ensure end-to-end visibility, so that when a problem occurs, and it will, the necessary insight into its details and cause is available for rapid diagnosis and remediation.

But when IT departments are rushing to meet deadlines, many rely solely on existing silo-based infrastructure monitoring tools (e.g., network, server, web server, application server, database) that do not provide “situational” visibility. These tools provide a stovepipe view of what’s going on in their own domain and no clear way to differentiate between symptoms and causes. They don’t build in ways to address the need for scalability, nor do they enable root cause analysis at the application level.

A Different, Proactive Approach to Monitoring

IT leaders should think about a solution that goes beyond threshold-based monitoring of the performance of networks, web servers, databases and servers. A more proactive approach would involve a solution that monitors situations composed of events from multiple sources, corresponding to expected user story scenarios. In this way, we align our monitoring with the business cases. We monitor what matters. Instead of threshold violations, most of which are false alarms, we get actual problem alerts. In addition, an effective solution should track the message and transaction flows the applications invoke, thus delivering visibility across applications. This solution should use real-time analytics to find problems before users do, diagnosing probable cause and predicting potential failures. Now, instead of being the villain the users blame, IT can be the hero that catches a problem before anyone even knows. This is a much better option than blamestorming.
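As a rough illustration of monitoring “situations” rather than isolated threshold violations, the sketch below raises an alert only when every required source reports an event for the same transaction within a short time window. The `Event` fields, the source names and the window length are all assumptions made for this example. They do not describe any particular product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str          # hypothetical source name, e.g. "web" or "database"
    transaction_id: str  # identifier that ties events to one user transaction
    timestamp: float     # seconds since some reference point
    kind: str            # e.g. "slow_response" or "timeout"


def detect_situations(events, required_sources, window_seconds):
    """Return sorted transaction ids for which every required source
    reported an event and all of those events fall within one window.

    Simplification: the window is measured across all relevant events
    for a transaction, not per pair of sources.
    """
    by_transaction = defaultdict(list)
    for event in events:
        by_transaction[event.transaction_id].append(event)

    flagged = []
    for transaction_id, group in by_transaction.items():
        relevant = [e for e in group if e.source in required_sources]
        if {e.source for e in relevant} != set(required_sources):
            continue  # a single-source symptom alone is not a situation
        times = [e.timestamp for e in relevant]
        if max(times) - min(times) <= window_seconds:
            flagged.append(transaction_id)
    return sorted(flagged)


events = [
    Event("web", "t1", 0.0, "slow_response"),
    Event("database", "t1", 1.5, "timeout"),
    Event("web", "t2", 3.0, "slow_response"),   # web only: no alert
    Event("database", "t3", 4.0, "timeout"),    # database only: no alert
]

situations = detect_situations(events, {"web", "database"}, window_seconds=5.0)
print(situations)  # ['t1']
```

Only transaction `t1` is flagged, because it is the only one where the web and database symptoms coincide. The isolated events on `t2` and `t3` would have tripped a per-component threshold but do not form a situation in this sketch.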

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.
