Bostjan Kaluza, PhD, is Chief Data Scientist at Evolven Software.
IT operations teams often run several infrastructure monitoring approaches side by side, covering devices, networks, servers, applications and storage, on the assumption that the whole is equal to the sum of its parts. According to a 2015 Application Performance Monitoring survey, 65 percent of surveyed companies own more than 10 different monitoring tools.
Despite the growth in instrumentation capabilities and in the amount of collected data, enterprises barely use these significantly larger data sets to improve availability and performance through root cause analysis and incident prediction. W. Cappelli (Gartner, October 2015) emphasizes in a recent report that “although availability and performance data volumes have increased by an order of magnitude over the last 10 years, enterprises find data in their possession insufficiently actionable … Root causes of performance problems have taken an average of 7 days to diagnose, compared to 8 days in 2005 and only 3 percent of incidents were predicted, compared to 2 percent in 2005”. The key question is: How can enterprises make sense of these piles of data?
This is essentially a big data problem: a large volume of data, because instrumentation technologies can collect granular details of monitored environments; high velocity, because data are collected in real time; variety, with data originating from semi-structured logs, unstructured natural language found in change and incident tickets, and structured APM events; and questionable veracity, a result of uncleaned, untrusted or missing measurements. In response, IT operations analytics (ITOA) solutions are coming to market as an approach to deriving insights into IT system behaviors:
- Knowing when there is a problem that affects users
- Prioritizing responses to problems on the basis of business impact
- Avoiding chasing problems that don’t exist, or deprioritizing those that don’t affect users
- Troubleshooting with a problem definition that matches performance metrics
- Knowing when (or if) you’ve actually resolved a problem
The ITOA market insights from Gartner tell an interesting story: Spending doubled from 2013 to 2014 to reach $1.6 billion, while estimates suggest that only about 10 percent of enterprises currently use ITOA solutions.
Making Sense of Collected Data
Correlating cross-silo data is not a new problem. In the past, a common correlation technology referred to as an Event Correlation Engine handled event filtering, aggregation, and masking. The next approach, which has roots in statistical analysis and signal processing, compares different time series and detects correlated activity using correlation, cross-correlation, and convolution. More recently, a new wave of clustering-based machine learning algorithms applies a kind of smart filtering that can identify event storms.
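To make the time series comparison concrete, here is a minimal sketch of lagged cross-correlation in plain Python. The function names (`pearson`, `best_lag`) and the sample CPU and latency readings are hypothetical, made up for illustration; a production tool would typically use a numerical library and far longer series.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def best_lag(x, y, max_lag):
    """Try every shift in [-max_lag, max_lag] and return (lag, correlation)
    for the best match. A positive lag means activity in x precedes y."""
    best = (0, float("-inf"))
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        r = pearson(a, b)
        if r > best[1]:
            best = (lag, r)
    return best

# Hypothetical readings sampled at the same interval.
cpu = [10, 10, 11, 10, 50, 52, 11, 10, 10, 10, 10, 10]
latency = [100, 100, 101, 100, 100, 100, 400, 410, 102, 100, 100, 100]

lag, r = best_lag(cpu, latency, max_lag=4)
print(f"best lag: {lag} samples, correlation: {r:.2f}")
```

In this made-up data the latency spike trails the CPU spike by two samples, so the strongest correlation appears at a lag of 2. The result only says the two series move together with a delay; it does not establish that one caused the other.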
While these techniques are useful and do make life easier by reducing the number of events entering investigation, they do not answer the key question at hand: “What is the root cause of a problem?”
Knowing that two time series are correlated does not tell us which one caused the other to spike; correlation does not imply causation. To get beyond that, we need to understand the cause-effect relationships between data sources.
The key to effective root cause analysis lies in establishing cause-effect relationships between available data sources. It is of crucial importance to understand which data sources contain triggers that will affect the environment, what the actual results of the triggers are, and how the environment responds to the changes.
Connecting the Dots with Machine Learning
The key hurdle is establishing basic relationships between collected data sources. The main task is to correlate events, tickets, alerts, and changes using cause-effect relationships, for example, linking a change request to the actual changes in the environment, linking an APM alert to a specific environment, and linking a log error to a particular web service. As we are dealing with various levels of unstructured data, the linking process (or correlation) is not that obvious. This is a perfect task for machine learning, as it can create general rules between different data sources, determine how to link them to environments, and decide when it makes sense to do so.
Machine learning is a field that studies how to design algorithms that can learn by observing data. It has traditionally been used to discover new insights in data, to develop systems that can automatically adapt and customize themselves, and to design systems where it is too complex or too expensive to hand-code a response to every possible circumstance, for example, self-driving cars. Given the steady progress in machine learning theory and algorithms, and the availability of computational resources on demand, it is no surprise that we see more and more machine learning applications in ITOA.
Machine learning can also be used to build an environment dependency model based on environment topology, component dependencies, and configuration dependencies. On one hand, such a model enables topology-based correlation, which suppresses candidate root causes in elements that are unreachable from the environment where the problem was reported.
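The topology-based suppression described above can be sketched as a reachability check over a dependency graph. This is only an illustration under stated assumptions: the graph is a hand-written dictionary mapping each component to the components it depends on, and all component names and the helpers `reachable` and `filter_candidates` are hypothetical. In practice the graph would come from the learned dependency model.

```python
from collections import deque

def reachable(dependencies, start):
    """Breadth-first search: every component that `start` depends on,
    directly or transitively, including `start` itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in dependencies.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

def filter_candidates(dependencies, reported_on, candidates):
    """Keep only candidate root causes reachable from where the problem was reported."""
    scope = reachable(dependencies, reported_on)
    return [c for c in candidates if c in scope]

# Hypothetical topology: component -> components it depends on.
dependencies = {
    "web-frontend": ["order-service"],
    "order-service": ["orders-db", "message-queue"],
    "reporting-job": ["warehouse-db"],
}
candidates = ["orders-db", "warehouse-db", "message-queue"]

kept = filter_candidates(dependencies, "web-frontend", candidates)
print(kept)  # ['orders-db', 'message-queue']
```

Here a change on `warehouse-db` is dropped from the investigation because nothing on the path from `web-frontend` depends on it.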
On the other hand, such a dependency diagram can be modeled as a Bayesian network, which augments the model with probabilities of error propagation, defect spillover, and influence. Building such a model by hand is practically infeasible, as it requires specifying many probabilities of influence between environment components, even before addressing the constantly evolving environment structure. However, by leveraging machine learning and the vast amounts of data describing historical performance, it is possible to build a model that estimates all the required probabilities automatically and updates them on the fly.
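As a rough illustration of estimating such probabilities from history, the sketch below counts, for each dependency edge, how often a downstream component failed in incidents where its upstream dependency failed, and smooths the estimate so that sparse data does not produce hard 0 or 1 values. It is a simplified per-edge frequency estimate, not full Bayesian network inference; the class `PropagationModel`, its parameters, and the incident data are all hypothetical.

```python
from collections import defaultdict

class PropagationModel:
    """Estimates P(downstream fails | upstream fails) per dependency edge."""

    def __init__(self, edges, prior_hits=1, prior_total=2):
        self.edges = list(edges)          # (upstream, downstream) pairs
        self.prior_hits = prior_hits      # smoothing pseudo-counts (assumed values)
        self.prior_total = prior_total
        self.hits = defaultdict(int)      # upstream and downstream both failed
        self.total = defaultdict(int)     # upstream failed

    def observe(self, failed):
        """Update the counts with one incident, given the set of failed components."""
        failed = set(failed)
        for up, down in self.edges:
            if up in failed:
                self.total[(up, down)] += 1
                if down in failed:
                    self.hits[(up, down)] += 1

    def probability(self, up, down):
        """Smoothed estimate; returns the prior (0.5 by default) with no data."""
        edge = (up, down)
        return (self.hits[edge] + self.prior_hits) / (self.total[edge] + self.prior_total)

# Hypothetical edges and incident history.
model = PropagationModel([
    ("orders-db", "order-service"),
    ("order-service", "web-frontend"),
])
for incident in (
    {"orders-db", "order-service", "web-frontend"},
    {"orders-db", "order-service"},
    {"orders-db"},
):
    model.observe(incident)

print(model.probability("orders-db", "order-service"))     # (2 + 1) / (3 + 2)
print(model.probability("order-service", "web-frontend"))  # (1 + 1) / (2 + 2)
```

Because `observe` only increments counters, each new incident refines the estimates without retraining from scratch, which is one simple way to read "on the fly".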
The analysis of collected data processed by an ITOA solution powered by machine learning now gains a completely new perspective. The data collected by separate monitoring solutions can be analyzed together, resulting in a semantically annotated sequence of events. The list of possible root causes can be shortened significantly by applying probabilistic matching, fuzzy logic, linguistic correlation, and frequent pattern mining. And, finally, automatic inference about the most probable root causes now takes into account the environment dependency structure as well as previous incidents.
Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Penton.