Omer Trajman is Co-Founder and CEO of Rocana.
What's in your data center? The question seems innocuous, but increasing levels of abstraction such as IaaS, PaaS, and SDNs make it a real challenge for IT operations and security teams to answer. The dynamic allocation of applications and components adds yet another level of complexity to IT operations. How do IT teams inventory what systems and software are running when they are constantly changing? How do you debug a performance problem when the application code may be migrating from on-premises to off-premises servers and back again?
Consider the case of Thor, a 44-year-old IT admin for a Fortune 500 telecommunications firm. When Thor started his career, he managed a small set of servers that ran enterprise applications. Each application was installed on a specific server, and Thor and other admins gained familiarity with each of these servers and their interconnections. This "tribal knowledge" was the basis on which troubleshooting was done. Thor would hear other IT admins say things like, "Oh, yeah, that server connects to the Sun workstation in building 4200. That network connection is a little flaky." Now consider Thor’s plight as he manages an SOA application with several thousand Java components that talk to cloud servers managed by SaaS application providers. Those SOA application components are deployed on a PaaS, which dynamically scales the number of nodes up and down to meet demand. How can Thor determine whether users are experiencing performance problems because node scaling is not keeping up, or because there is a systemic problem with the connection to the cloud-based application database?
Infrastructure complexity has brought about the death of tribal knowledge. At the same time, monitoring and management tools haven't kept pace with the rate of change of underlying technology. Of course, vendors have tried to solve the problem for their part of the stack, leading to a proliferation of monitoring and management silos. In order to answer a seemingly simple question like, “What systems and software are running and where?” in a modern infrastructure, Thor might have to consult a half dozen tools or more to get raw data, and then struggle to merge the data into something sensible. It may be possible for Thor to “brute force” his way to a monthly report, but it certainly would be a time-consuming, error-prone, and headache-inducing process.
Often these siloed, domain-specific tools also limit the data available by retaining data only from the sources they deem important, or by limiting data retention to extremely short periods, or both. Much like politics, the perspective provided by each of these silos creates factions, with differences of opinion that are difficult to reconcile. Since every group is working from a different set of information, there is no common base of information across teams. So, how are performance problems and outages resolved?
Here are “Seven Must-Haves for Monitoring Modern Infrastructures” as a way to help people like Thor answer the previous question:
- Collect data from all systems in one repository so there is a “single source of truth” for all to share.
- Maintain all machine data (syslogs, application logs, metrics, etc.) for extended periods of time - months for performance data and years for security data - so you can go back in time for forensics and to test models.
- Create a fault-tolerant, lossless data collection mechanism.
- Ensure monitoring systems are more available and scalable than the systems being monitored (credit to Adrian Cockcroft of Netflix).
- Perform real-time analysis on data so issues can be surfaced before they become crises.
- Use anomaly detection and machine-learning algorithms to create an “augmented reality” for IT admins, helping guide them through TBs or even PBs of data.
- Provide a publish/subscribe mechanism that has real-time aggregation, transformation, and filtering of data for sharing with visualization tools, R-based models, and other tools.
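To make the last two items more concrete, here is a minimal Python sketch of how a publish/subscribe layer with per-subscriber filtering and transformation might feed a simple anomaly detector. Everything in it is an assumption made for illustration: the `EventBus` and `ZScoreDetector` classes, the `metrics` topic, the event fields, and the sample latency values are hypothetical and do not come from any particular product. A z-score check stands in for whatever anomaly-detection or machine-learning method a real system would use, and a production system would rely on a durable, distributed message bus rather than an in-memory dictionary.

```python
from collections import defaultdict
from statistics import mean, pstdev


class EventBus:
    """Tiny in-memory publish/subscribe hub (hypothetical, for illustration only)."""

    def __init__(self):
        # topic name -> list of (handler, predicate, transform) tuples
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler, predicate=None, transform=None):
        """Register a handler with an optional filter and an optional transform."""
        self._subscribers[topic].append((handler, predicate, transform))

    def publish(self, topic, event):
        """Deliver an event to every subscriber whose filter accepts it."""
        for handler, predicate, transform in self._subscribers[topic]:
            if predicate is not None and not predicate(event):
                continue  # this subscriber is not interested in the event
            handler(transform(event) if transform is not None else event)


class ZScoreDetector:
    """Flags values far from the mean of previously seen values (a simple stand-in)."""

    def __init__(self, threshold=3.0, min_samples=5):
        self.threshold = threshold      # assumed cutoff, in standard deviations
        self.min_samples = min_samples  # wait for some history before judging
        self.history = []               # unbounded here; a real system would window it
        self.anomalies = []

    def __call__(self, value):
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                self.anomalies.append(value)
        self.history.append(value)


bus = EventBus()
detector = ZScoreDetector()

# The detector only wants latency readings, reduced to their numeric value.
bus.subscribe(
    "metrics",
    detector,
    predicate=lambda e: e["name"] == "latency_ms",
    transform=lambda e: e["value"],
)

# Made-up sample readings: steady latency followed by one spike.
for v in [100, 102, 98, 101, 99, 103, 100, 450]:
    bus.publish("metrics", {"host": "app-01", "name": "latency_ms", "value": v})

# A different metric on the same topic is filtered out before it reaches the detector.
bus.publish("metrics", {"host": "app-01", "name": "cpu_pct", "value": 99})

print(detector.anomalies)
```

The point of the design is that publishers do not need to know who consumes the data: a visualization tool, a statistical model, or an alerting rule can each subscribe to the same stream with its own filter and transformation.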
With this powerful “hammer” in hand, IT admins like Thor can begin implementing solutions that bypass the brute force approach and start augmenting operations. They will be able to answer the question, “What’s in your data center?”, and gain the awareness to answer the follow-up question, “What’s going on in your data center?”
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.