Building the Next-Gen IT Management Infrastructure

Deep Bhattacharjee is Head of Product Management at ZeroStack.

Cloud-native apps are now built as distributed systems, with clustering and built-in fault tolerance, so that the failure of any single component cannot bring the application down. Furthermore, the application can be scaled on demand.

So, why can't we build IT management systems that way? Such a system is nothing but a meta-app that converts bare-metal hardware into a software-driven cloud that can be consumed via APIs.

In the past, I have argued that management systems are like puppies that need special attention. Installing, maintaining and upgrading them significantly increases the operational expense of running an enterprise data center. Think about how Boeing builds new planes: every new model is better than the previous generation in fuel efficiency, level of automation, etc. That cannot be said of IT infrastructure management systems.

So, how should new infrastructure management systems be designed and built?

The system should be highly available by default. In the past, automobiles came with anti-lock brakes as an option. Not anymore. All cars come with them by default. IT management systems that take care of serious workloads should likewise be highly available (HA) by design. Customers should no longer have to configure HA themselves and then maintain the setup by hand.
Management systems should choose their persistent store wisely. These systems generate a lot of data, not all of which is transactional in nature. Out of the millions of stats that are generated, does it really matter if a few are dropped? Customers end up spending a lot of time monitoring, tuning, and adding capacity to these databases. Managing traditional databases incurs large license costs and creates operational bottlenecks. Next-generation management systems should be designed with NoSQL databases as the main persistence layer, with pockets of SQL where ACID properties are absolutely essential.
Scale must be construed and handled differently than it is now. The problem of distribution often gets misconstrued as one of scale. Customers are more likely to have multiple data centers spread across geographies than one very large data center where all of their infrastructure lives. IT management systems should be designed to be multi-site aware from the ground up. This also covers scenarios where the customer has a dual-cloud strategy but wants a single management interface.
System components should scale linearly. The previous generation of converged infrastructure (VCE, FlexPod, etc.) offered compute, storage and networking in a single system, but customers had to start big even if they had very few workloads, and then grow into the infrastructure. This meant tying up cash that could not be spent elsewhere, as well as sitting on depreciating assets.
Modular design should be the norm. A customer may have a large number of clients (developers or operators) making API or CLI calls to the IT management system. The actual number of VMs under management may be low, but the sheer volume of API calls can bog down the system. The API service should be separate from, say, the VM scheduling service, and both should be able to scale independently without any user intervention. This is where real operating expense savings occur. Without this separation, it would probably take a month to debug what the real issue was and another year to get a fix from your vendor.
Customers upgrade their smartphones quite often to get the benefits of new features. Why don’t they do the same for their management systems? Because it is fundamentally hard. Management systems should be designed so that patches and upgrades are automated and do not need a month-long planning process.
On premises vs. running in the cloud. Most enterprises still want their workloads and data on premises, under their complete control. However, this does not mean that all of the management components must also live within the enterprise. The IT management system can be divided into two components: the main control plane, which includes the compute, storage and SDN layers, stays on premises, while the operations and consumption layer runs from the cloud. This is a huge advantage for two reasons:
- Our experience has shown that the core infrastructure piece does not change that often.
- Customers ask for features that are mostly delivered from the consumption layer. Delivering this layer as a SaaS component makes those features available within weeks or months of the request, as opposed to years. An agile enterprise can now get its management features delivered in an agile way too.
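
To make the persistent-store point above concrete, here is a minimal Python sketch of routing writes by how much durability they actually need. Everything in it is an illustrative assumption rather than a description of any real product: the `ManagementStore` class, its method names, and the `quota` table are hypothetical, a bounded in-memory `deque` stands in for a NoSQL or time-series store that is allowed to drop samples, and an in-memory SQLite database stands in for the small "pocket" of SQL where ACID guarantees matter.

```python
import sqlite3
from collections import deque


class ManagementStore:
    """Hypothetical store that routes writes by durability requirement."""

    def __init__(self, max_stats: int = 1000) -> None:
        # Stand-in for the NoSQL layer: bounded, and allowed to lose the
        # oldest samples when it fills up.
        self.stats = deque(maxlen=max_stats)
        # Stand-in for the small SQL "pocket" that needs ACID guarantees.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE quota (tenant TEXT PRIMARY KEY, vms_left INTEGER NOT NULL)"
        )
        self.db.commit()

    def record_stat(self, host: str, metric: str, value: float) -> None:
        # Lossy by design: a full deque silently evicts its oldest entry.
        self.stats.append((host, metric, value))

    def set_quota(self, tenant: str, vms: int) -> None:
        # The connection context manager commits on success and rolls
        # back if an exception is raised.
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO quota VALUES (?, ?)", (tenant, vms)
            )

    def reserve_vm(self, tenant: str) -> bool:
        # A quota must never go negative, so this update is transactional.
        with self.db:
            cur = self.db.execute(
                "UPDATE quota SET vms_left = vms_left - 1 "
                "WHERE tenant = ? AND vms_left > 0",
                (tenant,),
            )
            return cur.rowcount == 1
```

In this sketch, a burst of `record_stat` calls can overwrite old samples without any harm to the system, while `reserve_vm` either decrements the quota atomically or reports that no capacity is left. A real system would make the same split between stores, not between data structures in one process.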

The next-generation IT management system needs to be self-operating as opposed to a set of software components managed by humans.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.
