Infrastructure-as-Code with OpenStack

Boris Renski is the co-founder of Mirantis and one of the first board members of the OpenStack Foundation. He continuously evangelizes the OpenStack community through his blog, speaking at conferences, and business development activities to attract new member companies and contributors. Follow him on Twitter: @zer0tweets

BORIS RENSKI
Mirantis

Today, the truism of “change is the only constant” is a matter of operational practice for most Internet application vendors. It’s no mystery that the practice of continuous incremental software updates (continuous deployment) helps accelerate engineering velocity and lowers downtime risk compared to infrequent rip-and-replace cycles. Companies like Amazon are known to make material changes to their Web site applications as many as 40 times a day without users ever noticing.

In recent years, continuous deployment has been widely adopted for developing end-user-facing applications such as Web sites or SaaS solutions. However, as we move into the age of software-defined data centers and the boundaries between applications and infrastructure start to blur, continuous deployment is becoming core to ongoing maintenance of the infrastructure layer.

It won’t be long before infrastructure is universally viewed and treated no differently from applications, with continuous infrastructure deployment becoming a standard across the board. After all, what’s really the difference between an IaaS platform like OpenStack and the applications that run on top of it?

Adopting continuous deployment for application development may be optional. After all, you have full control over your internal development process and release cycles. However, I’d like to argue that if you decide to automate your infrastructure with an open IaaS platform like OpenStack, you MUST have a continuous deployment process for your OpenStack infrastructure in place from day one. And here is why: the key value that an open platform like OpenStack brings to the table is vendor independence. That vendor independence is a function of the open community that can support you. The larger the community of folks and companies that know the OpenStack codebase that you run in production, the easier and cheaper it is to support your environment. But, unlike with applications that you develop in-house, community engineering velocity is not something you can control. For OpenStack, the fastest growing open source project in history, the engineering velocity is high!

OpenStack Releases

OpenStack produces two stable releases a year, with intermediate releases every six weeks. If you follow the traditional enterprise infrastructure refresh cycle that spans a few years, OpenStack is not for you. By the time you are ready to update, nobody in the community will even remember the codebase or architecture you are running in production. The bottom line? Successful OpenStack implementation is not about getting your cluster to run. It is about implementing a cloud operations process that continuously keeps your production environment close to the upstream codebase.
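To put a number on that drift, here is a back-of-the-envelope sketch. It assumes only the cadence stated above (two stable releases a year); the function name and the sample refresh cycles are hypothetical and chosen purely for illustration.

```python
def releases_behind(refresh_cycle_years: float, stable_releases_per_year: int = 2) -> int:
    """Stable releases that ship upstream during a single refresh cycle."""
    return int(refresh_cycle_years * stable_releases_per_year)


if __name__ == "__main__":
    # Sample refresh cycles (assumed values, not taken from any survey).
    for years in (0.5, 1, 3, 5):
        print(f"{years}-year refresh cycle: "
              f"{releases_behind(years)} stable release(s) ship before your next update")
```

Under these assumptions, a three-year refresh cycle leaves production six stable releases behind upstream by the time the next update comes around.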

So what does such a cloud operations process look like and how do you implement it? You can use your in-house ops (or DevOps) team to orchestrate it, or an outside vendor like Rackspace or Mirantis to help manage the process end-to-end. Regardless of the parties involved and the actual tools used, the process boils down to some flavor of a traditional CI/CD methodology, augmented with specifics for updating a live OpenStack production environment (rolling updates). I won’t go much into CI/CD, as there are plenty of public materials that cover it, but I would like to share some thoughts on rolling updates.

Updates in Production

The methodology for pushing an update to a production environment is very application specific. A rolling update for OpenStack Compute is different from a rolling update on OpenStack Object Store. Let’s take a peek at what an update to Nova Compute could look like. We’ll assume a simplified scenario, where network storage (over local storage) is used for block devices attached to virtual machines.

At Mirantis, we split all updates into two categories: disruptive and non-disruptive. Disruptive updates typically take place when migrating from one OpenStack version to another and involve modifying some of the components running on the cloud controller node. Non-disruptive updates involve small patches to one of the OpenStack daemons running on local nodes, such as Nova-Compute or Nova-Network. Depending on the category of the update, the process by which you introduce it to the production environment is different.
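As a minimal sketch of that split, the snippet below classifies an update from two inputs. The `Update` type, the `is_disruptive` rule, and the list of controller components are all assumptions made for illustration; your own deployment decides which services actually live on the cloud controller.

```python
from dataclasses import dataclass

# Assumed set of services running on the cloud controller (illustrative only).
CONTROLLER_COMPONENTS = {"nova-api", "nova-scheduler", "database", "message-queue"}


@dataclass(frozen=True)
class Update:
    components: frozenset   # daemons or services the update modifies
    crosses_release: bool   # True when moving from one OpenStack version to another


def is_disruptive(update: Update) -> bool:
    """Treat an update as disruptive if it changes versions or touches the controller."""
    touches_controller = bool(update.components & CONTROLLER_COMPONENTS)
    return update.crosses_release or touches_controller


if __name__ == "__main__":
    patch = Update(frozenset({"nova-compute"}), crosses_release=False)
    upgrade = Update(frozenset({"nova-api", "nova-compute"}), crosses_release=True)
    print("patch disruptive?", is_disruptive(patch))
    print("upgrade disruptive?", is_disruptive(upgrade))
```

The point of encoding the rule, however crude, is that the category can then select the rollout procedure automatically instead of being decided ad hoc for each change.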

A non-disruptive update is, naturally, more straightforward and can be conducted with the following high-level steps:

Evacuate the node that you expect to update by live migrating all of the VMs from it.
Update OpenStack daemons running on the evacuated node, making sure you can roll the updates back to their original state.
Live migrate some VM instances onto the newly updated node.
Repeat the same for the next node.
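
The four steps above can be sketched as a small in-memory simulation. Nothing here talks to OpenStack: the cluster is a plain dictionary, live migration is simulated by moving names between lists, and `apply_update` is a hypothetical callback standing in for whatever tool patches the daemons on a node.

```python
from typing import Callable, Dict, List

Cluster = Dict[str, List[str]]  # node name -> VMs currently hosted on it


def evacuate(cluster: Cluster, node: str) -> None:
    """Step 1: move every VM off `node` onto the least-loaded other node (simulated)."""
    targets = [n for n in cluster if n != node]
    if not targets:
        raise RuntimeError("no other node available to receive VMs")
    for vm in list(cluster[node]):
        dest = min(targets, key=lambda n: len(cluster[n]))
        cluster[dest].append(vm)
        cluster[node].remove(vm)


def rebalance(cluster: Cluster, node: str) -> None:
    """Step 3: move some VMs back onto `node` until the load is roughly even."""
    while True:
        busiest = max(cluster, key=lambda n: len(cluster[n]))
        if len(cluster[busiest]) - len(cluster[node]) <= 1:
            break
        cluster[node].append(cluster[busiest].pop())


def rolling_update(cluster: Cluster, versions: Dict[str, str], new_version: str,
                   apply_update: Callable[[str, str], None]) -> List[str]:
    """Update one node at a time; stop at the first failure. Returns updated nodes."""
    updated = []
    for node in list(cluster):              # step 4: repeat for the next node
        evacuate(cluster, node)             # step 1
        try:
            apply_update(node, new_version)  # step 2 (hypothetical callback)
            versions[node] = new_version
        except Exception:
            # The recorded version is left untouched; a real rollback would
            # reinstall the previous packages on the node at this point.
            rebalance(cluster, node)
            break
        rebalance(cluster, node)            # step 3
        updated.append(node)
    return updated


if __name__ == "__main__":
    cluster = {"node1": ["vm1", "vm2"], "node2": ["vm3", "vm4"], "node3": ["vm5"]}
    versions = {name: "old" for name in cluster}
    done = rolling_update(cluster, versions, "new", lambda node, version: None)
    print("updated:", done)
    print("placement:", cluster)
```

Halting at the first failed node is a deliberate choice in this sketch: it keeps the rest of the cluster on a known-good version while the failure is investigated.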

A disruptive update involving modifications to the cloud controller is more challenging. The only relatively safe way to do this is to free up some capacity on your existing cluster (or add spare capacity), roll out a completely new version of OpenStack Compute on it and gradually migrate VMs away from the old environment.

The steps look something like this:

Assuming you are running a highly available OpenStack cluster, you’ll have at least two cloud controllers for redundancy. Therefore you’ll need to evacuate (or add) at least two empty nodes.
Deploy a new version of OpenStack on the new environment. If you have applications interacting directly with the API, make sure that the new version is backward compatible with the API calls they use. The OpenStack community is notorious for flushing the APIs every now and then.
Evacuate more nodes on the old environment by migrating VM instances from the old environment to the new one. Again, this is going to be tricky, and there is no out-of-the-box way to do it, since your old environment typically isn’t aware of the new OpenStack cluster. You won’t get away with just the Nova live-migration command.
Incorporate the newly evacuated nodes into the new cluster by deploying new Nova daemons on them.
Continue gradually repeating the process until your cluster is fully updated.
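
The overall shape of this side-by-side migration can be sketched as follows. This is a simulation under loud assumptions: both environments are plain dictionaries, every node is treated as able to host VMs (controller and compute roles are not distinguished), the cross-environment VM move is just a list operation because no out-of-the-box mechanism exists for it, and the API compatibility check is left out entirely. All function names are hypothetical.

```python
from typing import Dict, Iterable, List

Env = Dict[str, List[str]]  # node name -> VMs hosted on it


def least_loaded(env: Env, exclude: Iterable[str] = ()) -> str:
    """Pick the node with the fewest VMs, ignoring any excluded nodes."""
    candidates = [n for n in env if n not in exclude]
    if not candidates:
        raise RuntimeError("no node available to receive VMs")
    return min(candidates, key=lambda n: len(env[n]))


def free_up(old: Env, count: int) -> List[str]:
    """Step 1: empty `count` nodes of the old environment and detach them."""
    freed = sorted(old, key=lambda n: len(old[n]))[:count]  # emptiest nodes first
    for node in freed:
        for vm in old[node]:
            old[least_loaded(old, exclude=freed)].append(vm)
        del old[node]
    return freed


def upgrade(old: Env, seed_nodes: int = 2) -> Env:
    """Drain the old environment node by node into a freshly deployed one."""
    # Steps 1-2: free the seed nodes and (conceptually) deploy the new version on them.
    new: Env = {node: [] for node in free_up(old, seed_nodes)}
    while old:                                       # step 5: repeat until done
        node = min(old, key=lambda n: len(old[n]))   # cheapest node to evacuate next
        for vm in old.pop(node):                     # step 3: simulated cross-environment move
            new[least_loaded(new)].append(vm)
        new[node] = []                               # step 4: node rejoins as part of the new cluster
    return new


if __name__ == "__main__":
    old_env = {"n1": ["a", "b"], "n2": ["c"], "n3": ["d", "e"], "n4": []}
    new_env = upgrade(old_env)
    print("old environment:", old_env)
    print("new environment:", new_env)
```

Evacuating the emptiest node first keeps each step small, which matters in practice because every cross-environment move is the risky, manual part of the process.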

While most of these processes are still evolving and have not yet been fully automated, the need for implementing them as part of the standard OpenStack deployment is generally recognized. Today, much of the community development effort around OpenStack is focused on stabilizing the platform and developing additional features. Not much thought is dedicated to architecture and the roadmap as it relates to simplifying the upgrade path. As the project matures, I expect to see some convergence between the efforts to simplify and automate the upgrade path and community-driven feature development.

Consequently, in the near term, upgrading a production OpenStack environment from one version to another should become as typical an operation as spinning up a new virtual machine instance.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating. View previously published Industry Perspectives in our Knowledge Library.
