Why Site Reliability Engineering Is Key to Modern DevOps

Among the hottest areas of growth in DevOps is the emerging field of site reliability engineering as organizations look to bake reliability into the earliest stages of the software development cycle.

DevOps is all about integrating different skill sets, including development and operations, into a cohesive workflow.

An increasingly important element of the DevOps model is site reliability engineering (SRE). In a session at the Interop Digital 2020 event last October, Jayne Groll, CEO of the DevOps Institute, said site reliability engineers are increasingly in demand for DevOps. As in most areas of technology, there are many nuances to the practice of site reliability engineering, to how it relates to the broader topic of DevOps and to the best practices for success.

While SRE is now a hot area, it's one that has been building for the last several years.

Tammy Butow has had the title of site reliability engineer since October 2015, first for two years at Dropbox and for the last three years at chaos engineering vendor Gremlin. Her responsibilities include performing postmortems on outages and improving the mean time to respond to issues, she told ITPro Today. Butow said she prefers to get involved as early as possible in the application development process to help bake service reliability into the core architecture of a project.

"I think that over time, there will be more and more site reliability engineers and more job opportunities as it's really a growing area," Butow said.

The Intersection of DevOps and SRE

There is some debate about how DevOps and site reliability engineering principles intersect, and about whether they should be separate domains.

Leonid Belkind, co-founder and CTO of reliability startup StackPulse, told ITPro Today that he defines site reliability engineering as an implementation of DevOps principles intended to make software services resilient.

Kit Merker, chief operating officer of Nobl9, another reliability startup, told ITPro Today that the terms "DevOps" and "SRE" are often conflated or misused. While SRE and DevOps share similar principles, Merker said his firm is seeing SRE practitioners appear everywhere, taking a very focused approach to improving the reliability of software services.

"SRE is specifically focused on meeting business-defined service-level objectives [SLOs] consistently and efficiently, while DevOps has become a more general term for developer infrastructure and infrastructure automation," he said.

Where SRE Fits Into the Development Lifecycle

In Belkind's view, site reliability engineering begins at the earliest stages of the software development lifecycle—at the planning and architecture stage—and then "injects" itself into every step on the way. Having reliability deeply integrated into the development process is what allows it to be efficient, he added.

"Think about it, what is easier: taking a system that has been developed without any prior thought on its reliability in production and trying to make it reliable, or thinking of how we can make sure it is reliable all the way through planning, development, delivery, refactoring, etc.?" Belkind asked.

Merker shares the notion that SRE needs to be deeply integrated into product development. In modern development, reliability has become a core product feature, he said.

"If you don't define the reliability of a service clearly, you can't engineer a solution that meets those needs," Merker said.

Using Service-Level Objectives to Measure Reliability

According to Belkind, the industry-accepted framework for measuring the reliability of software services is service-level objectives. These objectives have to be connected to business objectives, such as the availability and service level for users for a given service or application.
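As a minimal sketch of how such an objective might be checked, assume a hypothetical availability SLO defined as the fraction of requests served successfully. The function names, the request counts and the 99.9% target below are illustrative assumptions, not something Belkind described:

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (the measured indicator)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def meets_slo(good_requests: int, total_requests: int, target: float) -> bool:
    """True when measured availability is at or above the SLO target."""
    return availability(good_requests, total_requests) >= target


if __name__ == "__main__":
    # Hypothetical numbers: 99,950 good requests out of 100,000,
    # checked against an assumed 99.9% availability target.
    print(meets_slo(99_950, 100_000, target=0.999))
```

In practice the counts would come from whatever monitoring system the organization already runs; the point is only that the objective is a number tied to what users experience.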

"The quality of site reliability is then measured in improvement in the service-level objectives as a function of cost," Belkind said. "Efficient site reliability engineering processes introduce more improvement in SLOs for less cost for the organization."

In Merker's view, the most critical metric for SRE is measuring whether an organization's customers are actually happy with the service that is being delivered.

"You also want to know how many near-miss outages you are preventing before they happen, before they impact users," he said.

Best Practices for SRE

In terms of best practices for SRE, Merker recommends that the first step be clearly defining the reliability goals for each service and then figuring out how much unreliability the organization can tolerate while still delivering an excellent experience to end customers.

"By setting clear, realistic standards of reliability, usually expressed in service-level objectives, an organization can start to run faster while exceeding customer expectations," Merker said.
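The tolerable amount of unreliability Merker describes is commonly called an error budget. A rough sketch of the arithmetic follows, assuming an availability target measured over a rolling 30-day window; the function names, the window length and the example figures are hypothetical:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime the SLO tolerates over the window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Error budget left after the downtime already incurred (may be negative)."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


if __name__ == "__main__":
    # Assumed 99.9% target: roughly 43 minutes of tolerated downtime in 30 days.
    print(round(error_budget_minutes(0.999), 1))
    # After a hypothetical 10-minute outage, roughly 33 minutes remain.
    print(round(budget_remaining(0.999, downtime_minutes=10.0), 1))
```

A team with budget left can afford to ship faster; a team that has spent its budget has a concrete signal to slow down and focus on reliability work.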

For Belkind, the best practices for enabling SRE involve both technical and cultural aspects within the organization. While there are many similarities between organizations that have fully adopted site reliability engineering, every organization ends up finding the right balance that fits its business model and technological stack, he said.

"The biggest hurdle in adopting site reliability engineering is an inability to change the culture," Belkind said.
