How to Succeed with Site Reliability Engineering

Here are steps to start and evolve an SRE practice to improve reliability.

Gartner Blog Network

October 5, 2022

Many clients are considering site reliability engineering (SRE), but they often struggle to understand its prerequisites and implications.

Common questions include what SRE actually is, what an SRE does, what skills SREs need, and how to start an SRE practice, gain value from it and evolve it.

Over the past two years, Gartner has discussed SRE with clients in well over 2,000 inquiry calls. The research highlighted below was created to meet our clients' most pressing needs.

What Is Site Reliability Engineering?

  • SRE is a modern approach to operations that supports DevOps at scale by balancing the need for velocity against stability and risk.

  • SRE is a set of engineering principles and practices that focuses on improving customer experience and retention by leveraging service-level objectives to govern how services are managed.

  • Most importantly, SRE is not a simple rebranding of an existing operations team. It requires a collaborative engineering mindset and a demonstrated ability to continually learn, improve and share knowledge.
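One common way to let service-level objectives govern the balance between velocity and stability is an error budget: the share of requests an SLO allows to fail before releases slow down. The article does not prescribe this mechanism, so the following Python sketch is an illustration under assumptions. The class name `SloWindow`, its fields and the release rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO evaluation window (hypothetical structure)."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the objective

    def __post_init__(self) -> None:
        # Assumption: a target of exactly 100% leaves no budget, so reject it.
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.total_requests <= 0:
            raise ValueError("total_requests must be positive")

    def allowed_failures(self) -> float:
        # The error budget: how many requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        # Fraction of the error budget still unspent (negative if overspent).
        return 1.0 - self.failed_requests / self.allowed_failures()

    def can_release(self) -> bool:
        # Assumed policy: keep shipping changes only while budget remains.
        return self.budget_remaining() > 0.0
```

For example, `SloWindow(0.999, 1_000_000, 250)` allows roughly 1,000 failed requests, so about three quarters of the budget is left and `can_release()` returns `True`. With 1,500 failures in the same window the budget is overspent and the same call returns `False`. A real policy would likely use rolling windows and thresholds agreed with product owners.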

For a deeper look at the definition of SRE, see the following Gartner Quick Answer note: https://www.gartner.com/document/4015749

Steps to Start and Evolve an SRE Practice

Organizations are under pressure to drive innovation, relying heavily on digital channels to reach their customers anywhere they are. Keeping up with market demands and customer needs is driving adoption of complex architectures, leading to a combination of cloud-native applications, SaaS, platform as a service, and third-party services and dependencies.

But traditional operating models are not designed to keep pace with the ever-increasing velocity of digital business transformation. In addition, many I&O teams struggle with a lack of skills required to handle emerging technologies and new ways of delivering software products. As a result, I&O teams are unable to achieve business or reliability objectives or meet customer expectations. This skill gap also leads to unnecessary friction between I&O, application development and product owners as they struggle to align goals and collaborate effectively.

Figure: 7 Steps to Start and Evolve an SRE Practice

Gartner has defined a 7-step approach to start and evolve your SRE practice – https://www.gartner.com/document/4019056

Site Reliability Engineer Job Description

Site reliability engineers (SREs) are responsible for improving system reliability and resilience to make it faster and easier to develop and deploy new software capabilities. SREs focus especially on building automation to reduce manual effort and prevent operations incidents.

Gartner provides a sample job description to give a representative overview of the site reliability engineer role. It is designed to be customized to the specific needs and requirements of your organization. It is based on an analysis of publicly available job descriptions from organizations representing a range of industries and geographies. Data was sourced from TalentNeuron, a Gartner tool that analyzes job postings to provide labor market insights.

More insights into the SRE job description can be found here: https://www.gartner.com/document/4009021

Figure: Dimensions of SRE Role

Improve the Reliability of Large, Complex and Distributed IT Systems by Leveraging SRE Principles

Organizations today rely on large, complex, distributed software products that create a supply chain of internal and external service and/or platform providers. However, the varying goals and business models of these nonaligned, disparate providers make it difficult for organizations to mitigate and contain risks, including risks to reliability and resiliency.

Infrastructure reliability isn't just a matter of creating high availability systems or ensuring adequate system performance. It goes beyond redundancy by aligning engineering practices, technology platforms and organizational practices across all relevant internal and external environments.

Therefore, to build the reliability needed to mitigate and contain risks and honor customer commitments, organizations need end-to-end transparency across their software development, procurement and deployment processes. A supply-chain mindset expands the perspective from the software factory into the network of internal and external suppliers. We then pragmatically leverage site reliability engineering (SRE) perspectives to improve reliability further through three recommendations:

  • Adopt a Supply-Chain Mindset to Build Reliable IT Products

  • Build and Maintain a Map of Components and Dependencies to Have End-to-End Visibility to Your IT Product Supply Chain

  • Partner With Suppliers to Optimize Interactions and Improve Reliability
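A map of components and dependencies can start as something very simple. The Python sketch below is one possible illustration, not Gartner's method: it stores the map as a plain dictionary and walks it to list everything a product ultimately depends on, including external suppliers. The function name `transitive_dependencies` and the service names in the example are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set


def transitive_dependencies(
    dependency_map: Dict[str, Iterable[str]], service: str
) -> Set[str]:
    """Return every component that `service` depends on, directly or indirectly."""
    seen: Set[str] = set()
    queue = deque(dependency_map.get(service, ()))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited; also protects against cycles
        seen.add(dep)
        # Follow this dependency's own dependencies (none if it is a leaf).
        queue.extend(dependency_map.get(dep, ()))
    seen.discard(service)  # a cycle should not make a service depend on itself
    return seen


# Hypothetical product supply chain: internal services plus external suppliers.
example_map = {
    "checkout": ["payments-api", "inventory"],
    "payments-api": ["third-party-psp"],
    "inventory": ["managed-db"],
}
```

Calling `transitive_dependencies(example_map, "checkout")` returns the set containing `payments-api`, `inventory`, `third-party-psp` and `managed-db`. That shows the checkout product relies on two external suppliers that do not appear in its direct dependency list.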

This article originally appeared on the Gartner Blog Network.

About the Author

Gartner Blog Network

The Gartner Blog Network has expert views on today’s technology and business topics and trends. 
