Site Reliability Engineering Teams Face Rising Challenges

Catchpoint's 2025 SRE Report shows reliability teams are spending more time on operational tasks while grappling with evolving performance expectations.

Sean Michael Kerner, Contributor

January 16, 2025


The landscape of site reliability engineering (SRE) is undergoing significant changes. Catchpoint's SRE Report 2025 highlights emerging challenges in performance management, operational burdens, and organizational priorities. The study provides insights into how organizations are managing reliability and resilience.

The report indicates that while AI adoption continues to grow, it hasn't reduced operational burdens as expected. Performance issues are now considered as critical as complete outages. Organizations are also grappling with balancing release velocity against reliability requirements.

Key findings from the 2025 report:

• Toil levels increased for the first time in five years (median 20%, up from 14% in 2024).

• 53% agree that poor performance is as damaging as complete downtime.

• 41% feel pressured to prioritize releases over reliability.

• 61% use between two and five monitoring tools.

• 40% handled between one and five incidents in the past 30 days.

For Mehdi Daoudi, CEO of Catchpoint, the biggest surprise in the report this year was the first increase in toil levels in five years.

Chart: Percent of work that is toil over the last five years.

"Despite the widespread adoption of AI tools, which were expected to reduce toil, the report shows that toil levels have actually risen," Daoudi told ITPro Today. "This unexpected finding challenges the assumption that AI would alleviate operational burdens and highlights the complexity of integrating AIOps into SRE practices."


Why SRE Toil Is Rising

Daoudi suspects that several factors contributed to the unexpected rise in toil levels.

The first is the maintenance of AI systems themselves, which includes updating models and managing GPU clusters. AI systems also often need manual supervision because their errors can be subtle and hard to predict, which adds to the operational load.


Additionally, the time freed up when AI speeds up valuable activities may end up being filled with toilsome tasks, he said.

"This trend could impact the future of SRE practices by necessitating a more nuanced approach to AI integration, focusing on balancing automation with the need for human oversight and continuous improvement," Daoudi said.

Beyond AI, Daoudi also suspects that organizations are incorrectly evaluating toolchain investments. In his view, despite all the investments in inward-focused application performance management (APM) tools, there are still too many incidents, and the report shows a sentiment that observability instrumentation is insufficient.


"We believe these tools are generating too much noise and inactionable telemetry," he said. "There is an opportunity to fix this by incorporating DEM (Digital Experience Monitoring) or IPM (Internet Performance Monitoring) to gain visibility into the complete user experience."

How to Strike a Better Balance between Velocity and Stability

A key finding in the report is that 41% of SREs reported feeling pressured to prioritize release schedules over reliability.

Daoudi has a few suggestions for organizations to strike a better balance between velocity and stability:

  • Clear communication of objectives and key results (OKRs): Ensuring that objectives and key results are clearly communicated can help align team efforts with business goals and foster transparency.

  • Two-way communication: Establishing two-way communication between business and reliability practitioners can help adapt strategies and allocate resources effectively.

  • Reusable capabilities: Building reusable and adaptable capabilities can help manage shifting priorities and enhance resilience.

Defining Performance for SREs in 2025

The report emphasizes that "slow is the new down," with 53% of respondents agreeing. Yet many traditional SLAs still focus primarily on uptime.

As such, it's important that organizations evolve their service-level objectives (SLOs) to better reflect this shifting paradigm of what constitutes acceptable performance.

Daoudi recommends that organizations evolve their SLOs by:

  1. Prioritizing performance metrics: Tracking performance indicators against service-level objectives and experience-level objectives can help ensure that performance is measured as a critical dimension beyond just uptime.

  2. Implementing error budgets: Using error budgets to allocate resources for upholding service standards can reduce the risk of performance degradation.

  3. Holistic performance practices: Adopting a holistic approach to performance practices that considers both internal and external perspectives can help manage performance more effectively.

"By focusing on these strategies, organizations can better align their SLOs with the evolving expectations of digital performance and customer experience," Daoudi said.
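To make the first two recommendations concrete, here is a minimal Python sketch of a latency-aware service-level indicator paired with an error budget. It is an illustration only and does not come from Catchpoint or the report: the Request type, the function names, the 500 ms threshold, the 98% target, and the sample traffic are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool           # True if the request succeeded (hypothetical field)
    latency_ms: float  # observed response time in milliseconds (hypothetical field)


def good_fraction(requests, latency_threshold_ms):
    """Fraction of requests that both succeeded and met the latency threshold.

    A slow success counts as "bad", which is one way to encode
    "slow is the new down" in an indicator.
    """
    if not requests:
        raise ValueError("no requests to evaluate")
    good = sum(1 for r in requests if r.ok and r.latency_ms <= latency_threshold_ms)
    return good / len(requests)


def error_budget_remaining(sli, slo_target):
    """Share of the error budget still unspent.

    1.0 means the budget is untouched; 0.0 or below means it is exhausted.
    """
    budget = 1.0 - slo_target  # allowed fraction of bad requests
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    spent = 1.0 - sli          # observed fraction of bad requests
    return 1.0 - spent / budget


# Made-up sample: 990 fast successes, 6 slow successes, 4 failures.
sample = (
    [Request(True, 120.0)] * 990
    + [Request(True, 900.0)] * 6
    + [Request(False, 50.0)] * 4
)

uptime_only = good_fraction(sample, float("inf"))  # ignores latency entirely
latency_aware = good_fraction(sample, 500.0)       # assumed 500 ms threshold

print(f"uptime-only SLI:   {uptime_only:.3f}")
print(f"latency-aware SLI: {latency_aware:.3f}")
print(f"budget remaining:  {error_budget_remaining(latency_aware, 0.98):.2f}")
```

In this made-up sample, 996 of 1,000 requests succeed but only 990 are both successful and fast, so the latency-aware indicator is lower than an uptime-only one and spends the error budget faster. The right threshold and target depend on the service and its users.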

About the Author

Sean Michael Kerner

Contributor

Sean Michael Kerner is an IT consultant, technology enthusiast and tinkerer. He consults to industry and media organizations on technology issues.

https://www.linkedin.com/in/seanmkerner/
