Invisible Downtime: The New Measure of Application Performance

Observability across the entire application estate expands the concept of downtime beyond server availability to cover the delivery of secure, exceptional, performant digital experiences.

The digital landscape continues to evolve, and operational teams are realizing they are essentially running blind to issues affecting the end user. This is mainly because traditional indicators of application uptime often do not reflect the true end-user experience.

Digital-first organizations engage with their customers and users through a multitude of digital touchpoints, ranging from POS systems, websites, and collaboration channels to mobile apps and even smart devices. Complex layers of application programming interfaces (APIs), cloud services, virtualization layers, and containerized software facilitate these engagements, spanning cloud, SaaS, and on-premises networks.

Teams that manage application performance, including DevOps, SecOps, AIOps, ITOps, and site reliability engineers (SREs), typically base their evaluations on service-level agreements (SLAs) and other IT-centric key performance indicators (KPIs). However, the increasing shift toward digital business models has revealed that these metrics may not always align with the actual digital experience of end users.

Traditional uptime monitoring, which measures the continuous availability of servers, clouds, and websites, often leaves an "invisible downtime" gap: application performance issues go undetected because they are not occurring at the server level.

Instead, they could be the fault of the application itself, the APIs it calls on to access data, or security failures. Alternatively, poor connectivity, mobile network issues, under-resourced payment gateways, or technical issues with third-party plug-ins can also impact performance.

As a result, even when systems covered by an SLA are operating within accepted parameters, if users are left watching a spinning wheel, the service is effectively unavailable.
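
To make the gap concrete, here is a minimal sketch in Python that contrasts a traditional server-level health check with an end-to-end check of a user journey. The endpoints, the two-second threshold, and the use of the requests library are illustrative assumptions, not a prescribed implementation:

# Minimal sketch: the server-level check can pass while the user journey fails.
# All endpoints and thresholds here are hypothetical examples.
import time
import requests

SERVER_HEALTH_URL = "https://api.example.com/healthz"   # what the SLA typically measures
USER_JOURNEY_URL = "https://www.example.com/checkout"   # what the user actually experiences
MAX_ACCEPTABLE_SECONDS = 2.0                            # illustrative experience threshold

def server_is_up() -> bool:
    """Traditional uptime check: did the server answer at all?"""
    try:
        return requests.get(SERVER_HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False

def user_journey_is_healthy() -> bool:
    """Experience check: did the full transaction succeed quickly enough?"""
    start = time.monotonic()
    try:
        response = requests.get(USER_JOURNEY_URL, timeout=10)
        elapsed = time.monotonic() - start
        return response.ok and elapsed <= MAX_ACCEPTABLE_SECONDS
    except requests.RequestException:
        return False

if server_is_up() and not user_journey_is_healthy():
    print("Invisible downtime: the SLA metric is green, but users see a spinning wheel.")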

This is further compounded by the increasing expectations of end users, who demand constant availability and responsiveness from applications. The root cause of performance problems is seldom clear to the end user, who simply associates their poor experience with the brand. Reliance on SLA metrics provides a skewed view in which operational teams are confronted with an environment of "unknown unknowns" that result in negative business impacts.

Application Performance and Trust — a Direct Correlation

An annual study from PwC reveals how easily trust in brands is damaged, with service-related issues among those commonly experienced. Consumers and employees say these poor experiences often cause them to disengage from a brand; the impact is not trivial.

It is therefore vital that any assessment of application "uptime" and "downtime" includes invisible downtime, represented by degraded application performance. Equally important is the ability to locate the root cause of the problem.

The challenge then becomes how to assess downtime in a way that reflects the user's actual experience. While teams still need to meet their SLAs, they must also ensure applications operate at optimal performance, so strategies to achieve this are vital.

Observability across the entire application ecosystem, anchored in telemetry and enriched with AI- and ML-derived analytics that deliver relevant, impactful business context, has emerged as a modern, fit-for-purpose solution.

This involves a comprehensive view of application performance from the perspective of potential impact on the end user. It requires visibility across every touchpoint: applications, end users, the network, security, and the cloud. This approach allows teams to identify and understand the root causes of invisible downtime.

To achieve this, teams must leverage the vast volumes of telemetry data generated minute by minute by applications, their supporting infrastructure, and their dependencies. Collected through observability solutions, this data offers insight into the state and health of applications. Correlated to business objectives and transformed into actionable insights, it provides shared context that teams can use to calibrate and deliver exceptional, secure digital experiences, optimize for cost and performance, and maximize revenue.
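
To illustrate one way this can work, the sketch below annotates a trace span with business attributes at the point where a transaction is recorded, using the open-source OpenTelemetry Python SDK. The span name, attribute keys, and console exporter are assumptions chosen for illustration:

# Sketch: attaching business context to a trace span with OpenTelemetry (Python SDK).
# The span name and attribute keys below are illustrative, not a standard schema.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())  # swap for an OTLP exporter in practice
)
tracer = trace.get_tracer("checkout-service")

def process_order(order_id: str, order_value: float, customer_tier: str) -> None:
    # Each business transaction becomes a span enriched with business attributes,
    # so performance data can later be correlated with revenue impact.
    with tracer.start_as_current_span("checkout.process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.value_usd", order_value)
        span.set_attribute("customer.tier", customer_tier)
        # ... actual order processing would happen here ...

process_order("A-1001", 129.99, "premium")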

Moreover, with these insights teams can prioritize which downtime issues to address first, based on their potential business impact. Using this information, they can more easily understand the downstream effects that performance issues have on application experiences, and ultimately on business metrics and outcomes.
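
As a simple illustration of impact-based prioritization, the following sketch ranks detected issues by an estimated hourly business cost. The issue fields, figures, and scoring formula are hypothetical and exist only to show the idea:

# Sketch: ranking detected performance issues by estimated business impact.
# The fields, numbers, and scoring formula are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Issue:
    name: str
    affected_users_per_hour: int
    revenue_per_user_usd: float
    error_rate: float  # fraction of affected transactions that fail

    @property
    def estimated_hourly_cost(self) -> float:
        return self.affected_users_per_hour * self.revenue_per_user_usd * self.error_rate

issues = [
    Issue("Slow search API", 12_000, 1.50, 0.05),
    Issue("Payment gateway timeouts", 800, 40.00, 0.30),
    Issue("Image CDN latency", 25_000, 0.20, 0.02),
]

# Address the issues with the largest estimated business impact first.
for issue in sorted(issues, key=lambda i: i.estimated_hourly_cost, reverse=True):
    print(f"{issue.name}: ~${issue.estimated_hourly_cost:,.0f}/hour at risk")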

Taking Aim at Invisible Downtime

Successfully combating invisible downtime requires an expanded view of the inputs and outputs of observable systems. Inputs include application and infrastructure stacks, while outputs include business transactions and user experiences.

A vast array of disparate monitoring tools are available, each with a limited view into the application stack.

Some monitor network and infrastructure performance based on static baseline metrics. Some provide visibility into applications but lack a full stack view of the network, infrastructure, and clouds. Others provide alerts on any type of "anomalous" behavior with little or no interpretation or prioritization. None of them effectively leverages traces to track the path of requests from the user to endpoints across distributed application topologies.
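
For readers unfamiliar with distributed tracing, the sketch below shows what tracking a request's path can involve: a trace context is created in one service and propagated to the next, so spans from different services link back to the same user request. It uses the open-source OpenTelemetry Python SDK as one possible implementation; the service and span names are hypothetical:

# Sketch: propagating a trace across service boundaries so one request's path
# can be followed end to end. Service and span names are hypothetical.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("frontend")

def call_downstream_service() -> dict:
    # The frontend starts a span and injects its trace context into the
    # outgoing request headers (W3C traceparent by default).
    with tracer.start_as_current_span("frontend.request"):
        headers: dict = {}
        inject(headers)
        return headers  # in practice, sent with the HTTP call to the next service

def handle_in_downstream(headers: dict) -> None:
    # The downstream service extracts the same context, so its span is linked
    # to the original user request in the trace.
    ctx = extract(headers)
    with tracer.start_as_current_span("payments.charge", context=ctx):
        pass  # payment logic would go here

handle_in_downstream(call_downstream_service())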

As teams deploy these different monitoring solutions, they encounter yet more IT complexity and more limitations. Coupled with an already complicated cloud-native landscape, the task of collecting, collating, and understanding the signals hidden in vast streams of telemetry data is beyond human scale. Arriving at meaningful insight soon enough to solve near-real-time problems is nearly impossible.

Broad and contextualized observability across the entire service delivery chain and application stack is critical to resolving this dilemma, but alone it is not enough. Business context is the missing link to successfully correlating the technical performance of the IT stack — security, functions, applications, infrastructure, and operations — with business transactions and outcomes.

This type of observability makes it possible to sort through the deluge and, using advanced AI and ML analytics, automatically identify where and how problems are evolving across the entire IT estate so remediation efforts can begin.
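
As a greatly simplified stand-in for that kind of analysis, the sketch below flags points where a latency series drifts well outside its recent baseline. Production AI- and ML-driven analytics are far more sophisticated; the window, threshold, and sample values here are illustrative assumptions:

# Greatly simplified stand-in for baseline-deviation analysis: flag when a
# metric drifts well outside its recent baseline. Numbers are illustrative.
from statistics import mean, stdev

def flag_anomalies(latencies_ms: list, window: int = 20, threshold: float = 3.0) -> list:
    """Return indices where latency deviates more than `threshold` sigmas from the trailing baseline."""
    anomalies = []
    for i in range(window, len(latencies_ms)):
        baseline = latencies_ms[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(latencies_ms[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies

# Example: steady ~200 ms latency with a sudden degradation at the end.
samples = [200 + (i % 5) for i in range(40)] + [850.0]
print(flag_anomalies(samples))  # -> [40]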

It shortens the time between action and reaction, placing application performance in a holistic frame rather than siloed across internal IT domains or isolated to line-of-business workflows. The result is an organization-wide reorientation toward applications and the digital experiences they empower, tying IT issues, security, and even compute and resource allocation to proactive planning and business decisions.

Modern problems require modern solutions. Many organizations still have little or no visibility into the digital experiences of their end users, despite significant investment in meeting their uptime SLAs. They need comprehensive observability in their own context to put an end to invisible downtime.

Joe Byrne is CTO Advisor for Cisco Observability.
