Insight and analysis on the information technology space from industry thought leaders.

Recognizing Signs of Trouble in Your Kubernetes Environment

Traditional observability tools fall short in capturing Kubernetes' complexity; modern solutions must go beyond metrics and logs to deliver proactive, holistic management of cloud-native environments.

Industry Perspectives

January 17, 2025


By Itiel Shwartz, Komodor

As an IT leader overseeing cloud-native environments, you've likely invested heavily in monitoring and observability to track the health and performance of your applications and infrastructure. While many people think of Kubernetes as just another puzzle piece of modern infrastructure, it is in fact a highly complex environment consisting of thousands of moving parts.

The primary focus of observability tools is to track telemetry including metrics, events, logs, and traces — commonly known as MELT data. But is this traditional approach enough to manage the intricacies of Kubernetes (K8s) environments?

MELT: Great, But Not Good Enough

MELT data has long been key to observability, helping site reliability engineering (SRE) teams ensure availability and user experience. However, modern Kubernetes-based applications introduce complex, hidden layers of infrastructure. This makes them ill-suited for traditional observability approaches, which lack visibility into the full stack of Kubernetes infrastructure components that span workloads, native resources, and a complex ecosystem of add-ons (including popular CRDs and operators).

Traditional tools also often lack native Kubernetes context, which makes it difficult for them to provide accurate insights into cluster behavior and application health. This leaves significant blind spots when it comes to understanding what's really happening within your clusters.

APMs (Application Performance Monitoring tools) were designed to monitor and manage the performance, availability, and health of applications. They provide insights into application behavior by tracking MELT data to identify and diagnose performance issues. To monitor an application's underlying infrastructure, whether on-prem or in the cloud, enterprises must use separate monitoring tools that provide data on uptime, errors, utilization, etc.

Kubernetes, meanwhile, sits between applications and their underlying infrastructure, serving as the essential base platform and container orchestrator. It doesn't just run applications but orchestrates containers across nodes and clusters, ensuring resource allocation, scaling, and uptime. By abstracting infrastructure complexities, Kubernetes enables efficient deployment and management, handling load balancing, service discovery, and updates to maintain performance and resilience in dynamic environments.

A single application failure might not stem from an issue within the app itself but from the underlying Kubernetes infrastructure: an overloaded node, a misconfigured network policy, or even a failed dependency in a third-party add-on, CRD, or operator.

For example, a CPU spike in one container might cascade through the system, slowing or failing other workloads. Traditional tools might detect the spike but lack context, which forces engineers to manually investigate the issue and play catch-up instead of proactively addressing the root cause.
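The missing context in that scenario can be illustrated with a minimal Python sketch. It assumes per-container samples that already carry the node name; the `Sample` fields, thresholds, and workload names are hypothetical and invented for illustration, not taken from any particular tool. The idea is simply to join a CPU spike with the latency of other containers on the same node in the same minute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    minute: int            # minute offset within the observation window
    node: str              # node the container was scheduled on
    container: str
    cpu_millicores: int
    p95_latency_ms: float


def find_spikes(samples, cpu_threshold_m):
    """Return samples whose CPU usage meets or exceeds the threshold."""
    return [s for s in samples if s.cpu_millicores >= cpu_threshold_m]


def affected_neighbors(samples, spike, latency_threshold_ms):
    """Containers on the spike's node that were slow during the same minute."""
    return sorted({
        s.container
        for s in samples
        if s.node == spike.node
        and s.minute == spike.minute
        and s.container != spike.container
        and s.p95_latency_ms >= latency_threshold_ms
    })


# Hypothetical telemetry, invented for illustration.
SAMPLES = [
    Sample(0, "node-a", "recommender", 300, 40.0),
    Sample(0, "node-a", "cart", 150, 35.0),
    Sample(0, "node-b", "search", 200, 30.0),
    Sample(1, "node-a", "recommender", 1900, 45.0),
    Sample(1, "node-a", "cart", 160, 480.0),
    Sample(1, "node-b", "search", 210, 31.0),
]

for spike in find_spikes(SAMPLES, cpu_threshold_m=1500):
    neighbors = affected_neighbors(SAMPLES, spike, latency_threshold_ms=250.0)
    print(f"{spike.container} spiked on {spike.node}; slow neighbors: {neighbors}")
```

A metrics-only view would surface the spike and the slow service as two unrelated alerts; the node name is the piece of Kubernetes context that ties them together.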

Beyond Observability

In modern cloud-native environments, the future of observability must evolve beyond MELT data and basic dashboards. Engineers need an automated, holistic approach that doesn't just provide raw data but intelligently correlates events, metrics, and signals across the entire Kubernetes stack discussed earlier.

This new approach to Kubernetes management can be likened to upgrading from a weather forecast to a full climate model. You don't just want to know it's raining in one part of the system; you need to understand how that rain might cause flooding in another area, potentially triggering a larger disaster. Modern management tools aim to provide this broader perspective by analyzing the interdependencies between components and predicting where trouble might arise.

Consider an e-commerce application experiencing SSL certificate errors that prevent customers from accessing the site. Traditional APM might show HTTP 495/496 errors (nonstandard status codes that NGINX uses for certificate problems), but a holistic management approach could automatically correlate these failures with a failing cert-manager operator. This includes tracing the exact chain of events, all in a single view: the expired certificate triggering the SSL errors, the failed cert-manager renewal attempts, and the underlying ClusterIssuer connectivity issues.
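One way to picture that single view is as a walk over a dependency map followed by a time-ordered merge of events. The sketch below is an assumption-laden illustration in Python: the resource names, the `DEPENDS_ON` map, the timestamps, and the event messages are all hypothetical, and a real system would derive them from the cluster rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: str   # ISO 8601 in UTC, so plain string order is time order
    resource: str
    message: str


# Hypothetical dependency map: each resource points at what it relies on.
DEPENDS_ON = {
    "ingress/shop": "certificate/shop-tls",
    "certificate/shop-tls": "clusterissuer/letsencrypt",
}

# Hypothetical events, invented for illustration.
EVENTS = [
    Event("2025-01-12T00:01:00Z", "ingress/shop", "TLS handshake errors"),
    Event("2025-01-10T08:05:00Z", "certificate/shop-tls", "renewal attempt failed"),
    Event("2025-01-11T09:00:00Z", "deployment/search", "scaled to 4 replicas"),
    Event("2025-01-12T00:00:00Z", "certificate/shop-tls", "certificate expired"),
    Event("2025-01-10T08:00:00Z", "clusterissuer/letsencrypt", "ACME server unreachable"),
]


def dependency_path(start, depends_on):
    """Follow the dependency map from the symptom down to its last dependency."""
    path, current = [], start
    while current is not None and current not in path:  # guard against cycles
        path.append(current)
        current = depends_on.get(current)
    return path


def trace_chain(symptom_resource, depends_on, events):
    """Events on the symptom's dependency path, oldest (likely cause) first."""
    path = set(dependency_path(symptom_resource, depends_on))
    return sorted(
        (e for e in events if e.resource in path),
        key=lambda e: e.timestamp,
    )


for event in trace_chain("ingress/shop", DEPENDS_ON, EVENTS):
    print(event.timestamp, event.resource, event.message)
```

Ordering by time is only a heuristic for causality, but it filters out unrelated activity (the `deployment/search` event here) and puts the earliest failure on the path at the top of the list.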

Implementing Continuous Optimization

Managing Kubernetes is not a "set it and forget it" endeavor. The complexity and dynamic nature of these environments require teams to go beyond reactive problem-solving and adopt a proactive approach to optimization. This means continuously analyzing performance data, fine-tuning configurations, and evolving with the environment's changing demands.

This requires the ability to correlate signals across workloads, infrastructure, and add-ons to achieve a complete picture of your ecosystem. These insights can be used to automate routine actions, such as scaling down underutilized resources or adjusting network policies, so engineers can focus on strategic improvements.
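As a rough illustration of one such routine action, the Python sketch below flags workloads whose observed CPU usage sits well below their request and proposes a smaller request. The 50% utilization cutoff, the 30% headroom, and the workload figures are assumptions chosen for the example, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadUsage:
    name: str
    cpu_request_m: int   # requested CPU in millicores
    cpu_p95_m: int       # observed 95th-percentile usage in millicores


def rightsizing_suggestions(workloads, max_utilization_pct=50, headroom_pct=30):
    """Map underutilized workload names to a suggested CPU request (millicores).

    A workload counts as underutilized when its p95 usage is below
    max_utilization_pct of its request. Integer math avoids rounding surprises.
    """
    suggestions = {}
    for w in workloads:
        if w.cpu_p95_m * 100 < max_utilization_pct * w.cpu_request_m:
            scaled = w.cpu_p95_m * (100 + headroom_pct)
            suggestions[w.name] = -(-scaled // 100)  # ceiling division
    return suggestions


# Hypothetical usage figures, invented for illustration.
WORKLOADS = [
    WorkloadUsage("checkout", cpu_request_m=1000, cpu_p95_m=200),
    WorkloadUsage("search", cpu_request_m=500, cpu_p95_m=450),
]

print(rightsizing_suggestions(WORKLOADS))
```

In practice a suggestion like this would feed a review step or an autoscaler rather than be applied blindly, since usage percentiles depend heavily on the observation window.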

Another key practice is to prioritize observability into third-party dependencies, such as add-ons or CRDs. These often act as critical links in the Kubernetes stack, and blind spots here can cascade into system-wide issues. Proactively assessing the reliability and impact of these dependencies, coupled with automated alerts for failures, can mitigate risks before they disrupt operations.

The goal isn't just to maintain performance but to build resilience — an environment that not only detects issues but also provides actionable insights to prevent them, ultimately reducing downtime and enhancing user experience.

About the author:

Itiel Shwartz, CTO and co-founder of Komodor, is an expert in Kubernetes, cloud-native technologies, and infrastructure. He has served in technical leadership roles at eBay, Forter, and Rookout.
