Recognizing Signs of Trouble in Your Kubernetes Environment
Traditional observability tools fall short in capturing Kubernetes' complexity; modern solutions must go beyond metrics and logs to deliver proactive, holistic management of cloud-native environments.
January 17, 2025
By Itiel Shwartz, Komodor
As an IT leader overseeing cloud-native environments, you've likely invested heavily in monitoring and observability to track the health and performance of your applications and infrastructure. While many people think of Kubernetes as just another puzzle piece of modern infrastructure, it is in fact a highly complex environment consisting of thousands of moving parts.
The primary focus of observability tools is to track telemetry including metrics, events, logs, and traces — commonly known as MELT data. But is this traditional approach enough to manage the intricacies of Kubernetes (K8s) environments?
MELT: Great, But Not Good Enough
MELT data has long been key to observability, helping site reliability engineering (SRE) teams ensure availability and user experience. However, modern Kubernetes-based applications introduce complex, hidden layers of infrastructure. This makes them ill-suited for traditional observability approaches, which lack visibility into the full stack of Kubernetes infrastructure components that span workloads, native resources, and a complex ecosystem of add-ons (including popular custom resource definitions, or CRDs, and operators).
Traditional tools also often lack native Kubernetes context, which makes it difficult for them to provide accurate insights into cluster behavior and application health. This leaves significant blind spots when it comes to understanding what's really happening within your clusters.
APMs (Application Performance Monitoring tools) were designed to monitor and manage the performance, availability, and health of applications. They provide insights into application behavior by tracking MELT data to identify and diagnose performance issues. To monitor an application's underlying infrastructure, whether on-prem or in the cloud, enterprises must use separate monitoring tools that provide data on uptime, errors, utilization, etc.
Kubernetes, meanwhile, sits between applications and their underlying infrastructure, serving as the essential base platform and container orchestrator. It doesn't just run applications but orchestrates containers across nodes and clusters, ensuring resource allocation, scaling, and uptime. By abstracting infrastructure complexities, Kubernetes enables efficient deployment and management, handling load balancing, service discovery, and updates to maintain performance and resilience in dynamic environments.
A single application failure might stem not from the app itself but from the underlying Kubernetes infrastructure: an overloaded node, a misconfigured network policy, or a failed dependency in a third-party add-on, CRD, or operator.
For example, a CPU spike in one container might cascade through the system, slowing or failing other workloads. Traditional tools might detect the spike but lack context, which forces engineers to manually investigate the issue and play catch-up instead of proactively addressing the root cause.
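As a rough illustration of what "context" could mean here, the sketch below takes a spiking pod and reports how saturated its node is and which other pods share that node. The `PodSample` type, the `spike_context` function, and all sample values are hypothetical; a real tool would pull this data from the cluster rather than from a hard-coded list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodSample:
    """One CPU usage sample for a pod (hypothetical shape, for illustration)."""
    name: str
    node: str
    cpu_millicores: int  # observed CPU usage


def spike_context(pod_name, samples, node_capacity):
    """Return node-level context for a pod whose CPU usage has spiked.

    samples: list of PodSample
    node_capacity: dict mapping node name to allocatable CPU in millicores
    """
    by_name = {s.name: s for s in samples}
    spiking = by_name[pod_name]

    # Everything scheduled on the same node competes for the same CPU.
    co_located = [s for s in samples if s.node == spiking.node]
    used = sum(s.cpu_millicores for s in co_located)

    return {
        "node": spiking.node,
        "node_utilization": used / node_capacity[spiking.node],
        "neighbors": sorted(s.name for s in co_located if s.name != pod_name),
    }


# Illustrative data only.
samples = [
    PodSample("checkout", "node-a", 3500),
    PodSample("cart", "node-a", 300),
    PodSample("search", "node-b", 200),
]
capacity = {"node-a": 4000, "node-b": 4000}

print(spike_context("checkout", samples, capacity))
```

The point of the sketch is that the spike alone says little; pairing it with node utilization and the list of co-located workloads shows which other services are likely to be affected next.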
Beyond Observability
In modern cloud-native environments, observability must evolve beyond MELT data and basic dashboards. Engineers need an automated, holistic approach that doesn't just provide raw data but intelligently correlates events, metrics, and signals across the entire Kubernetes stack: workloads, native resources, and add-ons.
This new approach to Kubernetes management can be likened to upgrading from a weather forecast to a full climate model. You don't just want to know it's raining in one part of the system; you need to understand how that rain might cause flooding in another area, potentially triggering a larger disaster. Modern management tools aim to provide this broader perspective by analyzing the interdependencies between components and predicting where trouble might arise.
Consider an e-commerce application experiencing SSL certificate errors that prevent customers from accessing the site. Traditional APM might show HTTP 495/496 errors (nonstandard status codes that NGINX uses for SSL certificate problems), but a holistic management approach could automatically correlate these failures with a failing cert-manager operator. This includes tracing the exact chain of events (the expired certificate triggering the SSL errors, the failed cert-manager renewal attempts, and the underlying ClusterIssuer connectivity issues), all in a single view.
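One way to picture that single view is as a walk along a dependency chain, collecting the events attached to each object from the symptom down to the likely root cause. The sketch below assumes a hand-written dependency map and made-up event messages; the resource names, the `trace_chain` function, and the event text are all hypothetical and are not real cert-manager output.

```python
# Hypothetical dependency map: each object points at the object it relies on.
DEPENDS_ON = {
    "ingress/shop": "certificate/shop-tls",
    "certificate/shop-tls": "clusterissuer/letsencrypt",
}

# Hypothetical events as (object, message) pairs. Messages are illustrative.
EVENTS = [
    ("ingress/shop", "clients receiving SSL certificate errors"),
    ("certificate/shop-tls", "certificate expired"),
    ("certificate/shop-tls", "renewal attempt failed"),
    ("clusterissuer/letsencrypt", "cannot reach issuer endpoint"),
]


def trace_chain(symptom, depends_on, events):
    """Follow dependencies from a symptom and gather events at each hop.

    Returns a list of (object, [messages]) ordered from symptom to root.
    """
    chain = []
    seen = set()
    current = symptom
    # The 'seen' set guards against cycles in a malformed dependency map.
    while current is not None and current not in seen:
        seen.add(current)
        messages = [msg for obj, msg in events if obj == current]
        chain.append((current, messages))
        current = depends_on.get(current)
    return chain


for obj, messages in trace_chain("ingress/shop", DEPENDS_ON, EVENTS):
    print(obj, "->", "; ".join(messages) or "no events")
```

In practice the hard part is building the dependency map automatically; the traversal itself is simple once the relationships between resources are known.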
Implementing Continuous Optimization
Managing Kubernetes is not a "set it and forget it" endeavor. The complexity and dynamic nature of these environments require teams to go beyond reactive problem-solving and adopt a proactive approach to optimization. This means continuously analyzing performance data, fine-tuning configurations, and evolving with the environment's changing demands.
This requires the ability to correlate signals across workloads, infrastructure, and add-ons to build a complete picture of your ecosystem. These insights can be used to automate routine actions, such as scaling down underutilized resources or adjusting network policies, so engineers can focus on strategic improvements.
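As a minimal sketch of the kind of routine check that could feed such automation, the function below flags workloads whose observed CPU usage is far below what they request. The field names, the `underutilized` function, and the 30% threshold are assumptions chosen for illustration, not recommended values.

```python
def underutilized(workloads, threshold=0.3):
    """Return names of workloads using less than `threshold` of their CPU request.

    workloads: list of dicts with 'name', 'cpu_request_m', and 'cpu_usage_m'
    (hypothetical field names; values in millicores).
    """
    flagged = []
    for w in workloads:
        request = w["cpu_request_m"]
        if request <= 0:
            # No request set, so there is no ratio to compare against.
            continue
        if w["cpu_usage_m"] / request < threshold:
            flagged.append(w["name"])
    return flagged


# Illustrative data only.
workloads = [
    {"name": "api", "cpu_request_m": 1000, "cpu_usage_m": 120},
    {"name": "worker", "cpu_request_m": 500, "cpu_usage_m": 400},
    {"name": "batch", "cpu_request_m": 0, "cpu_usage_m": 50},
]

print(underutilized(workloads))
```

A production version would look at usage over a time window rather than a single sample, since a workload that is idle now may be busy at peak hours.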
Another key practice is to prioritize observability into third-party dependencies, such as add-ons or CRDs. These often act as critical links in the Kubernetes stack, and blind spots here can cascade into system-wide issues. Proactively assessing the reliability and impact of these dependencies, coupled with automated alerts for failures, can mitigate risks before they disrupt operations.
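A simple form of automated alerting on dependencies is to check the status conditions that Kubernetes resources conventionally report (entries with `type`, `status`, and `reason` fields) and raise an alert for anything that is not ready. The sketch below works on plain dictionaries; the `failing_dependencies` function, the resource names, and the reason strings are hypothetical.

```python
def failing_dependencies(resources):
    """Return alert strings for resources whose Ready condition is not 'True'.

    resources: list of dicts with 'kind', 'name', and 'conditions', where each
    condition is a dict with 'type', 'status', and optionally 'reason'.
    """
    alerts = []
    for res in resources:
        label = f"{res['kind']}/{res['name']}"
        ready = next(
            (c for c in res.get("conditions", []) if c.get("type") == "Ready"),
            None,
        )
        if ready is None:
            # A missing condition is itself a signal worth surfacing.
            alerts.append(f"{label}: no Ready condition reported")
        elif ready.get("status") != "True":
            alerts.append(f"{label}: not ready ({ready.get('reason', 'unknown')})")
    return alerts


# Illustrative data only; names and reasons are made up.
resources = [
    {"kind": "Certificate", "name": "shop-tls",
     "conditions": [{"type": "Ready", "status": "False", "reason": "Expired"}]},
    {"kind": "ClusterIssuer", "name": "letsencrypt",
     "conditions": [{"type": "Ready", "status": "True"}]},
    {"kind": "Certificate", "name": "admin-tls", "conditions": []},
]

for alert in failing_dependencies(resources):
    print(alert)
```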
The goal isn't just to maintain performance but to build resilience — an environment that not only detects issues but also provides actionable insights to prevent them, ultimately reducing downtime and enhancing user experience.
About the author:
Itiel Shwartz, CTO and co-founder of Komodor, is an expert in Kubernetes, cloud-native technologies, and infrastructure. He has served in technical leadership roles at eBay, Forter, and Rookout.