[Image: Formula One team reviewing telemetry data on computers. Credit: Alamy]

How to Trust Your Telemetry Data

To effectively address the complexities of telemetry data, organizations need to embrace a strategic framework composed of three key phases: understanding, optimizing, and responding to telemetry data.

In the era of digitization, the surge in data has been unparalleled. At the heart of this information overload is telemetry data: the logs, metrics, and traces generated by the many systems, machines, applications, and services that power digital operations. Telemetry data provides a constant stream of information, including performance metrics, user sentiment, signals of workflow bottlenecks, and the digital signatures of bad actors. That stream affects operational efficiency and holds valuable business insights.

However, organizations face a fundamental challenge as the volume of telemetry data grows. The cost and complexity of handling this data increase disproportionately to the value it delivers. This discrepancy poses a critical problem, particularly in cybersecurity and observability.

The unique nature of telemetry data compounds this challenge. Telemetry data is dynamic, continuously growing, and always changing, with sporadic spikes. Additionally, enterprises struggle to deliver the right data confidently over time: There is uncertainty about the content of the data, concerns about completeness, and worries about sending sensitive personally identifiable information (PII) in data streams. These factors reduce trust in the collected and distributed data. Moreover, needs and access requirements can change based on evolving circumstances. What might seem insignificant during normal operations becomes crucial during a security breach or system performance incident. This unpredictability necessitates a paradigm shift in how organizations manage and extract value from telemetry data.

To effectively grapple with the complexities of telemetry data, organizations must adopt a strategic framework consisting of three phases: understanding, optimizing, and responding to telemetry data. The understanding phase involves knowing where telemetry data originates, what it contains, and how to pull signal out of the noise. The optimizing phase reduces costs and increases insights by selectively filtering, routing, and transforming data. Data drift, code changes, or a security incident may require you to adjust the data you collect and how you transform it, so the responding phase requires swift adaptation to such changes by adjusting data streams for analysis. This framework offers a systematic approach to managing the life cycle of telemetry data.

Understanding Telemetry Data

For modern IT and security operations, telemetry pipelines bridge the gap between telemetry data and downstream tools and data lakes. They enable teams to transform, control, enrich, and route data, which helps to optimize costs, enhance data utility, and ensure compliance.
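To make the idea concrete, the sketch below shows a minimal, hypothetical pipeline in Python: events flow through a chain of processors that enrich, control, and filter them before being routed to one or more destinations. The function and field names are illustrative assumptions, not any particular product's API.

    def redact_email(event):
        # Enrich/comply: mask a PII field if it is present.
        if "email" in event:
            event["email"] = "<redacted>"
        return event

    def drop_debug(event):
        # Control: filter out low-value events entirely.
        return None if event.get("level") == "debug" else event

    def run_pipeline(events, processors, destinations):
        for event in events:
            for process in processors:
                event = process(event)
                if event is None:          # event was filtered out
                    break
            else:
                for send in destinations:
                    send(event)

    events = [
        {"level": "info", "msg": "login ok", "email": "user@example.com"},
        {"level": "debug", "msg": "cache miss"},
    ]
    run_pipeline(events, [redact_email, drop_debug], [print])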

However, managing telemetry pipelines for systems developed by many different teams becomes an issue, particularly for security or site reliability engineers (SREs) responsible for maintaining service reliability or protecting the company. As other teams push changes, the logs and metrics within telemetry data shift. Understanding incoming data is crucial to efficiently manage systems and adhere to service-level agreements (SLAs).

To address this challenge, profiling the data helps teams analyze it intelligently and identify patterns that yield valuable insights. This assessment evaluates data schema and taxonomy for repeatable and redundant patterns. It also involves detecting drift and anomalies and managing such changes effectively. Data profiling enables organizations to answer strategic questions about data comprehension and duplicate data management, ensuring that telemetry data becomes a strategic asset throughout its life cycle.
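As a rough illustration, profiling can start with something as simple as tallying which fields appear in a batch of events and comparing that profile against a baseline to surface drift. The Python sketch below assumes a simple event structure; production profilers also cluster message templates and value distributions.

    from collections import Counter

    def profile_fields(events):
        # Count how often each field name appears across a batch of events.
        counts = Counter()
        for event in events:
            counts.update(event.keys())
        return counts

    def detect_drift(baseline_profile, new_profile):
        # Flag fields that appeared or disappeared relative to the baseline.
        added = set(new_profile) - set(baseline_profile)
        missing = set(baseline_profile) - set(new_profile)
        return {"added_fields": added, "missing_fields": missing}

    baseline = profile_fields([{"ts": 1, "level": "info", "msg": "ok"}])
    current = profile_fields([{"ts": 2, "severity": "info", "msg": "ok"}])
    print(detect_drift(baseline, current))
    # {'added_fields': {'severity'}, 'missing_fields': {'level'}}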

Optimizing for Efficiency

After gaining a solid understanding of the data, the focus shifts to optimization. Noisy data incurs higher costs. Techniques such as identifying unstructured telemetry data, eliminating low-value and repetitive information, and implementing sampling to reduce unnecessary chatter help control data volume and costs. Intelligent routing rules enhance efficiency by directing specific data types to low-cost storage solutions.
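The sketch below illustrates these controls under assumed field names: repetitive health-check chatter is dropped, verbose info logs are sampled, and the remaining events are routed by severity, with errors going to the analytics tool and everything else to low-cost storage. The destination names are placeholders, not a specific product's configuration.

    import random

    def should_keep(event, info_sample_rate=0.1):
        if event.get("path") == "/healthz":
            return False                    # low-value, repetitive chatter
        if event.get("level") == "info":
            return random.random() < info_sample_rate   # sample the noise
        return True

    def route(event):
        if event.get("level") == "error":
            return "analytics_backend"      # high-value, full fidelity
        return "archive_bucket"             # cheap storage for the rest

    for event in [{"level": "error", "msg": "timeout"},
                  {"level": "info", "path": "/healthz", "msg": "ping"}]:
        if should_keep(event):
            print(route(event), event)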

Key aspects of optimization include reducing redundant logs, converting individual event log messages into metrics, processing traces to streamline downstream analysis, and making informed decisions about what data to drop or sample. Additionally, it is essential to determine when to rehydrate data and adjust sampling rates based on the operational context, whether in a normal state or a heightened threat environment.
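For example, converting event logs into metrics can be as simple as rolling individual request logs up into per-interval counts and latency aggregates before they leave the pipeline, as in the hypothetical sketch below (field names are assumptions).

    from collections import defaultdict

    def logs_to_metrics(events):
        # Aggregate one window of request logs per (service, status) pair.
        counts = defaultdict(int)
        latency_sums = defaultdict(float)
        for event in events:
            key = (event["service"], event["status"])
            counts[key] += 1
            latency_sums[key] += event["latency_ms"]
        return [
            {"service": service, "status": status, "count": count,
             "avg_latency_ms": latency_sums[(service, status)] / count}
            for (service, status), count in counts.items()
        ]

    events = [
        {"service": "checkout", "status": 200, "latency_ms": 35},
        {"service": "checkout", "status": 200, "latency_ms": 41},
        {"service": "checkout", "status": 500, "latency_ms": 210},
    ]
    print(logs_to_metrics(events))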

The optimization phase may induce anxiety as teams fear over-optimizing and overlooking crucial elements. Acknowledging the human element in decision-making and risk management becomes integral to navigating this phase successfully.

Responding to Changing Conditions

To promptly address incidents and adapt to changing conditions, SRE teams need agile telemetry pipelines that deliver the right data at the right time.

For example, a change in context might require adjustments to the collected data. To respond effectively, SRE teams need the ability to switch between normal (reduced data volume) and incident (full-fidelity data) modes, such as suppressing verbose debug logs during normal operations and enabling them when an incident begins. This involves using state routers and handlers to dynamically alter the data flow based on the system's current state. Whether triggered through predefined thresholds or an API call from an observability tool, the seamless transition between states enables the telemetry pipeline to capture and process the necessary data during critical moments.
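A minimal sketch of such a state handler, with hypothetical names, might look like the following: in normal mode, debug events are suppressed to keep volume down, and flipping the state (for example, from an alert threshold or an API call) lets full-fidelity data through.

    class StateRouter:
        def __init__(self):
            self.mode = "normal"

        def set_mode(self, mode):
            # In practice this could be triggered by a threshold breach or an
            # API call from an observability tool.
            self.mode = mode

        def handle(self, event):
            if self.mode == "normal" and event.get("level") == "debug":
                return None               # suppress debug noise day to day
            return event                  # incident mode: pass everything

    router = StateRouter()
    print(router.handle({"level": "debug", "msg": "retrying"}))  # None
    router.set_mode("incident")
    print(router.handle({"level": "debug", "msg": "retrying"}))  # passes through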

Another important aspect of the responding phase is contextualizing data over time. This temporal understanding helps teams analyze the events leading up to an incident, providing a comprehensive view of the system. Incorporating event-driven triggers, whether internal or integrated with external tools, adds another layer of intelligence to the system. The live nature of data in motion requires continuous updates and adaptability. Implementing alerting mechanisms helps teams stay aware of changes in data patterns, ensuring a prompt response to evolving situations.
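One simple form of such a trigger, sketched below with assumed thresholds, watches the error rate over a sliding window of events and fires a callback, which could in turn flip the pipeline into incident mode, whenever the rate crosses a limit.

    from collections import deque

    class ErrorRateTrigger:
        def __init__(self, on_fire, window=100, threshold=0.05):
            self.recent = deque(maxlen=window)   # sliding window of outcomes
            self.on_fire = on_fire
            self.threshold = threshold

        def observe(self, event):
            self.recent.append(1 if event.get("level") == "error" else 0)
            rate = sum(self.recent) / len(self.recent)
            if rate > self.threshold:
                self.on_fire(rate)               # e.g., switch pipeline state

    trigger = ErrorRateTrigger(lambda rate: print(f"alert: error rate {rate:.0%}"))
    for _ in range(3):
        trigger.observe({"level": "error", "msg": "upstream timeout"})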

Controlling Telemetry Data with Confidence

Telemetry pipelines are crucial in observability, allowing teams to understand, optimize, and respond to evolving data landscapes. By gaining a clear understanding of telemetry data, optimizing for efficiency, and building responsiveness, teams can trust their data and navigate the complexities of modern system management with agility and intelligence. With that confidence and control come new opportunities to process data in the stream, but that's a topic for another day.

About the author:

Tucker Callaway is the CEO of Mezmo. He has more than 20 years of experience in enterprise software, with an emphasis on developer and DevOps tools. He is responsible for driving Mezmo's growth across all revenue streams and creating the foundation for future revenue streams and go-to-market strategies. He joined Mezmo in January 2020 as president and CRO and took the torch as CEO six months later. Prior to Mezmo, he served as CRO of Sauce Labs and as vice president of worldwide sales at Chef.
