This talk explores how observability can be applied to data pipelines to improve reliability, data quality, and confidence in complex data systems.
The talk begins with an introduction to observability in the context of data engineering. It introduces the three pillars as treated in this talk (metrics, alarms, and logs) and discusses why observability is particularly important for data pipelines, where failures are often silent and correctness issues may surface only through stakeholder complaints. Examples include detecting issues early, debugging problems that span multiple systems, and gaining a better understanding of how data pipelines behave under changing load and requirements.
The first section focuses on metrics. It demonstrates how straightforward it can be to instrument data pipelines with basic metrics using Python. The talk then discusses which metrics are worth monitoring, adapting established concepts such as the four golden signals to data engineering use cases. Topics include white-box versus black-box monitoring; latency, throughput, error rates, and data freshness; and using metrics to identify long-term trends such as performance regressions after code changes. A concrete example based on a near–real-time event processing pipeline illustrates how fine-grained metrics can reveal systematic failures for specific event types.
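As a minimal sketch of what such instrumentation might look like, the snippet below tracks counts, errors, and latencies per event type using only the standard library. All names here (`PipelineMetrics`, `handle`, the event shapes) are illustrative assumptions, not the talk's actual code; a production pipeline would typically export these values to a metrics backend such as Prometheus instead of keeping them in memory.

```python
import time
from collections import defaultdict

class PipelineMetrics:
    """Minimal in-process metrics: per-event-type counts, errors, and latencies."""

    def __init__(self):
        self.processed = defaultdict(int)
        self.errors = defaultdict(int)
        self.latencies = defaultdict(list)

    def record(self, event_type, seconds, error=False):
        self.processed[event_type] += 1
        if error:
            self.errors[event_type] += 1
        self.latencies[event_type].append(seconds)

    def error_rate(self, event_type):
        total = self.processed[event_type]
        return self.errors[event_type] / total if total else 0.0

def handle(event):
    # Hypothetical processing step: "purchase" events require an "amount" field.
    if event["type"] == "purchase" and "amount" not in event:
        raise ValueError("missing amount")

metrics = PipelineMetrics()

def process(event):
    start = time.perf_counter()
    try:
        handle(event)
        ok = True
    except Exception:
        ok = False
    metrics.record(event["type"], time.perf_counter() - start, error=not ok)

# Simulated traffic: "purchase" events are systematically broken.
for event in [
    {"type": "click"},
    {"type": "purchase", "amount": 10},
    {"type": "purchase"},
    {"type": "purchase"},
]:
    process(event)
```

Because the metrics are keyed by event type, an aggregate error rate that looks acceptable can be broken down to show that one specific event type fails consistently, which is exactly the kind of systematic failure the example in the talk illustrates.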
The second section focuses on alerting. It addresses the challenge that engineers rarely have time to continuously inspect dashboards and therefore rely on alarms to surface important issues. The talk outlines what makes a good alarm, emphasizing that alarms should be actionable and reliable and should carry enough context to start an investigation. A scenario involving a complex system with excessive, noisy alarms is used to illustrate alarm fatigue and the normalization of deviance. The section shows how to reduce noise by identifying critical system components, removing low-value alarms, and gradually refining alerting based on a clear understanding of which failures are unacceptable.
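One common way to make alarms less noisy, sketched below under assumed names (`should_alert`, `alarm_message` and their parameters are illustrative, not from the talk), is to fire only on a sustained threshold breach rather than a single spike, and to attach investigation context directly to the alarm text:

```python
def should_alert(error_rates, threshold=0.05, sustained=3):
    """Fire only when the error rate has exceeded the threshold for
    `sustained` consecutive evaluation intervals, suppressing one-off spikes."""
    recent = error_rates[-sustained:]
    return len(recent) == sustained and all(r > threshold for r in recent)

def alarm_message(pipeline, error_rate, dashboard_url, runbook_url):
    """An actionable alarm says what broke, where to look, and what to do next."""
    return (f"[{pipeline}] error rate {error_rate:.1%} above threshold. "
            f"Dashboard: {dashboard_url} Runbook: {runbook_url}")
```

The sustained-breach check trades a few minutes of detection latency for far fewer false positives, which is often the right trade-off for batch and streaming pipelines where brief transient errors are normal.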
The final technical section covers logging. It discusses why logs are often difficult to work with in data pipelines, as they may contain a mixture of critical errors, informational messages, and low-level framework output. The talk introduces structured logging as a way to add context and make logs easier to search, filter, and aggregate. Examples include monitoring the distribution of log levels to uncover hidden issues, tracing the processing of individual records or users through a pipeline, and using centralized logging to identify dependencies between systems that are otherwise hard to detect.
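Structured logging of the kind described above can be sketched with the standard library alone: a custom `logging.Formatter` emits each record as a JSON line, and per-record context (here the assumed, illustrative fields `record_id` and `stage`) is attached via the `extra` parameter so that logs can later be filtered and aggregated by those fields.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Context fields attached via logging's `extra=` mechanism;
            # the field names are illustrative, not a fixed schema.
            "record_id": getattr(record, "record_id", None),
            "stage": getattr(record, "stage", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("pipeline")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("record transformed",
            extra={"record_id": "abc-123", "stage": "transform"})
```

With logs in this shape, tracing a single record across stages is a query on `record_id`, and counting lines per `level` gives the log-level distribution the talk uses to uncover hidden issues.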
The talk concludes by emphasizing that observability is not a one-time effort but an evolving practice. Attendees are encouraged to start with small, high-impact improvements and adapt their observability setup as their data pipelines and organizational needs grow.