This session introduces a practical, open-source solution to a critical challenge facing data engineers and scientists: how to proactively guarantee data quality. In today's fast-paced development cycles, data pipelines are increasingly complex and reliant on numerous upstream sources, raising the risk of data quality issues that can cause production failures. While monitoring and alerting systems are essential for flagging these failures, they are fundamentally reactive; their value depends entirely on the quality and coverage of the underlying validation logic that engineers must build and maintain. The true goal is to shift from reactive clean-up to proactive prevention. This talk demonstrates a more effective approach: stopping bad data from ever reaching production by embedding clear, declarative validation directly into your data pipelines. This provides immediate visibility into errors, allowing you to catch and fix data quality issues at the earliest possible stage of development.
dataframe-expectations addresses this problem with a lightweight, open-source Python library designed for declarative data validation in both PySpark and Pandas. This session explores the key design choices behind its implementation and architecture. Its lightweight footprint keeps the library from becoming a bottleneck: it neither inflates CI/CD run times nor bloats container image sizes, making it well suited to data pipelines, unit tests, and end-to-end tests alike. Through examples, we will walk through its fluent, chainable API and showcase its extensive list of reusable, parameterized expectations. We will then dive into advanced features, including powerful decorator-based validation that seamlessly integrates quality checks into your existing code, and a flexible tag-based filtering system that lets you decide dynamically which expectations to run at runtime.
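To make the ideas concrete, here is a minimal, self-contained sketch of the fluent, chainable validation pattern and the decorator-based integration described above. The class, method names (`ExpectationSuite`, `expect_column_not_null`, `validate_with`, etc.) are hypothetical illustrations of the general pattern, not the actual dataframe-expectations API; rows are plain dicts here to keep the example framework-agnostic.

```python
class ExpectationSuite:
    """Collects named, parameterized expectations and runs them over rows.

    Hypothetical sketch of the declarative-validation pattern; not the
    real dataframe-expectations API.
    """

    def __init__(self):
        self._checks = []  # list of (description, predicate) pairs

    def expect_column_not_null(self, column):
        self._checks.append(
            (f"{column} is not null", lambda row: row.get(column) is not None)
        )
        return self  # returning self makes the API chainable

    def expect_column_between(self, column, low, high):
        self._checks.append(
            (f"{low} <= {column} <= {high}",
             lambda row: row.get(column) is not None and low <= row[column] <= high)
        )
        return self

    def validate(self, rows):
        """Return a list of (row_index, failed_expectation) pairs."""
        failures = []
        for i, row in enumerate(rows):
            for description, predicate in self._checks:
                if not predicate(row):
                    failures.append((i, description))
        return failures


def validate_with(suite):
    """Decorator that validates a function's returned rows against a suite."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            rows = fn(*args, **kwargs)
            failures = suite.validate(rows)
            if failures:
                raise ValueError(f"Data quality check failed: {failures}")
            return rows
        return wrapper
    return decorator


suite = (
    ExpectationSuite()
    .expect_column_not_null("id")
    .expect_column_between("age", 0, 120)
)

rows = [{"id": 1, "age": 34}, {"id": None, "age": 150}]
print(suite.validate(rows))  # → [(1, 'id is not null'), (1, '0 <= age <= 120')]
```

The key design choice illustrated here is that each expectation method returns the suite itself, so checks read as a single declarative statement, and the decorator moves validation to the boundary of the function that produces the data, which is where bad data is cheapest to catch.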
Attendees will leave with a clear, actionable strategy for integrating declarative data quality checks into their pipelines, understanding how a simple, extensible tool can dramatically increase the reliability of their data products and, ultimately, their development velocity.