Ship Data with Confidence: Declarative Validation for PySpark & Pandas

Ryan Sequeira

Data Handling & Data Engineering
Python Skill: Intermediate
Domain Expertise: Intermediate

This session introduces a practical, open-source solution to a critical challenge facing data engineers and data scientists: proactively guaranteeing data quality. In fast-paced development cycles, data pipelines are increasingly complex and depend on numerous upstream sources, which raises the risk of data quality issues that can cause production failures. Monitoring and alerting systems are essential for flagging these failures, but they are fundamentally reactive: their value depends entirely on the quality and coverage of the underlying validation logic that engineers must build and maintain. The real goal is to shift from reactive clean-up to proactive prevention. This talk demonstrates a more effective approach: stopping bad data before it reaches production by embedding clear, declarative validation directly into your data pipelines. This gives immediate visibility into errors, so you can catch and fix data quality issues at the earliest possible stage of development.

dataframe-expectations is an attempt to address this problem: a lightweight, open-source Python library for declarative data validation in both PySpark and Pandas. This session explores the key design choices behind its implementation and architecture. Chief among them is its small footprint, which keeps the library from slowing CI/CD runs or bloating container images and makes it suitable for data pipelines, unit tests, and end-to-end tests alike. Through examples, we will walk through its fluent, chainable API and its extensive list of reusable, parameterized expectations. We will then dive into advanced features: decorator-based validation that integrates quality checks into your existing code, and a flexible tag-based filtering system that lets you decide at runtime which expectations to run.
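
To make the three ideas in this paragraph concrete (a chainable API, decorator-based validation, and tag-based filtering), here is a minimal sketch written against plain Pandas. It is not the dataframe-expectations API: every class and method name below (Suite, expect_not_null, expect_min_value, validates_output, only_tags) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from functools import wraps
from typing import Callable

import pandas as pd


@dataclass
class Expectation:
    """One named check; `check` returns True when the DataFrame passes."""
    name: str
    check: Callable[[pd.DataFrame], bool]
    tags: frozenset = frozenset()


class ValidationError(Exception):
    """Raised when one or more expectations fail."""


class Suite:
    """Hypothetical fluent builder: each expect_* method returns self."""

    def __init__(self):
        self._expectations = []

    def expect_not_null(self, column, tags=()):
        self._expectations.append(Expectation(
            f"{column} has no nulls",
            lambda df: bool(df[column].notna().all()),
            frozenset(tags),
        ))
        return self

    def expect_min_value(self, column, minimum, tags=()):
        self._expectations.append(Expectation(
            f"{column} >= {minimum}",
            lambda df: bool((df[column] >= minimum).all()),
            frozenset(tags),
        ))
        return self

    def failures(self, df, only_tags=None):
        """Names of failed expectations, optionally restricted by tag."""
        wanted = set(only_tags) if only_tags else None
        selected = [e for e in self._expectations
                    if wanted is None or e.tags & wanted]
        return [e.name for e in selected if not e.check(df)]

    def validate(self, df, only_tags=None):
        """Return df unchanged, or raise listing every failed expectation."""
        failed = self.failures(df, only_tags)
        if failed:
            raise ValidationError("; ".join(failed))
        return df

    def validates_output(self, only_tags=None):
        """Decorator that validates the DataFrame a function returns."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                return self.validate(func(*args, **kwargs), only_tags)
            return wrapper
        return decorator


# Declare the rules once, as a chain.
suite = (
    Suite()
    .expect_not_null("user_id", tags=["critical"])
    .expect_min_value("price", 0, tags=["business"])
)

bad = pd.DataFrame({"user_id": [1, None, 3], "price": [9.5, 12.0, -1.0]})
print(suite.failures(bad))                          # both checks fail
print(suite.failures(bad, only_tags=["critical"]))  # only the null check runs


@suite.validates_output(only_tags=["critical"])
def load_orders():
    # Stand-in for a real pipeline step.
    return pd.DataFrame({"user_id": [1, 2], "price": [5.0, 7.5]})


orders = load_orders()  # passes validation and returns the DataFrame
```

In this sketch the rules live in one place, separate from the transformation code, and the tag filter lets the same suite run a cheap subset in unit tests and the full set in an end-to-end run.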

Attendees will leave with a clear, actionable strategy for integrating declarative data quality checks into their pipelines, understanding how a simple, extensible tool can dramatically increase the reliability of their data products and, ultimately, their development velocity.

Ryan Sequeira

As a Data Scientist on the Traveler Data Products team at GetYourGuide, I have spent the last four years developing and refining the ranking and relevance systems that power one of the world's leading travel experience platforms. My work focuses on enhancing the traveler's journey, helping millions discover and book their ideal experiences through data-driven solutions.

My path to data science is built on a foundation of diverse technical experience. I began my career in 2013 as a backend developer in Pune, India, before pursuing a Master's in Computer Science at the Indian Institute of Technology Patna, where I specialised in Network Science. After my studies, I stayed at the institute for two years as a research assistant, deepening my expertise in Network Science, which paved my way into data science.

In 2021, I relocated to Berlin to join GetYourGuide, where I apply my software engineering background and machine learning skills to solve real-world problems at scale. This blend of backend development experience, academic research, and industry application gives me a unique perspective on building robust, production-ready data solutions.