Fight your garbage data: implementation of a pythonic data quality monitoring framework in PySpark

Rostislaw Krassow, Joshua Finger

Data Handling & Data Engineering
Python Skill Intermediate
Domain Expertise Intermediate

In this talk we share our experience from a project implemented in Q3 2025. We start with the motivation for the project, the stakeholders involved, and their needs. We then define the criteria for a successful data quality monitoring solution and share findings from our evaluation of existing frameworks, including why popular options such as Great Expectations and SODA did not meet our requirements.

Next, we demonstrate our implementation based on DQX, a lightweight, open-source Python library designed for traceable, row-level data quality checks before and after data is persisted. DQX, developed and maintained by Databricks Labs, lets developers concentrate on the core implementation while business users maintain the business rules in YAML files. Furthermore, DQX's seamless integration with PySpark enables efficient and cost-effective quality monitoring within our IoT data lake.
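To make the idea of declarative, row-level rules concrete, the following sketch shows the kind of rule definition business users maintain and how an engine evaluates it per record. This is an illustrative simplification, not the actual DQX API: the field names (`name`, `criticality`, `check`, `function`, `arguments`) and the check functions are assumptions modelled on the general shape of such rule files, and real DQX executes checks as vectorised PySpark expressions rather than per-row Python.

```python
# Illustrative sketch only: a simplified rule format and a tiny per-row
# evaluator. In production the rules would live in a YAML file owned by
# business users, e.g.:
#
# - name: temperature_in_range
#   criticality: error
#   check:
#     function: is_in_range
#     arguments: {column: temperature_c, min: -40, max: 300}

rules = [
    {"name": "temperature_in_range", "criticality": "error",
     "check": {"function": "is_in_range",
               "arguments": {"column": "temperature_c", "min": -40, "max": 300}}},
    {"name": "device_id_not_null", "criticality": "warn",
     "check": {"function": "is_not_null",
               "arguments": {"column": "device_id"}}},
]

# Minimal stand-ins for built-in checks; DQX ships its own library of these.
CHECKS = {
    "is_in_range": lambda row, column, min, max: (
        row.get(column) is not None and min <= row[column] <= max),
    "is_not_null": lambda row, column: row.get(column) is not None,
}

def apply_rules(row, rules):
    """Return (rule name, criticality) for every rule this record fails,
    giving the row-level traceability the talk describes."""
    failed = []
    for rule in rules:
        check = rule["check"]
        if not CHECKS[check["function"]](row, **check["arguments"]):
            failed.append((rule["name"], rule["criticality"]))
    return failed

record = {"device_id": None, "temperature_c": 425.0}
print(apply_rules(record, rules))
# → [('temperature_in_range', 'error'), ('device_id_not_null', 'warn')]
```

Keeping the rules as data rather than code is what allows non-developers to maintain them: adding a new quality check becomes a YAML edit instead of a deployment.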

Finally, we move beyond the code to the organisational reality. We discuss how we embedded the data quality monitor into the organisation and share our views on the hard questions: Who is responsible for maintaining rules? Who monitors the results?

Talk outline

  • Motivation for the project

    • Initial situation and objectives
  • Framework evaluation

    • Evaluation criteria for a successful data quality monitoring solution

    • Comparison of available frameworks

  • Our implementation with DQX

    • How to use built-in data quality checks

    • How to add custom data quality checks

    • Automated rule generation with DQX Profiler

    • Output and visualisation options

    • Python project structure

  • Embedding in organisation

    • Rule maintenance

    • How to communicate data quality issues

  • Summary

Key takeaways

  • Understanding of the most important criteria for choosing a data quality monitoring framework, from the perspective of a data engineer and an architect

  • Understanding of the DQX framework

  • Ideas for integrating data quality monitoring into organisations

Rostislaw Krassow

Rostislaw, a data architect at RATIONAL AG, specializes in distributed databases, the Apache Hadoop ecosystem and Azure cloud. He leverages his expertise to maintain the enterprise Data & Analytics platform for IoT data, where his daily work involves reconciling diverse stakeholder perspectives to deliver sustainable solutions.

Joshua Finger

Joshua is a data engineer at inovex GmbH with a full-stack software engineering background and a master's degree in computer science.