Objective
Demonstrate how to accelerate UDF-heavy PySpark workloads by switching from row-wise execution to Arrow-backed columnar execution, using Polars for fast, maintainable column and table transformations.
Key Takeaways
- How Arrow is used in PySpark for batched, columnar data exchange
- Why Polars helps: a higher-level DataFrame API plus Arrow interoperability that can often reuse Arrow buffers without copying
- How to design fast column transformations (column in → column out) and fast table transformations (batch/table in → batch/table out)
- Benchmarks and tradeoffs across scalar UDFs, Pandas UDFs, Arrow-native UDFs, and Polars-based Arrow table transforms on real-world examples
Audience
- Data engineers and data scientists working with PySpark at scale
- Engineers seeking concrete strategies to optimize Spark pipelines that rely on Python UDFs
Knowledge Expected
- Familiarity with PySpark DataFrames and UDFs
- Basic understanding of Spark execution helps but is not required
- Exposure to Polars/Arrow is not required but is beneficial
Aimilios Tsouvelekakis
Aimilios works as a software engineer at Frontiers Media SA. With a passion for solving technical challenges and a commitment to sharing his knowledge across different areas of computer engineering, including ETL pipelines and their optimization, improving in-house tooling, and contributing to architectural decisions, he makes a valuable contribution to his team's objectives. Prior to joining Frontiers, he worked as a DevOps engineer at CERN, where he contributed to projects involving cloud computing, disaster recovery, automation, observability, and databases. He holds an MEng in Electrical and Computer Engineering from the National Technical University of Athens.