Objective
Demonstrate how to accelerate UDF-heavy PySpark workloads by switching from row-wise execution to Arrow-backed columnar execution, using Polars for fast, maintainable column and table transformations.
Key Takeaways
- How Arrow is used in PySpark for batched, columnar data exchange
- Why Polars helps: a higher-level DataFrame API plus Arrow interoperability that can often reuse Arrow buffers without copying
- How to design fast column transformations (column in → column out) and fast table transformations (batch/table in → batch/table out)
- Benchmarks and tradeoffs across scalar UDFs, Pandas UDFs, Arrow-native UDFs, and Polars-based Arrow table transforms on real-world examples
Audience
- Data engineers and data scientists working with PySpark at scale
- Engineers seeking concrete strategies to optimize Spark pipelines that rely on Python UDFs
Knowledge Expected
- Familiarity with PySpark DataFrames and UDFs
- Basic understanding of Spark execution helps but is not required
- Exposure to Polars/Arrow is not required but is beneficial
Aimilios Tsouvelekakis
Aimilios works as a software engineer at Frontiers Media SA. With a passion for solving technical challenges and a commitment to sharing his knowledge across different areas of computer engineering, including ETL pipelines and their optimization, improving in-house tooling, and contributing to architectural decisions, he makes a valuable contribution to his team's objectives. Prior to joining Frontiers, he worked as a DevOps engineer at CERN, where he contributed to projects involving cloud computing, disaster recovery, automation, observability, and databases. He holds an MEng in Electrical and Computer Engineering from the National Technical University of Athens.