From Row-Wise to Columnar: Speeding Up PySpark UDFs with Arrow and Polars

Aimilios Tsouvelekakis

Track: Data Handling & Data Engineering
Python Skill: Intermediate
Domain Expertise: Intermediate

Objective: Demonstrate how to accelerate UDF-heavy PySpark workloads by switching from row-wise execution to Arrow-backed columnar execution, using Polars for fast, maintainable column and table transformations.

Key Takeaways

  • How Arrow is used in PySpark for batched, columnar data exchange
  • Why Polars helps: a higher-level DataFrame API plus Arrow interoperability that can often reuse Arrow buffers
  • How to design fast column transformations (column in → column out) and fast table transformations (batch/table in → batch/table out)
  • Benchmarks and trade-offs across scalar UDFs, Pandas UDFs, Arrow-native UDFs, and Polars-based Arrow table transforms on real-world examples

Audience

  • Data engineers and data scientists working with PySpark at scale
  • Engineers seeking concrete strategies to optimize Spark pipelines that rely on Python UDFs

Knowledge Expected

  • Familiarity with PySpark DataFrames and UDFs
  • Basic understanding of Spark execution helps but is not required
  • Exposure to Polars/Arrow is not required but is beneficial

Aimilios Tsouvelekakis

Aimilios works as a software engineer at Frontiers Media SA. With a passion for solving technical challenges and a commitment to sharing his knowledge across different areas of computer engineering, including building and optimizing ETL pipelines, improving in-house tooling, and contributing to architectural decisions, he makes a valuable contribution to his team's objectives. Prior to joining Frontiers, he worked as a DevOps engineer at CERN, where he contributed to projects related to cloud computing, disaster recovery, automation, observability, and databases. He holds an MEng in Electrical and Computer Engineering from the National Technical University of Athens.