Rediscovering single-node processing: When does it make sense to move from Spark to Polars?

Jonas Böer

Data Handling & Data Engineering
Python Skill: Novice
Domain Expertise: Intermediate

Apache Spark is the industry standard for big data processing, and rightfully so. But for many data processing applications, a more lightweight solution will work just as well while avoiding Spark's compute and configuration overhead. Polars offers such a solution, with a fast single-node processing engine and a syntax that will pose no problems for experienced Spark developers. I will give a short comparison of Spark and Polars, covering their similarities and differences, and show an implementation of a typical ETL and feature engineering task in both. I will then compare the deployment, performance and cost of the two. While I will give my own opinion on the topic, I hope to enable you to make an informed decision on when to use Polars and when to use Spark.

Jonas Böer

Data Engineer at inovex since 2022, full-time software engineer since 2018, coder for as long as I can remember. With my experience working on data warehouses and machine learning applications from small-scale tests up to international deployments, I enjoy eliminating bugs and bottlenecks, getting cool systems online and writing beautiful code. Still proud of the time when a colleague complained that, because of me, deploying to production had become too easy and was no longer a thrilling adventure.