This talk is a technical deep dive into the "physics" of distributed machine learning inference. While high-level APIs promise seamless integration between Spark (JVM) and Python, the underlying data transfer mechanisms often become the primary bottleneck for high-throughput systems. We start by reality-checking the "Zero-Copy" promise of Apache Arrow in a PySpark context, identifying exactly where the abstraction leaks and why "Zero-Copy" is not actually free.
We will move beyond the code to look at the execution traces and memory profiles of a 6-billion-row inference job. Using flame graphs, we will visualize the CPU time spent in serialization versus actual prediction across three scenarios: a standard pandas_udf, an optimized mapInPandas, and SynapseML. This forensic analysis reveals the hidden costs of pickling and the impact of JNI context switching when bypassing the Python Global Interpreter Lock (GIL).
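To make the mapInPandas scenario concrete, here is a minimal sketch of the pattern: the function receives an iterator of Arrow-backed pandas batches, so expensive setup (model loading) happens once per partition rather than once per batch. The trivial doubling "model" is a stand-in assumption for a real predictor:

```python
from typing import Iterator

import pandas as pd

def predict_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Placeholder for loading a real model; done once per partition,
    # amortized across every batch the iterator yields.
    model = lambda x: x * 2.0
    for batch in batches:
        yield batch.assign(prediction=model(batch["feature"]))

# In PySpark this function would be wired up as:
#   df.mapInPandas(predict_batches, schema="feature double, prediction double")

# Local demonstration with plain pandas batches:
batches = iter([pd.DataFrame({"feature": [1.0, 2.0]}),
                pd.DataFrame({"feature": [3.0]})])
result = pd.concat(predict_batches(batches), ignore_index=True)
```

The same logic as a scalar pandas_udf would re-enter the function once per batch, which is where the serialization and setup overhead visible in the flame graphs accumulates.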
The session concludes with a focus on tuning for throughput. We will explore the delicate balance of configuring spark.sql.execution.arrow.maxRecordsPerBatch, demonstrating how to find the "Goldilocks" zone that maximizes CPU saturation without causing JVM off-heap memory crashes. Attendees will gain a deep understanding of the memory hierarchy involved in distributed inference and practical strategies for profiling serialization overhead in production.
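The knob itself is a one-line session configuration; the value below is Spark's documented default and only a starting point for tuning, not a recommendation:

```python
# Assumes an existing SparkSession named `spark`.
# Smaller batches: more JVM<->Python round trips, lower CPU saturation.
# Larger batches: bigger Arrow buffers in JVM off-heap memory, OOM risk.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")  # Spark default
```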
Key Takeaways:
Internals Knowledge: Understand exactly how data moves from JVM heap to Python worker memory.
Tuning Skills: Learn how to configure Apache Arrow batch sizes to optimize CPU saturation.
Profiling Techniques: Discover methods for profiling distributed Spark jobs to visualize serialization overhead.