We set out to replace an inefficient internal file format with an industry standard - a seemingly straightforward task. What we got instead was a descent into memory leak hell.
This talk will walk you through our journey of scaling DeepL's data preprocessing and model training pipelines to handle petabyte-scale corpora. When open-source, C++-based Python libraries proved too unstable and memory-inefficient, we invested in building our own Rust-based tooling. Compared to our previous internal file format, we reduced memory load by a factor of 10 and the latency until the first byte is read by a factor of 50.
What we'll cover:
• Why Rust's memory safety guarantees matter in practice: We will provide a direct comparison of our results using C++-based vs. Rust-based implementations of data processing libraries.
• The Rust ecosystem advantage for Python interop: While C++ offers a fragmented landscape of build systems and tooling choices, Rust provides a canonical path with cargo, maturin, and PyO3, which together give a clean interface for everything from GIL management to readable, zero-copy conversions between Rust and Python objects.
• Rust's surprisingly friendly features: Despite its reputation for a steep learning curve, Rust offers language features that make it genuinely pleasant to work with, even for beginners coming from a Python background: enums, pattern matching, error handling with Result, and cargo's canonical, ergonomic tooling.
• Rust's impact on the Arrow ecosystem and data engineering with Python in general: Beyond the well-known impact of Rust-based data processing libraries like Polars, Daft, and DataFusion, we will show how arrow-rs, the Rust implementation of Arrow, is expanding the data engineering toolkit by powering a growing number of contributor-friendly processing and introspection tools built in Rust.
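To give a flavor of the "friendly features" mentioned above, here is a minimal sketch of enums, pattern matching, and error handling with Result, using only the standard library. The `Codec` and `ParseError` types and the `parse_codec` function are hypothetical names made up for illustration; they are not taken from DeepL's tooling or from any library.

```rust
use std::fmt;

// Hypothetical example type: an enum whose variants can carry data.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Codec {
    None,
    Lz4,
    Zstd { level: i32 },
}

// Hypothetical error type: each failure mode is its own variant.
#[derive(Debug, PartialEq)]
enum ParseError {
    UnknownCodec(String),
    BadLevel(String),
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::UnknownCodec(name) => write!(f, "unknown codec '{name}'"),
            ParseError::BadLevel(raw) => write!(f, "invalid compression level '{raw}'"),
        }
    }
}

// Parse specs such as "none", "lz4", "zstd", or "zstd:7".
// Failures are ordinary return values, not exceptions.
fn parse_codec(spec: &str) -> Result<Codec, ParseError> {
    match spec.split_once(':') {
        // No ':' in the spec: match on the bare codec name.
        None => match spec {
            "none" => Ok(Codec::None),
            "lz4" => Ok(Codec::Lz4),
            "zstd" => Ok(Codec::Zstd { level: 3 }), // assumed default level
            other => Err(ParseError::UnknownCodec(other.to_string())),
        },
        // "zstd:<level>": `?` returns early if the level is not an integer.
        Some(("zstd", raw)) => {
            let level = raw
                .parse::<i32>()
                .map_err(|_| ParseError::BadLevel(raw.to_string()))?;
            Ok(Codec::Zstd { level })
        }
        Some((other, _)) => Err(ParseError::UnknownCodec(other.to_string())),
    }
}

fn main() {
    for spec in ["zstd:7", "lz4", "zstd:fast", "gzip"] {
        // The compiler checks that both the Ok and Err cases are handled.
        match parse_codec(spec) {
            Ok(codec) => println!("{spec} -> {codec:?}"),
            Err(err) => println!("{spec} -> error: {err}"),
        }
    }
}
```

For readers coming from Python, the point of the sketch is that every failure mode appears in the function signature, and a `match` that forgets a variant is rejected at compile time rather than surfacing at runtime.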