Scaling Data Processing for Training Workloads at DeepL Research with Rust

Jonas Dedden, Johanna Goergen

Rust
Python Skill: Intermediate
Domain Expertise: Intermediate

We set out to replace an inefficient internal file format with an industry standard - a seemingly straightforward task. What we got instead was a descent into memory leak hell.

This talk walks you through our journey of scaling DeepL's data preprocessing and model training pipelines to handle petabyte-scale corpora. When open-source C++-based Python libraries proved too unstable and memory-inefficient, we invested time and resources into developing our own Rust-based tooling and, compared to our previous internal file format, reduced memory load by a factor of 10 and the latency to first byte read by a factor of 50.

What we'll cover:

• Why Rust's memory safety guarantees matter in practice: a direct comparison of our results using C++-based vs. Rust-based implementations of data processing libraries.

• The Rust ecosystem advantage for Python interop: while C++ offers a fragmented landscape of build systems and tooling choices, Rust provides a canonical path with cargo, maturin, and PyO3, offering a clean interface for everything from GIL management to readable, zero-copy conversions between Rust and Python objects.

• Rust's surprisingly friendly features: despite its reputation for a steep learning curve, Rust offers language features that make it genuinely pleasant to work with, even for beginners coming from a Python background: enums, pattern matching, error handling with Result, and cargo's canonical, ergonomic tooling.

• Rust's impact on the Arrow ecosystem and on data engineering with Python in general: beyond the well-known impact that Rust-based data processing libraries like Polars, Daft, and DataFusion are having on the engineering ecosystem, we will show how arrow-rs, the Rust implementation of Arrow, is having a growing impact, expanding the data engineering toolkit by powering an increasing number of excellent, contributor-friendly processing and introspection tools built in Rust.
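To give a flavour of the Python interop path mentioned above, here is a minimal sketch of a PyO3 extension function. The module and function names are invented for illustration, and the snippet assumes the `pyo3` crate (typically built with `maturin develop`); it is not code from our production tooling.

```rust
// Hypothetical module sketch; requires the `pyo3` crate (built via maturin).
use pyo3::prelude::*;

/// Sum the byte lengths of a list of Python `bytes` objects.
/// Accepting `&[u8]` lets PyO3 borrow each buffer instead of copying it
/// into Rust-owned memory.
#[pyfunction]
fn total_bytes(py: Python<'_>, records: Vec<&[u8]>) -> usize {
    // Release the GIL while doing the CPU-bound work, so other
    // Python threads can keep running in the meantime.
    py.allow_threads(|| records.iter().map(|r| r.len()).sum())
}

/// The module importable from Python as `demo`.
#[pymodule]
fn demo(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(total_bytes, m)?)?;
    Ok(())
}
```

From Python this would be called as `demo.total_bytes([b"hello", b"world"])`. The point of the sketch is the ergonomics: argument conversion, GIL handling, and module registration are each a single, readable construct.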
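And as a taste of the "surprisingly friendly features" bullet, here is a small, self-contained sketch of enums, pattern matching, and `Result`-based error handling. The record format (`"text:…"` / `"count:…"`) is invented purely for illustration:

```rust
// Illustrative only: a tiny line parser showing enums, `match`, and `Result`.
#[derive(Debug, PartialEq)]
enum Record {
    Text(String),
    Count(u64),
}

#[derive(Debug, PartialEq)]
enum ParseError {
    NoSeparator,
    UnknownKind(String),
    BadCount(String),
}

/// Parse lines like "text:hello" or "count:42".
fn parse(line: &str) -> Result<Record, ParseError> {
    // The compiler checks that this `match` covers every case, so no
    // input shape can slip through unhandled.
    match line.split_once(':') {
        Some(("text", rest)) => Ok(Record::Text(rest.to_string())),
        Some(("count", rest)) => rest
            .parse::<u64>()
            .map(Record::Count)
            .map_err(|_| ParseError::BadCount(rest.to_string())),
        Some((kind, _)) => Err(ParseError::UnknownKind(kind.to_string())),
        None => Err(ParseError::NoSeparator),
    }
}
```

Errors are ordinary values here: a caller decides with one `match` (or the `?` operator) whether to recover, skip the record, or abort, which is exactly the kind of explicitness that pays off in long-running data pipelines.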

Jonas Dedden

Hi, I'm Jonas Dedden, Staff Research Data Engineer at DeepL SE, Germany. Johanna Goergen and I work on the Research Data Platform team of DeepL Research, where we are responsible for the on-prem and cloud-based k8s compute infrastructure for petabyte-scale data processing pipelines. We provide the platform that our Research Data Engineers use to collect and preprocess all the data needed to train the DeepL foundational language models that power our production services.

Johanna Goergen