How to compare apples with oranges: Proper evaluation of article-level demand forecasts

Stefan Birr, Mones Raslan

Machine Learning & Deep Learning & Statistics
Python Skill: Novice
Domain Expertise: Intermediate

At the pricing department at Zalando, we predict future demand for millions of articles on a daily basis using large-scale machine-learning models. These forecasts are key inputs for the discount decisions taken downstream. As evaluating every forecast on its own is infeasible at this scale and frequency, we created a set of aggregated metrics that help us make informed statements about the performance of our models. These metrics serve two purposes: we use them to further improve our forecasting models, and our stakeholders use them to make informed decisions.

To handle this volume, we use PySpark to process the data and scale our evaluations across the entire assortment. Evaluating forecast performance is crucial in two different scenarios: when analysing past forecast performance, and when creating and comparing alternative models. In both cases we look at different time ranges and possibly different subsets of the forecasted articles and calculate aggregated performance measures to compare them. We want to answer questions like

  • “Is this forecast performing better in low-discount periods than during sales events?”
  • “Did we make a higher error on highly discounted articles during last week?”
  • “Is this model well-suited to predict high (or low) selling articles?”
  • “Did our model perform well for sneakers during the last voucher event?”
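At a small scale, the per-subset aggregation behind these questions can be sketched with pandas (our production pipeline uses PySpark, whose `groupBy`/`agg` API is analogous; the column names and numbers below are invented for illustration):

```python
import pandas as pd

# Toy forecast-vs-actual data; in production this spans millions of articles.
df = pd.DataFrame({
    "category": ["sneakers", "sneakers", "dresses", "dresses"],
    "discount_band": ["low", "high", "low", "high"],
    "actual":   [100.0, 80.0, 50.0, 40.0],
    "forecast": [110.0, 60.0, 55.0, 30.0],
})

# Per-article error terms.
df["sq_error"] = (df["forecast"] - df["actual"]) ** 2
df["ape"] = (df["forecast"] - df["actual"]).abs() / df["actual"]

# Aggregated metrics per subset of the assortment, e.g. per discount band.
metrics = df.groupby("discount_band").agg(
    mse=("sq_error", "mean"),
    mape=("ape", "mean"),
)
print(metrics)
```

Swapping `"discount_band"` for `"category"` (or a time-range column) yields the slices needed for each of the questions above.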

Evaluating aggregated metrics like a relative mean squared error (MSE) or a mean absolute percentage error (MAPE) over different sets of articles has many pitfalls. Comparing different parts of the assortment leads to an "Apples vs. Oranges" problem that we will elaborate on with examples from our daily work.
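A minimal NumPy example of one such pitfall (with invented numbers): two article groups whose forecasts have the same 10% relative error produce identical MAPEs but MSEs that differ by four orders of magnitude, so comparing raw MSE across groups of different sales scale is misleading:

```python
import numpy as np

# Two article groups with identical *relative* forecast error (10% too high),
# but very different sales scales.
actual_low = np.full(1000, 10.0)      # low-selling articles
actual_high = np.full(1000, 1000.0)   # high-selling articles
forecast_low = actual_low * 1.1
forecast_high = actual_high * 1.1

def mse(forecast, actual):
    return np.mean((forecast - actual) ** 2)

def mape(forecast, actual):
    return np.mean(np.abs(forecast - actual) / actual)

# MAPE agrees across the groups, MSE does not, even though both
# forecasts are "equally good" in relative terms.
print(mape(forecast_low, actual_low), mape(forecast_high, actual_high))
print(mse(forecast_low, actual_low), mse(forecast_high, actual_high))
```

The converse pitfall also exists: MAPE explodes for slow sellers with near-zero actuals, so neither metric alone allows a fair cross-subset comparison.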

To answer the questions above, we developed a set of aggregated metrics that we monitor on a daily basis, using plotly and streamlit for clear, interactive visualization. We will present these metrics and explain how they help with the questions and tasks mentioned above. We will highlight techniques and best practices for drawing meaningful insights from forecast evaluation, and show how meaningful lower bounds for our aggregated metrics allow us to compare apples with oranges.
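One common way to make an aggregated metric comparable across subsets (a sketch of the general idea, not necessarily the exact lower-bound construction from the talk) is to normalize the model's error by that of a naive reference forecast evaluated on the same articles:

```python
import numpy as np

def relative_mse(forecast, actual, baseline):
    """MSE of the model normalized by the MSE of a naive baseline.

    Values below 1 mean the model beats the baseline. Because the
    normalization adapts to each subset's intrinsic difficulty, the
    ratio can be compared across otherwise incomparable article groups.
    """
    model_mse = np.mean((forecast - actual) ** 2)
    baseline_mse = np.mean((baseline - actual) ** 2)
    return model_mse / baseline_mse

# Invented numbers: a flat historical average as the naive baseline.
actual = np.array([120.0, 80.0, 95.0])
baseline = np.array([100.0, 100.0, 100.0])
forecast = np.array([115.0, 85.0, 92.0])

print(relative_mse(forecast, actual, baseline))  # < 1: better than baseline
```

The choice of baseline (last week's sales, a seasonal average, ...) sets the reference point, which is why such relative metrics need a carefully chosen lower bound to be meaningful.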

We also want to share how observations from our monitoring influenced the evolution of our LightGBM and PyTorch models and how it shaped important parts like feature engineering, hyperparameter tuning and the choice of our loss functions. Lastly, we will touch on how to communicate these sometimes very technical numbers to stakeholders so that they can make informed decisions without being overwhelmed by details.

Stefan Birr

Senior Applied Scientist at Zalando, working on developing large-scale forecasting systems. Stefan holds a PhD in Mathematics from Ruhr University Bochum, where his research focused on analyzing dynamic dependencies in time series. Prior to his 3 years at Zalando, he worked for 5 years at E.ON as a Data Scientist, creating algorithms for smart meter analytics and forecasting.

Mones Raslan