Multimodal learning, the study of systems that combine vision, language, audio, and other sensory inputs, has moved from a niche research topic to a central paradigm in modern machine learning. Today’s most influential models no longer operate on a single modality but instead learn rich representations by combining language with images, video, and sound. This shift has fundamentally changed how we build, train, and evaluate machine learning systems. Python has played a decisive role in this transformation. Acting as a unifying layer across modalities, Python has enabled researchers and practitioners to seamlessly combine computer vision, natural language processing, and speech within a single ecosystem. Python-based frameworks lowered the barriers between research communities and accelerated the rise of large-scale, weakly supervised, and foundation models. However, this success has also introduced new challenges. The ease of experimentation masks growing issues around scalability, reproducibility, and evaluation, and multimodal systems increasingly depend on complex Python-based stacks whose abstractions can obscure underlying assumptions and costs. This keynote will reflect on the current state of multimodal learning, examine how Python shaped its trajectory, and critically discuss the technical and conceptual challenges that lie ahead. The aim is to provide a perspective on where machine learning in general, and multimodal learning in particular, is succeeding, where it is struggling, and what role the Python community can play in shaping its next phase.