Multimodal learning, the study of systems that combine vision, language, audio, and other sensory inputs, has moved from a niche research topic to a central paradigm in modern machine learning. Today’s most influential models no longer operate on a single modality but instead learn rich representations by combining language with images, video, and sound. This shift has fundamentally changed how we build, train, and evaluate machine learning systems. Python has played a decisive role in this transformation. Acting as a unifying layer across modalities, Python has enabled researchers and practitioners to seamlessly combine computer vision, natural language processing, and speech within a single ecosystem. Python-based frameworks lowered the barriers between research communities and accelerated the rise of large-scale, weakly supervised, and foundation models. However, this success has also introduced new challenges. The ease of experimentation masks growing issues around scalability, reproducibility, and evaluation, and multimodal systems increasingly depend on complex Python-based stacks whose abstractions can obscure underlying assumptions and costs. This keynote will reflect on the current state of multimodal learning, examine how Python shaped its trajectory, and critically discuss the technical and conceptual challenges that lie ahead. The aim is to provide a perspective on where machine learning in general, and multimodal learning in particular, is succeeding, where it is struggling, and what role the Python community can play in shaping its next phase.