What can millions of real Python code snippets tell us about how the language evolves? And why do the patterns we observe in Python look uncannily similar to patterns found in patents and scientific research — systems that seem to have nothing to do with software?
This talk begins with a practical challenge: extracting structured signals from the chaotic world of Stack Overflow. We built a pipeline that scanned posts for Python code blocks, identified import statements, normalised package names, filtered noise, and reconstructed a time-ordered stream of collections, each composed of the packages used in a single snippet. From this, we derived two simple indicators of innovation:
• new packages appearing for the first time, and
• new package pairs appearing together for the first time.
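The two indicators above can be sketched in a few lines. This is a minimal illustration, not the pipeline used for the talk: it assumes each snippet is already extracted as a string and sorted by time, treats the top-level module name as the package name, and silently drops snippets that do not parse. The function names `top_level_packages` and `novelty_stream` are hypothetical.

```python
import ast
from itertools import combinations


def top_level_packages(snippet: str) -> set:
    """Return the top-level package names imported in a snippet.

    Assumption: a snippet that fails to parse contributes nothing.
    """
    try:
        tree = ast.parse(snippet)
    except (SyntaxError, ValueError):
        return set()
    packages = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) refer to the snippet's own project.
            if node.level == 0 and node.module:
                packages.add(node.module.split(".")[0])
    return packages


def novelty_stream(snippets):
    """Yield (new_packages, new_pairs) per snippet, in the order given."""
    seen_packages = set()
    seen_pairs = set()
    for snippet in snippets:
        packages = top_level_packages(snippet)
        # Sort each pair so (a, b) and (b, a) count as the same combination.
        pairs = {tuple(sorted(pair)) for pair in combinations(packages, 2)}
        yield packages - seen_packages, pairs - seen_pairs
        seen_packages |= packages
        seen_pairs |= pairs
```

Summing the sizes of the yielded sets over time gives the cumulative counts of distinct packages and distinct pairs. A real pipeline would also need the normalisation and noise-filtering steps mentioned above, which this sketch leaves out.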
Once these signals are extracted, a surprisingly coherent picture emerges. The Python ecosystem introduces brand-new packages less and less frequently over time, yet continues to generate new combinations of packages at a remarkably steady pace. Developers reuse familiar tools, but they also explore the space of possible pairings with a precision that looks — statistically — almost mechanical.
To understand just how surprising this is, we compare Python’s behavior with two very different worlds. The first is the US patent system, where technology codes assigned to inventions can be analyzed the same way we analyze Python imports. A classic 2015 study by Youn et al. showed that while new technology codes appear at a slowing rate, pairs of codes accumulate almost linearly over two centuries of innovation. The second is a corpus of physics publications, which behaves in much the same way when one treats subject classification codes as ingredients.
Across all three domains (software, science, and invention), the same pattern holds. The number of distinct components grows sublinearly with the amount of material observed (Heaps’ law), while the number of distinct combinations grows close to linearly. This parallel is not only unexpected; it suggests that these systems share a deeper underlying mechanism, shaped less by domain-specific details than by general patterns of human innovation.
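One simple way to check sublinear versus linear growth is to estimate the slope of the cumulative count on a log-log scale: a slope clearly below 1 indicates Heaps-like growth, and a slope near 1 indicates linear growth. The sketch below is an assumption about how such a check could be done with an ordinary least-squares fit; it is not necessarily the estimation method used in the talk, and `loglog_slope` is a hypothetical name.

```python
import math


def loglog_slope(cumulative_counts):
    """Least-squares slope of log(count) against log(n).

    n is the 1-based position in the stream; zero counts are skipped
    because their logarithm is undefined.
    """
    points = [
        (math.log(n), math.log(count))
        for n, count in enumerate(cumulative_counts, start=1)
        if count > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two positive counts")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```

Applied to the cumulative number of distinct packages and to the cumulative number of distinct pairs, the two slopes can then be compared directly. A plain log-log regression is a rough diagnostic only; it weights early points heavily and says nothing about goodness of fit.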
In the second half of the talk, we introduce the concept of the "adjacent possible" and show how to capture it in a simple stochastic model: a Pólya urn extended with an adjacent-possible mechanism. The model assumes only two forces: reinforcement of frequently used components and occasional introduction of new ones. Despite its simplicity, it reproduces the empirical behavior of all three systems without requiring domain-specific rules. It shows how a stable exploration–exploitation balance can arise naturally, leading to predictable rates of combinatorial novelty even in rapidly changing ecosystems.
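To make the two forces concrete, here is a minimal simulation of one common formulation of such a model, often called an urn with triggering. It is a sketch under stated assumptions and may differ from the variant presented in the talk: the parameters `rho` (copies added on every draw) and `nu` (so that `nu + 1` brand-new colors enter the urn whenever a color is drawn for the first time), the starting urn, and the function names are all illustrative choices.

```python
import random


def simulate_urn(steps, rho=4, nu=2, seed=0):
    """Draw `steps` balls from an urn with reinforcement and triggering."""
    rng = random.Random(seed)
    urn = list(range(nu + 1))  # assumed start: nu + 1 distinct colors
    next_color = nu + 1
    seen = set()
    sequence = []
    for _ in range(steps):
        color = rng.choice(urn)
        sequence.append(color)
        # Reinforcement: the drawn color becomes more likely to be drawn again.
        urn.extend([color] * rho)
        if color not in seen:
            seen.add(color)
            # Adjacent possible: a first-time draw adds nu + 1 unseen colors.
            urn.extend(range(next_color, next_color + nu + 1))
            next_color += nu + 1
    return sequence


def distinct_growth(sequence):
    """Cumulative number of distinct items after each draw."""
    seen = set()
    growth = []
    for item in sequence:
        seen.add(item)
        growth.append(len(seen))
    return growth
```

With `nu` smaller than `rho`, reinforcement outpaces the supply of new colors, so the curve returned by `distinct_growth(simulate_urn(10_000))` should flatten over time in the sublinear way described above. Each draw here stands for a single component; modelling pairs would require grouping draws into collections, which this sketch does not attempt.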
The framework offers a new way to think about the Python ecosystem: not as a chaotic swarm of libraries, but as an innovation system governed by universal constraints. It sheds light on why certain libraries become dominant, why the combination space grows the way it does, and how the community collectively expands the “adjacent possible” of the language.
Attendees of the talk will learn:
• how to extract meaningful innovation signals from real Python code at scale,
• how to measure novelty and combinatorial creativity in software ecosystems,
• why Python’s long-term evolution aligns with empirical laws from patents and science,
• and how simple generative models can help reason about complex developer behavior.
The talk connects engineering, data analysis, and innovation theory to reveal an unexpected insight: Python grows the way many creative systems grow — slowly at the edges, rapidly in combinations, and always under the quiet guidance of reinforcement and the adjacent possible.