It Works on My Machine: Why LLM Apps Fail Users (Not Tests)

Thomas Prexl, Frank Rust

Natural Language Processing & Audio (incl. Generative AI NLP)
Python Skill: Intermediate
Domain Expertise: Intermediate

You've deployed an LLM application. Your tests show that it's working. The metrics look good. Then a user says it's broken.

This happens more often than you would expect.

In this talk, we'll share our experience of building and maintaining LLM applications, and discuss what we've learned about the discrepancy between evaluation results and user experience.

We will explore three dimensions of evaluation through the lens of user experience:

Expectations: What does 'working' actually mean to your users?

Sometimes the gap between tests and reality comes down to expectations. Questions that seem obviously hard to users turn out to be easy for the LLM, and vice versa. Understanding this mismatch is the first step to building systems that users actually trust.

Functional: Does the system do what it's supposed to do?

When you're working with LLMs, individual components might pass tests while the whole system fails. With prompts, model parameters, evaluation criteria, metadata, and ever-growing datasets all interacting, the complexity compounds quickly.

Operational: Does it remain reliable in real-world conditions?

In this section, we'll share practical lessons from operating LLM applications in production: how we use observability tools like Opik to monitor model behavior, how telemetry helps us understand actual usage patterns, and how dedicated validation endpoints allow us to detect issues in on-premises deployments before users do.
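To make the idea of a validation endpoint concrete, here is a minimal sketch in Python. It is not the speakers' implementation: the names Canary, run_validation, and fake_model are hypothetical, and the substring check stands in for whatever evaluation criteria a real deployment would use. It assumes the deployed pipeline can be called as a function from prompt to answer, and that a handler behind an internal route would return the resulting report.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Canary:
    """One fixed prompt plus a substring the answer is expected to contain."""
    prompt: str
    must_contain: str


def run_validation(ask: Callable[[str], str], canaries: List[Canary]) -> dict:
    """Run every canary prompt through `ask` and summarize the failures."""
    failures = []
    for canary in canaries:
        try:
            answer = ask(canary.prompt)
        except Exception as exc:  # a crashed call counts as a failure too
            failures.append({"prompt": canary.prompt, "reason": f"error: {exc}"})
            continue
        # Case-insensitive substring match: a deliberately simple criterion.
        if canary.must_contain.lower() not in answer.lower():
            failures.append({"prompt": canary.prompt, "reason": "unexpected answer"})
    return {
        "status": "ok" if not failures else "degraded",
        "checked": len(canaries),
        "failures": failures,
    }


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for the deployed LLM pipeline."""
    if "France" in prompt:
        return "The capital of France is Paris."
    return "I don't know."


report = run_validation(
    fake_model,
    [
        Canary("What is the capital of France?", "paris"),
        Canary("What is 2 + 2?", "4"),
    ],
)
print(report["status"], len(report["failures"]))
```

Running a fixed set of prompts on a schedule is one way to surface a silent change, such as a swapped model or an altered prompt template, before a user reports it. The report only covers the prompts it contains, so it complements observability and telemetry rather than replacing them.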

We'll discuss real-life scenarios we've encountered: users who expected results different from those our system delivered, external changes that silently affected the system, and performance that drifted in ways our metrics didn't detect.

This isn't a talk about frameworks or tools (even though we'll mention a few). It's about the human element of evaluation: ensuring that the system we built serves the people using it.

Whether you're just starting out with LLM applications or running them at scale, you'll probably recognize these scenarios. We'll share the strategies and patterns that we've developed, not as prescriptive rules, but as a starting point for your own approach.

Outline

  1. Why users report that the LLM application is broken even though it passes every test
  2. Three dimensions of the problem
    • Expectations
    • Functional
    • Operational
  3. Real-life scenarios
  4. Our current strategies and patterns
  5. Evaluation = understanding if the system serves users, not proving it's good

Thomas Prexl

Thomas builds LLM applications that create business impact. He co-founded neunzehn innovations GmbH to bring generative AI into companies that need it.

Before that, he ran startup support in Heidelberg: designing accelerators, connecting founders with money and know-how, and launching events like Neurons & Neckar, Sensors & Data Hackathon, and Startup Weekend Rhein-Neckar. Earlier, he worked in marketing and business development in electrical engineering and diagnostics.

He studied at Mannheim, got his doctorate at Basel, teaches at both Heidelberg and Mannheim, and talks about AI when someone asks him to.

Frank Rust

Frank is deeply passionate about technological advancements and a co-founder of neunzehn innovations, a company specializing in AI solutions. His professional background combines entrepreneurial experience (he established an innovation and strategy consultancy focused on strategy and deep tech) with several years at a major software corporation. During his time in the software industry, he contributed to multiple product and service launches, working across various teams to bring new offerings to market. Outside the office, he enjoys discovering new horizons in his camper van.