You've deployed an LLM application. Your tests show that it's working. The metrics look good. Then a user says it's broken.
This happens more often than you would expect.
In this talk, we'll share our experience building and maintaining LLM applications, and discuss what we've learned about the gap between evaluation results and user experience.
We will explore three dimensions of evaluation through the lens of user experience:
Sometimes the gap between tests and reality comes down to expectations. Questions that users assume are hard turn out to be easy for the LLM, and vice versa. Understanding this mismatch is the first step to building systems that users actually trust.
When you're working with LLMs, individual components might pass tests while the whole system fails. With prompts, model parameters, evaluation criteria, metadata, and ever-growing datasets all interacting, the complexity compounds quickly.
In this section, we'll share practical lessons from operating LLM applications in production: how we use observability tools like Opik to monitor model behavior, how telemetry helps us understand actual usage patterns, and how dedicated validation endpoints allow us to detect issues in on-premises deployments before users do.
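To make the validation-endpoint idea concrete, here is a minimal sketch of how such a check might look. All names here (`CanaryCheck`, `run_validation`, the stubbed model) are hypothetical illustrations, not part of any specific tool discussed in the talk:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch: a dedicated validation endpoint runs a small set of
# canary prompts against the deployed model and flags regressions before
# users notice them. `call_model` stands in for the real inference client.

@dataclass
class CanaryCheck:
    prompt: str
    passes: Callable[[str], bool]  # predicate over the model's reply


def run_validation(call_model: Callable[[str], str],
                   checks: list[CanaryCheck]) -> dict:
    """Return a report suitable for exposing at e.g. GET /validate."""
    failures = [c.prompt for c in checks
                if not c.passes(call_model(c.prompt))]
    return {"healthy": not failures, "failed_prompts": failures}


# Usage with a stubbed model standing in for the deployment under test:
checks = [
    CanaryCheck("Reply with OK", lambda reply: "OK" in reply),
    CanaryCheck("What is 2+2?", lambda reply: "4" in reply),
]
stub_model = lambda prompt: "OK, the answer is 4"
report = run_validation(stub_model, checks)
```

In an on-premises deployment, an operator can hit such an endpoint after any environment change and catch silent breakage without waiting for a user report.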
We'll discuss real-life scenarios we've encountered, such as when users expected different results from those our system delivered, when external changes affected the system silently, and when performance drifted in ways that our metrics didn't detect.
This isn't a talk about frameworks or tools (even though we'll mention a few). It's about the human element of evaluation: ensuring that the system we built serves the people using it.
Whether you're just starting out with LLM applications or running them at scale, you'll probably recognize these scenarios. We'll share the strategies and patterns that we've developed, not as prescriptive rules, but as a starting point for your own approach.