Testing traditional software is "simple"... same input, same output. LLMs? Not so much. Same prompt, different result every time. So how do you actually know if your AI product is good?
Spoiler: Most teams don't. They ship on vibes and hope for the best.
This talk takes you through our real journey at Blue Yonder, where we built an LLM-powered analytics system and needed a way to actually measure its quality. You'll see how we went from "feels okay-ish" to concrete numbers that let us make real decisions - with actual examples from production along the way.
The methodology is called Error Analysis: collect traces, annotate them from the user's perspective, group similar issues into failure modes, and turn those into automated evals. Along the way, we'll share practical best practices, such as why binary Pass/Fail beats rating scales, and why a 100% pass rate means your evals are broken.
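To make the last step concrete, here's a minimal sketch of what "turn failure modes into automated evals" can look like. All names (`Trace`, the two toy checks, the sample traces) are illustrative stand-ins, not Blue Yonder's actual code; the point is just the shape: one binary Pass/Fail check per failure mode, run over a set of traces to get per-mode pass rates.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trace:
    prompt: str
    response: str

# Each eval is a binary Pass/Fail check derived from one annotated failure mode.
Eval = Callable[[Trace], bool]

def run_evals(traces: list[Trace], evals: dict[str, Eval]) -> dict[str, float]:
    """Return the pass rate per failure mode across all traces."""
    return {
        name: sum(check(t) for t in traces) / len(traces)
        for name, check in evals.items()
    }

# Two toy checks standing in for real failure modes:
def answers_the_question(t: Trace) -> bool:
    return len(t.response.strip()) > 0

def no_refusal(t: Trace) -> bool:
    return "I cannot" not in t.response

traces = [
    Trace("Total sales last week?", "Sales were 1.2M units."),
    Trace("Forecast for SKU 42?", "I cannot answer that."),
]
rates = run_evals(traces, {"answers": answers_the_question, "no_refusal": no_refusal})
# rates == {"answers": 1.0, "no_refusal": 0.5}
```

Because every check is binary, the output is a plain percentage per failure mode, which is exactly what makes model-vs-model comparisons a same-day decision rather than a debate.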
The payoff? When a new model drops, we run our pipeline and know within hours - not weeks - whether it's better or worse for our specific use case. Real percentages. Real trade-offs. Real decisions.
Expect a meme-powered walkthrough and a clear path to implement this yourself starting with just 20 traces.
Outline: