Testing traditional software is "simple"... same input, same output. LLMs? Not so much. Same prompt, different result every time. So how do you actually know if your AI product is good?
Spoiler: Most teams don't. They ship on vibes and hope for the best.
This talk takes you through our real journey at Blue Yonder, where we built an LLM-powered analytics system and needed a way to actually measure its quality. You'll see how we went from "feels okay-ish" to concrete numbers that let us make real decisions - with actual examples from production along the way.
The methodology is called Error Analysis: collect traces, annotate them from the user's perspective, group similar issues into failure modes, and turn those into automated evals. Along the way, we'll share practical best practices, such as why binary Pass/Fail beats rating scales, and why a 100% pass rate means your evals are broken.
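To make the last step concrete, here's a minimal sketch of what "turn failure modes into automated evals" can look like. All names (`Trace`, the two toy checks, the sample traces) are illustrative stand-ins, not Blue Yonder's actual code; the point is just the shape: one binary Pass/Fail check per failure mode, run over a set of traces to get per-mode pass rates.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trace:
    prompt: str
    response: str

# Each eval is a binary Pass/Fail check derived from one annotated failure mode.
Eval = Callable[[Trace], bool]

def run_evals(traces: list[Trace], evals: dict[str, Eval]) -> dict[str, float]:
    """Return the pass rate per failure mode across all traces."""
    return {
        name: sum(check(t) for t in traces) / len(traces)
        for name, check in evals.items()
    }

# Two toy checks standing in for real failure modes:
def answers_the_question(t: Trace) -> bool:
    return len(t.response.strip()) > 0

def no_refusal(t: Trace) -> bool:
    return "I cannot" not in t.response

traces = [
    Trace("Total sales last week?", "Sales were 1.2M units."),
    Trace("Forecast for SKU 42?", "I cannot answer that."),
]
rates = run_evals(traces, {"answers": answers_the_question, "no_refusal": no_refusal})
# rates == {"answers": 1.0, "no_refusal": 0.5}
```

Because every check is binary, the output is a plain percentage per failure mode, which is exactly what makes model-vs-model comparisons a same-day decision rather than a debate.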
The payoff? When a new model drops, we run our pipeline and know within hours - not weeks - whether it's better or worse for our specific use case. Real percentages. Real trade-offs. Real decisions.
Expect a meme-powered walkthrough and a clear path to implement this yourself starting with just 20 traces.
Outline: