In this talk, we will walk through a concrete production-style example of an LLM-based agent that automatically classifies and routes incoming customer support tickets. The agent takes raw ticket text as input, predicts a priority label, and routes the ticket to the appropriate support queue. A human override is possible but expected to be rare.
At deployment time, the system performs well. Classification confidence is high, fallback usage is low, and manual corrections are infrequent. Over time, however, the environment changes: new products are launched, outages introduce new failure modes, terminology evolves, and internal definitions of ticket priorities shift. Nothing crashes, latency remains stable, and traditional service-level metrics stay green; yet the agent’s decisions slowly degrade.
This talk focuses on how to observe, measure, and act on that degradation.
Using recorded ticket data and a demo, I will show how to instrument an LLM-based agent with continuous evaluation signals, such as classification confidence distributions, fallback usage, and the rate of manual corrections by human operators.
I will demonstrate how these signals can be computed in rolling time windows, visualised on simple dashboards, and connected to alert thresholds. Rather than relying on a single accuracy number, the talk shows how multiple weak signals together reveal silent failure modes that would otherwise go unnoticed.
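To make the rolling-window idea concrete, here is a minimal sketch of the kind of computation involved. It is illustrative only, not the talk's actual pipeline; the record fields (`confidence`, `used_fallback`, `human_override`) and the threshold names are assumptions about how ticket outcomes might be logged:

```python
from collections import deque
from statistics import mean

class RollingSignals:
    """Track weak evaluation signals for an LLM routing agent
    over a sliding window of recent tickets."""

    def __init__(self, window_size=500):
        # Keep only the most recent `window_size` ticket outcomes.
        self.window = deque(maxlen=window_size)

    def record(self, confidence, used_fallback, human_override):
        """Log one ticket outcome (hypothetical fields)."""
        self.window.append((confidence, used_fallback, human_override))

    def signals(self):
        """Current rolling-window value of each weak signal."""
        n = len(self.window)
        if n == 0:
            return {}
        return {
            "mean_confidence": mean(c for c, _, _ in self.window),
            "fallback_rate": sum(f for _, f, _ in self.window) / n,
            "override_rate": sum(o for _, _, o in self.window) / n,
        }

def check_alerts(signals, thresholds):
    """Compare each signal against its alert threshold
    (direction-aware: low confidence is bad, high rates are bad)."""
    alerts = []
    if signals.get("mean_confidence", 1.0) < thresholds["min_confidence"]:
        alerts.append("confidence below threshold")
    if signals.get("fallback_rate", 0.0) > thresholds["max_fallback_rate"]:
        alerts.append("fallback rate above threshold")
    if signals.get("override_rate", 0.0) > thresholds["max_override_rate"]:
        alerts.append("override rate above threshold")
    return alerts
```

The point of the sketch is that no single signal is decisive: each check is cheap, but an alert on two or three of them at once is what reveals a silent failure mode.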
The focus is deliberately not on training new models or tuning prompts. Instead, we concentrate on operating LLM-based agents safely after deployment. You will see how to build a continuous evaluation pipeline, how to distinguish normal variation from meaningful drift, and how to decide when intervention is required: whether that means retraining, prompt changes, label redefinition, or a temporary rollback to human routing.
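One simple way to separate normal variation from meaningful drift, sketched below under the assumption that per-window signal values from a known-healthy period are available as a baseline. This is just one possible statistical test (a z-score on window means), not necessarily the method used in the talk:

```python
from statistics import mean, stdev

def is_meaningful_drift(baseline, recent, z_threshold=3.0):
    """Flag drift when the latest window's signal value falls outside
    the baseline's normal variation band.

    baseline    -- list of per-window signal values from a healthy period
    recent      -- the latest window's value of the same signal
    z_threshold -- how many baseline standard deviations count as drift
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Baseline never varied; any change at all is notable.
        return recent != mu
    z = abs(recent - mu) / sigma
    return z > z_threshold
```

The design choice here is that "normal variation" is defined empirically from the agent's own history rather than by a hand-picked absolute bound, so the same test can be applied to confidence, fallback rate, or override rate without per-signal tuning.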
By the end of the talk, attendees will have a clear, practical blueprint for monitoring LLM-based agents in production and for detecting quiet, confident failure modes before they affect users or business operations.