In this talk, we will walk through a concrete production-style example of an LLM-based agent that automatically classifies and routes incoming customer support tickets. The agent takes raw ticket text as input, predicts a priority label, and routes the ticket to the appropriate support queue. A human override is possible but expected to be rare.
At deployment time, the system performs well. Classification confidence is high, fallback usage is low, and manual corrections are infrequent. Over time, however, the environment changes: new products are launched, outages introduce new failure modes, terminology evolves, and internal definitions of ticket priorities shift. Nothing crashes, latency remains stable, and traditional service-level metrics stay green; yet the agent’s decisions slowly degrade.
This talk focuses on how to observe, measure, and act on that degradation.
Using recorded ticket data and a demo, I will show how to instrument an LLM-based agent with continuous evaluation signals, such as classification confidence distributions, fallback usage, and the rate of manual corrections by human operators.
I will demonstrate how these signals can be computed in rolling time windows, visualised on simple dashboards, and connected to alert thresholds. Rather than relying on a single accuracy number, the talk shows how multiple weak signals together reveal silent failure modes that would otherwise go unnoticed.
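To make the rolling-window idea concrete, here is a minimal sketch of the kind of computation involved. It is illustrative only, not the talk's actual pipeline; the record fields (`confidence`, `used_fallback`, `human_override`) and the threshold names are assumptions about how ticket outcomes might be logged:

```python
from collections import deque
from statistics import mean

class RollingSignals:
    """Track weak evaluation signals for an LLM routing agent
    over a sliding window of recent tickets."""

    def __init__(self, window_size=500):
        # Keep only the most recent `window_size` ticket outcomes.
        self.window = deque(maxlen=window_size)

    def record(self, confidence, used_fallback, human_override):
        """Log one ticket outcome (hypothetical fields)."""
        self.window.append((confidence, used_fallback, human_override))

    def signals(self):
        """Current rolling-window value of each weak signal."""
        n = len(self.window)
        if n == 0:
            return {}
        return {
            "mean_confidence": mean(c for c, _, _ in self.window),
            "fallback_rate": sum(f for _, f, _ in self.window) / n,
            "override_rate": sum(o for _, _, o in self.window) / n,
        }

def check_alerts(signals, thresholds):
    """Compare each signal against its alert threshold
    (direction-aware: low confidence is bad, high rates are bad)."""
    alerts = []
    if signals.get("mean_confidence", 1.0) < thresholds["min_confidence"]:
        alerts.append("confidence below threshold")
    if signals.get("fallback_rate", 0.0) > thresholds["max_fallback_rate"]:
        alerts.append("fallback rate above threshold")
    if signals.get("override_rate", 0.0) > thresholds["max_override_rate"]:
        alerts.append("override rate above threshold")
    return alerts
```

The point of the sketch is that no single signal is decisive: each check is cheap, but an alert on two or three of them at once is what reveals a silent failure mode.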
The focus is deliberately not on training new models or tuning prompts. Instead, we concentrate on operating LLM-based agents safely after deployment. You will see how to build a continuous evaluation pipeline, how to distinguish normal variation from meaningful drift, and how to decide when intervention is required: whether that means retraining, prompt changes, label redefinition, or a temporary rollback to human routing.
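One simple way to separate normal variation from meaningful drift, sketched below under the assumption that per-window signal values from a known-healthy period are available as a baseline. This is just one possible statistical test (a z-score on window means), not necessarily the method used in the talk:

```python
from statistics import mean, stdev

def is_meaningful_drift(baseline, recent, z_threshold=3.0):
    """Flag drift when the latest window's signal value falls outside
    the baseline's normal variation band.

    baseline    -- list of per-window signal values from a healthy period
    recent      -- the latest window's value of the same signal
    z_threshold -- how many baseline standard deviations count as drift
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Baseline never varied; any change at all is notable.
        return recent != mu
    z = abs(recent - mu) / sigma
    return z > z_threshold
```

The design choice here is that "normal variation" is defined empirically from the agent's own history rather than by a hand-picked absolute bound, so the same test can be applied to confidence, fallback rate, or override rate without per-signal tuning.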
By the end of the talk, attendees will have a clear, practical blueprint for monitoring LLM-based agents in production and for detecting quiet, confident failure modes before they affect users or business operations.