Evaluating AI Agents: Building Reliable AI Applications with Kotlin & Spring AI

In this talk from the Kotlin Meetup Rotterdam, senior software engineer and trainer Peter explains why AI agent applications so often fail in production, and what you can do about it. Traditional software relies on deterministic systems with clear pass/fail tests, but AI agents are inherently probabilistic: LLMs hallucinate, embedding models return inconsistent results, and there is no guarantee that a tool call will actually be made. That demands a different approach to quality assurance.

Peter introduces Eval-Driven Development (EDD) as the answer: a methodology that brings test-driven development (TDD) into the AI era. Coding live in Kotlin with Spring AI, he demonstrates how to set up an evaluation harness, define quality criteria, use an LLM as a judge (LLM-as-Judge), and feed production data back into your test suite. The talk also covers observability (Langfuse), red teaming, and user feedback loops.
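
To give a feel for what such a harness can look like, here is a minimal sketch in plain Kotlin. It is not the framework shown in the talk and it does not use Spring AI: every name in it (EvalCase, EvalResult, Evaluator, containsEvaluator, llmJudge, passRate) is hypothetical, and the agent and the judge are stub functions so the example runs without a model. It shows a deterministic contains check, an LLM-as-Judge check, and a pass rate over repeated runs, since a probabilistic system is better scored as a rate than as a single pass/fail.

```kotlin
// Illustrative sketch only: all names are hypothetical, not the talk's or Spring AI's API.

/** One evaluation case: a prompt plus text a good answer must contain. */
data class EvalCase(val input: String, val expectedSubstring: String)

/** Outcome of scoring a single answer. */
data class EvalResult(val passed: Boolean, val reason: String)

/** An evaluator scores one answer for one case. */
fun interface Evaluator {
    fun evaluate(case: EvalCase, answer: String): EvalResult
}

/** Deterministic check: the answer must contain the expected text (case-insensitive). */
val containsEvaluator = Evaluator { case, answer ->
    val ok = answer.contains(case.expectedSubstring, ignoreCase = true)
    EvalResult(ok, if (ok) "found expected text" else "missing '${case.expectedSubstring}'")
}

/** LLM-as-Judge: asks a second model for a verdict. `judge` stands in for a real model call. */
fun llmJudge(judge: (String) -> String) = Evaluator { case, answer ->
    val prompt = "Question: ${case.input}\nAnswer: $answer\n" +
        "Does the answer correctly address the question? Reply PASS or FAIL, then a short reason."
    val verdict = judge(prompt).trim()
    EvalResult(verdict.startsWith("PASS", ignoreCase = true), verdict)
}

/** Runs the agent `runs` times per case and returns the fraction of runs that passed. */
fun passRate(
    agent: (String) -> String,
    cases: List<EvalCase>,
    evaluator: Evaluator,
    runs: Int = 3,
): Double {
    val results = cases.flatMap { c -> List(runs) { evaluator.evaluate(c, agent(c.input)) } }
    return if (results.isEmpty()) 0.0 else results.count { it.passed }.toDouble() / results.size
}

fun main() {
    // Stub agent and judge so the sketch runs without any model or network access.
    val agent: (String) -> String = { q ->
        if ("capital" in q) "The capital of the Netherlands is Amsterdam." else "I don't know."
    }
    val judge: (String) -> String = { p ->
        if ("Amsterdam" in p) "PASS: names the capital" else "FAIL: no answer given"
    }
    val cases = listOf(
        EvalCase("What is the capital of the Netherlands?", "Amsterdam"),
        EvalCase("Which city hosts the Kotlin Meetup mentioned above?", "Rotterdam"),
    )
    println("contains pass rate: ${passRate(agent, cases, containsEvaluator)}")
    println("judge pass rate:    ${passRate(agent, cases, llmJudge(judge))}")
}
```

In a real setup the two stubs would be replaced by calls to the agent under test and to a judge model, and the pass rate would be compared against a threshold chosen for the application.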

00:00 Introduction — who is Peter?
02:30 Why AI projects fail: stats, the vending machine benchmark & the sorcerer's apprentice
06:20 Traditional testing vs. AI agents: deterministic vs. probabilistic systems
13:30 Introducing Eval-Driven Development (EDD): accuracy, cost & latency
16:20 Step 1: Define goals, users, scenarios and a Minimum Viable Evaluation
21:20 Live demo: Kotlin eval framework — contains evaluator & LLM-as-Judge
26:00 Live demo: advanced evaluators — RAG, hallucination, tool calls & conversation simulation
35:10 Calibrating judges: off-the-shelf vs. manual labeling
36:00 Security: red teaming and LLM vulnerabilities
37:20 In production: observability (Langfuse), monitoring & user feedback loops
41:20 Conclusion: build your harness from day one
42:40 Q&A: go-live baseline, red teaming costs, judge bias & GDPR

Video "Evaluating AI Agents: Building Reliable AI Applications with Kotlin & Spring AI" from the Maqqie channel