How To Debug AI Agents: Tracing, Observability & Evals

When you build an AI agent from instructions, LLMs, and tools, it quickly becomes a black box. Your traditional unit tests pass, yet silent failures slip through to your users in production. In this technical walkthrough, we break down how to get production visibility using open-source observability, agent tracing, and evaluations (evals) designed for non-deterministic outputs.

Using a real-world procurement-approval agent built with the Microsoft Agent Framework and Azure OpenAI on Microsoft Foundry, you’ll see how to move from guessing to decisions backed by trace and eval data.

Learn how to:
Implement OpenInference: Extend standard OpenTelemetry semantic conventions to capture AI-specific data like tool calls, tokens, prompt variations, and exact costs.
Visualize with Phoenix: Stream live traces into the free, open-source AI observability platform to track complex multi-turn workflows.
Deploy LLM-as-a-Judge: Construct robust grounding checks using an AI judge to evaluate agent decisions at scale when manual human validation is impossible.
Automate Self-Improving Loops: Leverage evaluation harnesses alongside coding agents (like Claude Code and Copilot CLI) to systematically iterate on prompts and watch your pass rate climb from 40% to 90%.
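
To make the last two points concrete, here is a minimal sketch of an eval harness that runs a deterministic code eval and a grounding-check judge over a few cases and reports a pass rate. It is not the code from the video: the `Case` fields, the sample data, and every function name are hypothetical, and `stub_judge` is a simple number-matching stand-in for where a real LLM judge call would go.

```python
import re
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Case:
    """One agent run to evaluate. All data below is made up for illustration."""
    context: str    # evidence the agent was given (request + policy)
    decision: str   # the agent's final decision
    rationale: str  # the agent's explanation for that decision


def code_eval(case: Case) -> bool:
    """Deterministic check: the decision must be one of the allowed labels."""
    return case.decision in {"approve", "reject", "escalate"}


def stub_judge(case: Case) -> bool:
    """Stand-in for an LLM judge (assumption: a real judge would prompt a model).

    Naive grounding rule: every number cited in the rationale must also
    appear in the context the agent was given.
    """
    known = set(re.findall(r"\d+", case.context))
    cited = re.findall(r"\d+", case.rationale)
    return all(number in known for number in cited)


def both(case: Case) -> bool:
    """A case passes only if it clears the code eval and the judge."""
    return code_eval(case) and stub_judge(case)


def pass_rate(cases: Sequence[Case], evaluator: Callable[[Case], bool]) -> float:
    """Fraction of cases the evaluator marks as passing."""
    if not cases:
        return 0.0
    return sum(evaluator(case) for case in cases) / len(cases)


POLICY = "Policy: purchases under 5000 USD are auto-approved."
CASES = [
    Case(f"Request: 1200 USD laptop. {POLICY}", "approve",
         "Amount 1200 USD is under the 5000 USD limit."),
    Case(f"Request: 9000 USD server. {POLICY}", "approve",
         "The limit is 10000 USD, so 9000 USD is fine."),  # ungrounded limit
    Case(f"Request: 300 USD monitor. {POLICY}", "maybe",
         "300 USD is under the 5000 USD limit."),          # invalid label
]

if __name__ == "__main__":
    for name, evaluator in [("code", code_eval), ("judge", stub_judge), ("both", both)]:
        print(f"{name}: {pass_rate(CASES, evaluator):.0%}")
```

A self-improving loop would rerun a harness like this after each prompt change and keep the change only when the pass rate goes up.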

Chapters:
00:00 Why your AI agent is a black box
00:26 The example: a procurement agent in Microsoft Agent Framework
02:17 Why OpenTelemetry alone isn't enough for AI
02:49 OpenInference: OTEL semantic conventions for agents
04:43 Reading a real agent trace in Phoenix
08:40 The harder question: is your agent actually working?
10:43 Evals 101: code evals vs. LLM-as-a-judge
12:10 Building a grounding-check judge
14:00 Reading the eval results (we only pass 40% of the time)
15:48 Experiments: swap the model, watch the score change
17:21 Phoenix AI skills and self-improving agent loops

Resources:
🔬 Phoenix (open source): https://phoenix.arize.com
🔗 Arize AX: https://arize.com
📖 OpenInference: https://github.com/Arize-ai/openinference
📖 Phoenix docs: https://docs.arize.com/phoenix

Got an agent that's been driving you nuts in production? Drop the specific failure mode in the comments below. Our engineering team reads every one.

If this deep dive leveled up your AI infrastructure toolkit, make sure to like, subscribe, and hit the bell for more technical agent engineering videos: https://www.youtube.com/@arizeai?sub_confirmation=1

#AIEngineering #AIAgents #AgentEvals

Video: How To Debug AI Agents: Tracing, Observability & Evals, from the Arize AI channel