The maturity phases of running evals — Phil Hetzel, Braintrust

Most teams approach evals like unit tests and try to cover every possible failure. Phil Hetzel from Braintrust argues that is the wrong frame: enumerate your known failure modes, cover those specifically, and ship. The goal is a flywheel where production traces surface what is going wrong, feed back into offline experimentation, and guide the next improvement.
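The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation or a Braintrust API: the `FailureMode` class, the two example failure modes, and the trace fields (`input`, `output`, `needs_citation`) are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailureMode:
    """A known failure mode and a check that detects it in a trace (hypothetical)."""
    name: str
    detect: Callable[[dict], bool]  # returns True when the trace shows the failure


# Assumed examples of enumerated failure modes; a real list comes from your own product.
FAILURE_MODES = [
    FailureMode("empty_answer", lambda t: not t["output"].strip()),
    FailureMode(
        "missing_citation",
        lambda t: t["needs_citation"] and "[source]" not in t["output"],
    ),
]


def triage(traces: list[dict]) -> dict[str, list[dict]]:
    """Group production traces under each known failure mode they exhibit."""
    buckets: dict[str, list[dict]] = {fm.name: [] for fm in FAILURE_MODES}
    for trace in traces:
        for fm in FAILURE_MODES:
            if fm.detect(trace):
                buckets[fm.name].append(trace)
    return buckets


def build_offline_dataset(buckets: dict[str, list[dict]]) -> list[dict]:
    """Turn failing traces into rows for offline experimentation."""
    return [
        {"input": trace["input"], "failure_mode": name}
        for name, traces in buckets.items()
        for trace in traces
    ]


production_traces = [
    {"input": "What is our refund policy?", "output": "", "needs_citation": False},
    {"input": "Summarize the Q3 report.", "output": "Revenue grew.", "needs_citation": True},
    {"input": "Say hello.", "output": "Hello!", "needs_citation": False},
]

buckets = triage(production_traces)
dataset = build_offline_dataset(buckets)
```

Here only the two failing traces reach the offline dataset, each labeled with the failure mode it is meant to cover. The healthy trace is left out.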

The session walks through four maturity stages. It starts with vibe checking, where human reviewers document their justifications rather than just giving a thumbs up or down. Next comes LLM-as-judge scoring, built from those justifications and run at scale. Then comes the hard part: tool calls that touch external systems. Context-gathering tools are manageable. CRUD tools are not, because you have to represent the state of the external systems at the exact moment the original trace ran. Two approaches for getting there are timestamp queries against a vector database and injecting captured system state directly into the trace.
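The two approaches to reproducing external state can be sketched as follows. This is a minimal illustration under stated assumptions, not the method shown in the talk: an in-memory, timestamp-ordered history stands in for the timestamp query against a vector database, and `StateHistory`, `state_for_replay`, `replay_update_tool`, and the trace fields (`timestamp`, `tool_args`, `captured_state`) are hypothetical names.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class StateHistory:
    """Append-only history of one external record (stand-in for a real store)."""
    timestamps: list = field(default_factory=list)
    snapshots: list = field(default_factory=list)

    def record(self, ts: float, snapshot: dict) -> None:
        # Assumes snapshots are recorded in increasing timestamp order.
        self.timestamps.append(ts)
        self.snapshots.append(dict(snapshot))

    def as_of(self, ts: float) -> dict:
        """Return the latest snapshot recorded at or before ts."""
        i = bisect_right(self.timestamps, ts)
        if i == 0:
            raise LookupError("no state recorded at or before this timestamp")
        return dict(self.snapshots[i - 1])


def state_for_replay(trace: dict, history: StateHistory) -> dict:
    """Prefer state captured inside the trace; otherwise query by timestamp."""
    if "captured_state" in trace:
        return dict(trace["captured_state"])
    return history.as_of(trace["timestamp"])


def replay_update_tool(trace: dict, history: StateHistory) -> dict:
    """Re-run a hypothetical update tool call against the reconstructed state."""
    state = state_for_replay(trace, history)
    state.update(trace["tool_args"])  # applied to a copy, never to the live system
    return state


history = StateHistory()
history.record(100.0, {"status": "open", "owner": "ana"})
history.record(200.0, {"status": "closed", "owner": "ana"})

# Approach 1: timestamp query. The trace ran at t=150, before the record was closed.
by_timestamp = replay_update_tool(
    {"timestamp": 150.0, "tool_args": {"owner": "bo"}}, history
)

# Approach 2: the state was captured and stored in the trace itself.
by_injection = replay_update_tool(
    {
        "timestamp": 250.0,
        "captured_state": {"status": "pending", "owner": "cy"},
        "tool_args": {"status": "closed"},
    },
    history,
)
```

In both cases the tool call runs against a copy of the state as it was when the trace ran, so an offline experiment neither sees later changes nor writes to the live system.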

Speaker info:
- https://www.linkedin.com/in/philliphetzel/

Video: The maturity phases of running evals — Phil Hetzel, Braintrust (AI Engineer channel)

Duration: 00:18:34
