How Khan Academy Became a Leader in A/B Testing AI for Better Education

Most teams building with LLMs are flying blind. Khan Academy isn't — and Dr. Kelli Hill, Senior Director of Data Insights, explains how they got there.

Their breakthrough was defining what "good" actually means for an AI tutor. Khan Academy's team built a cognitive engagement metric grounded in decades of learning science research, got human experts to agree on a rubric, then scaled it with LLM-as-a-judge. Suddenly, every Khanmigo conversation became data they could learn from.

That metric is what turned vibes-based prompt testing into rigorous experimentation. Once you can measure quality, you can A/B test your way to a better tutor — one prompt tweak, one model swap, one latency optimization at a time.

In this session, Kelli walks through:

- The three-year journey from Slack-based vibes testing to production A/B tests on gen AI features
- How they built a cognitive engagement metric that actually correlates with learning outcomes
- Why LLM-as-a-judge only works if you do the hard human labeling work first
- Their responsible experimentation framework for running tests on kids in classrooms
- A real case study on reducing math agent latency without sacrificing accuracy

Key takeaway: You can't A/B test your way to a great AI product without a great metric. Khan Academy did the hard work to define one — and it's what makes every experiment after that possible.

Featuring: Dr. Kelli Hill, Senior Director of Data Insights at Khan Academy, and Luke Sonnet, Head of Experimentation at GrowthBook.
00:00 Introduction
00:31 Khan Academy Overview
02:39 History of Experimentation at Khan Academy
07:11 Theory of Action for Khanmigo
08:33 Challenges with Gen AI Evals
11:07 Phase 1: Basic Evals & Vibes Testing
14:12 Phase 2: Post Hoc Evals & LLM as Judge
17:49 Phase 3: A/B Testing in Production
21:20 Responsible Experimentation Framework
25:03 Case Study: Math Agent Latency
28:36 Key Takeaways
31:04 Q&A

Video: How Khan Academy Became a Leader in A/B Testing AI for Better Education, from the GrowthBook channel
