How Khan Academy Became a Leader in A/B Testing AI for Better Education
Most teams building with LLMs are flying blind. Khan Academy isn't — and Dr. Kelli Hill, Senior Director of Data Insights, explains how they got there.
Their breakthrough was defining what "good" actually means for an AI tutor. Khan Academy's team built a cognitive engagement metric grounded in decades of learning science research, got human experts to agree on a rubric, then scaled it with LLM-as-a-judge. Suddenly, every Khanmigo conversation became data they could learn from.
That metric is what turned vibes-based prompt testing into rigorous experimentation. Once you can measure quality, you can A/B test your way to a better tutor — one prompt tweak, one model swap, one latency optimization at a time.
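As a rough illustration of the validation step described above (checking an LLM judge against human expert labels before trusting it at scale), the sketch below computes Cohen's kappa between two sets of rubric labels. It is not Khan Academy's actual pipeline: the `high`/`low` engagement labels, the sample data, and the function name are all hypothetical, and kappa is only one common way to quantify agreement.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical rubric labels for eight tutoring conversations.
human = ["high", "high", "low", "low", "high", "low", "high", "low"]
judge = ["high", "high", "low", "high", "high", "low", "high", "low"]

kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f}")
```

A team would typically set its own agreement threshold before letting the judge label conversations at scale; the talk description does not state which statistic or threshold Khan Academy used.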
In this session, Kelli walks through:
- The three-year journey from Slack-based vibes testing to production A/B tests on gen AI features
- How they built a cognitive engagement metric that actually correlates with learning outcomes
- Why LLM-as-a-judge only works if you do the hard human labeling work first
- Their responsible experimentation framework for running tests on kids in classrooms
- A real case study on reducing math agent latency without sacrificing accuracy
Key takeaway: You can't A/B test your way to a great AI product without a great metric. Khan Academy did the hard work to define one — and it's what makes every experiment after that possible.
Featuring: Dr. Kelli Hill, Senior Director of Data Insights at Khan Academy and Luke Sonnet, Head of Experimentation at GrowthBook.
00:00 Introduction
00:31 Khan Academy Overview
02:39 History of Experimentation at Khan Academy
07:11 Theory of Action for Khanmigo
08:33 Challenges with Gen AI Evals
11:07 Phase 1: Basic Evals & Vibes Testing
14:12 Phase 2: Post Hoc Evals & LLM as Judge
17:49 Phase 3: A/B Testing in Production
21:20 Responsible Experimentation Framework
25:03 Case Study: Math Agent Latency
28:36 Key Takeaways
31:04 Q&A
Video "How Khan Academy Became a Leader in A/B Testing AI for Better Education" from the GrowthBook channel
Video information
April 17, 2026, 10:49:43
00:49:31