JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment
Russell Yang of Stanford Law dives into his paper "JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment" at the Snorkel AI Reading Group in San Francisco.
Read the paper: https://arxiv.org/abs/2605.25240
Subscribe to be notified of future Reading Group and other learning events: https://luma.com/snorkel-ai
Follow Russell on:
LinkedIn: / russell-yang
CHAPTERS
0:00 Introduction
1:40 Russell's intro & talk overview
1:58 Science vs. art: where does legal practice fall?
4:09 Two evaluation paradigms: rubrics vs. comparative judgment
5:25 Expert attorney annotators
6:37 BigLawBench & the JudgmentBench dataset
7:48 Constructing quality levels without ground truth
10:47 Key finding: comparative judgment wins
11:15 Two metrics: rank correlation & win rate
12:56 Results: 67% vs. 54% win rate, 0.908 vs. 0.150 Spearman's ρ
14:35 Implications for benchmarkers
15:02 Implications for law firms (buying tools & building tooling)
15:14 What's next: auto-rubrics from comparative judgment data
16:09 Q&A
Russell covers the motivation for comparing rubric and preference evaluation, why ground truth is especially contested in legal domains, and how JudgmentBench was constructed using 50+ practicing attorneys and Harvey's BigLawBench. He then walks through the two metrics used to measure quality recovery, the striking gap between comparative judgment and rubric-based scoring, and the implications for model benchmarking, law firm tooling decisions, and the future of expert data collection.
Video "JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment" from the Snorkel AI channel
Video information
June 18, 2026, 21:41:51
00:35:49