JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

Russell Yang of Stanford Law dives into his paper "JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment" at the Snorkel AI Reading Group in San Francisco.

Read the paper: https://arxiv.org/abs/2605.25240
Subscribe to be notified of future Reading Group and other learning events: https://luma.com/snorkel-ai

Follow Russell on:
LinkedIn: / russell-yang

CHAPTERS
0:00 Introduction
1:40 Russell's intro & talk overview
1:58 Science vs. art: where does legal practice fall?
4:09 Two evaluation paradigms: rubrics vs. comparative judgment
5:25 Expert attorney annotators
6:37 BigLawBench & the JudgmentBench dataset
7:48 Constructing quality levels without ground truth
10:47 Key finding: comparative judgment wins
11:15 Two metrics: rank correlation & win rate
12:56 Results: 67% vs. 54% win rate, 0.908 vs. 0.150 Spearman's ρ
14:35 Implications for benchmarkers
15:02 Implications for law firms (buying tools & building tooling)
15:14 What's next: auto-rubrics from comparative judgment data
16:09 Q&A

Russell covers the motivation for comparing rubric and preference evaluation, why ground truth is especially contested in legal domains, and how JudgmentBench was constructed using 50+ practicing attorneys and Harvey's BigLawBench. He then walks through the two metrics used to measure quality recovery (rank correlation and win rate), the striking gap between comparative judgment and rubric-based scoring, and the implications for model benchmarking, law firm tooling decisions, and the future of expert data collection.
