Scaling Medical Evaluation of LLM Summaries: From PDSQI-9 to LLM-as-a-Judge
Electronic Health Records (EHRs) contain vast amounts of clinical data, yet providers often struggle to distill this information into clear and actionable insights. Large Language Models (LLMs) now offer the promise of automated summarization to reduce cognitive load, but these outputs must be shown to be accurate, safe, and reliable before they are used clinically. In collaboration with Epic, our team developed and validated the Provider Documentation Summarization Quality Instrument (PDSQI-9), a structured rubric for expert medical evaluation of LLM-generated summaries.
While human experts remain the gold standard for evaluation, expert review is resource-intensive and difficult to scale across real-world settings. To address this challenge, we introduce LLM-as-a-Judge, an automated evaluation framework that benchmarks directly against PDSQI-9. Our results demonstrate that LLMs can achieve high inter-rater reliability with human evaluators while completing evaluations in seconds, enabling rapid, scalable quality assurance of AI outputs.
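As an illustration of what agreement between an automated judge and a human rater can look like in code, the sketch below computes a quadratic weighted Cohen's kappa for two sets of ordinal scores. This is not taken from the talk: the description does not say which reliability statistic was used, so the metric, the 1 to 5 scale, the function name, and the example scores are all assumptions made for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_score=1, max_score=5):
    """Agreement between two raters on an ordinal scale (1.0 = perfect).

    Illustrative only: the scale bounds and the choice of statistic are
    assumptions, not details confirmed by the talk description.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty score lists")
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")

    n = len(rater_a)
    span = max_score - min_score

    # Observed joint counts and each rater's marginal counts.
    observed = Counter(zip(rater_a, rater_b))
    hist_a = Counter(rater_a)
    hist_b = Counter(rater_b)

    weighted_observed = 0.0  # disagreement actually seen
    weighted_expected = 0.0  # disagreement expected by chance
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = ((i - j) ** 2) / (span ** 2)
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0:
        # Both raters gave one identical score to every item.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores for one rubric attribute across six summaries.
human_scores = [5, 4, 4, 2, 3, 5]
llm_scores = [5, 4, 3, 2, 3, 4]
kappa = quadratic_weighted_kappa(human_scores, llm_scores)
print(f"quadratic weighted kappa: {kappa:.3f}")
```

A weighted statistic is used here because near misses (a 4 against a 5) are penalized less than large disagreements (a 1 against a 5), which suits ordinal rubric scores. Other statistics, such as an intraclass correlation coefficient, would be equally reasonable choices.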
Speakers:
Brian Patterson, MD, MPH
Majid Afshar, MD, MS
Emma Croxford
Video "Scaling Medical Evaluation of LLM Summaries: From PDSQI-9 to LLM-as-a-Judge" from the Health AI Partnership channel
Video information
December 19, 2025, 3:15:07
00:41:14