Scaling Medical Evaluation of LLM Summaries: From PDSQI-9 to LLM-as-a-Judge

Electronic Health Records (EHRs) contain vast amounts of clinical data, yet providers often struggle to distill this information into clear and actionable insights. Large Language Models (LLMs) now offer the promise of automated summarization to reduce cognitive load, but ensuring the accuracy, safety, and reliability of these outputs is essential for clinical use. In collaboration with Epic, our team developed and validated the Provider Documentation Summarization Quality Instrument (PDSQI-9) – a structured rubric for expert medical evaluation of LLM-generated summaries.

While human experts remain the gold standard for evaluation, expert review is resource-intensive and difficult to scale across real-world settings. To address this challenge, we introduce LLM-as-a-Judge, an automated evaluation framework that benchmarks directly against PDSQI-9. Our results demonstrate that LLMs can achieve high inter-rater reliability with human evaluators while completing evaluations in seconds, enabling rapid, scalable quality assurance of AI outputs.

Speakers:
Brian Patterson, MD, MPH
Majid Afshar, MD, MS
Emma Croxford

Video: Scaling Medical Evaluation of LLM Summaries: From PDSQI-9 to LLM-as-a-Judge, from the Health AI Partnership channel