Scaling Medical Evaluation of LLM Summaries: From PDSQI-9 to LLM-as-a-Judge

Electronic Health Records (EHRs) contain vast amounts of clinical data, yet providers often struggle to distill this information into clear, actionable insights. Large Language Models (LLMs) promise automated summarization that reduces cognitive load, but the accuracy, safety, and reliability of their outputs must be established before clinical use. In collaboration with Epic, our team developed and validated the Provider Documentation Summarization Quality Instrument (PDSQI-9), a structured rubric for expert medical evaluation of LLM-generated summaries.

While human experts remain the gold standard for evaluation, expert review is resource-intensive and difficult to scale across real-world settings. To address this challenge, we introduce LLM-as-a-Judge, an automated evaluation framework benchmarked directly against PDSQI-9. Our results demonstrate that LLMs can achieve high inter-rater reliability with human evaluators while completing evaluations in seconds, enabling rapid, scalable quality assurance of AI outputs.
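To make "inter-rater reliability" concrete, here is a minimal sketch of one common agreement statistic, quadratic weighted Cohen's kappa, computed between a human rater's scores and an LLM judge's scores. The function name, the 1 to 5 score range, and the sample scores are illustrative assumptions; this is not necessarily the statistic or the scale reported in the talk.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_score=1, max_score=5):
    """Agreement between two raters on an ordinal scale.

    Hypothetical helper for illustration. The 1-5 default range is an
    assumption, not a detail taken from the talk.
    Returns 1.0 for perfect agreement, about 0.0 for chance-level agreement,
    and negative values for systematic disagreement.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    if any(not (min_score <= s <= max_score) for s in list(rater_a) + list(rater_b)):
        raise ValueError("score outside the declared range")

    n = len(rater_a)
    span = max_score - min_score
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)

    # Observed disagreement: mean squared distance between paired scores,
    # normalized so the largest possible distance counts as 1.
    observed = sum(((a - b) / span) ** 2 for a, b in zip(rater_a, rater_b)) / n

    # Expected disagreement if the two raters scored independently,
    # each keeping their own score distribution.
    expected = sum(
        ((i - j) / span) ** 2 * count_a[i] * count_b[j]
        for i in range(min_score, max_score + 1)
        for j in range(min_score, max_score + 1)
    ) / (n * n)

    if expected == 0:
        # Both raters used the same single score throughout; kappa is undefined.
        return float("nan")
    return 1.0 - observed / expected


# Made-up scores for five summaries (not data from the study).
human_scores = [4, 5, 3, 2, 4]
llm_scores = [4, 4, 3, 2, 5]
print(quadratic_weighted_kappa(human_scores, llm_scores))
```

Quadratic weights penalize a 1-versus-5 disagreement far more than a 4-versus-5 one, which suits ordinal rubric scores better than exact-match agreement does.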

Speakers:
Brian Patterson, MD, MPH
Majid Afshar, MD, MS
Emma Croxford

Video information

December 19, 2025, 3:15:07

00:41:14

Health AI Partnership

Other videos from this channel

November Health AI Hub: AI Enablement at Mayo Clinic

What Patients Think About AI in Health Care: Insights from California Focus Groups

Building Transparency: Artificial Intelligence Model Cards Inventory

Real-World AI Series: Precision Identification of Cardiac Amyloidosis in Diverse Population

AI-guided screening for cardiomyopathies in an obstetric population: a pragmatic RCT

Health AI Rights Developed By Patients For Patients

Beyond the Hype: Early AI Governance and Implementation Lessons in Community Health

Real-World AI Series: Generalizing An AI/ML Model for Pediatric Asthma Care

LLMs, Bias, and the Implications for Equitable Healthcare

From a Patchwork to a Quilt: Emerging Federal and State AI Policy Trends

Tangled In The Web: Patient Safety & Privacy Concerns

Leveraging Artificial Intelligence in the FQHC Setting: Lessons Learned in Early Implementation

Red-teaming AI Systems in Healthcare

Current Use and Evaluation of Predictive Models in US Hospitals

Lessons Learned from AI-Enabled Diabetic Retinopathy Screening at San Ysidro Health

Therabot: The First Randomized Controlled Trial of a Generative AI for Psychotherapy

LLM monitoring via the Impact Monitoring Platform for AI in Clinical Care (IMPACC) program

Prediction model to improve PrEP prescribing: lessons learned from implementation.

The Little Clinic That Could: An AI Scribe Adventure Story

Harnessing AI in a Safety Net Hospital: Challenges, Opportunities, and Responsible Implementation