QUICK REVIEW

[Paper Review] Towards Accurate Differential Diagnosis with Large Language Models

Daniel McDuff, Mike Schaekermann|arXiv (Cornell University)|Nov 30, 2023

Machine Learning in Healthcare | Cited by 62

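One-Sentence Summary

On NEJM CPC cases, an LLM optimized for differential diagnosis (DDx) outperformed baseline clinicians and GPT-4, both standalone and as an assistive tool, in top-1 and top-10 DDx accuracy as well as in DDx quality metrics.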

ABSTRACT

An accurate differential diagnosis (DDx) is a cornerstone of medical care, often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of this process. In this study, we introduce an LLM optimized for diagnostic reasoning, and evaluate its ability to generate a DDx alone or as an aid to clinicians. 20 clinicians evaluated 302 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or LLM assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools. Our LLM for DDx exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs 33.6%, [p = 0.04]). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by our LLM (top-10 accuracy 51.7%) compared to clinicians without its assistance (36.1%) (McNemar's Test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p = 0.03). Further, clinicians assisted by our LLM arrived at more comprehensive differential lists than those without its assistance. Our study suggests that our LLM for DDx has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients' access to specialist-level expertise.

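Motivation and Objectives

Enable interactive AI to improve DDx within clinical workflows.
Develop a specialized LLM for diagnostic reasoning, trained on medical data.
Evaluate standalone DDx performance against clinician-generated DDx.
Evaluate LLM-assisted DDx generation against conventional search-based assistance.
Explore clinicians' qualitative views on safety, utility, and educational potential.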

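Proposed Method

Fine-tune a PaLM 2-based LLM on medical question answering, medical dialogue, and electronic health record note summarization, to enable long-context reasoning.
Use NEJM CPC case reports (302 cases) to evaluate DDx generation under three conditions: i) the standalone LLM, ii) LLM-assisted clinician DDx, iii) search-only clinician DDx.
Run a two-stage reader study with randomized assignment to conditions and blinded specialist evaluation of DDx quality.
Quantitatively evaluate DDx lists by top-N accuracy, alongside qualitative/structured quality measures (the Bond et al. differential score, appropriateness, comprehensiveness).
Run an automated evaluation in which Med-PaLM 2 compares the diagnoses predicted in each DDx list against the ground-truth diagnosis.
Conduct semi-structured interviews with clinicians to capture perceptions and usage scenarios.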
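
The top-N accuracy metric named above can be illustrated with a minimal Python sketch. It assumes that a rater (human or automated) has already recorded, for each case, the 1-based position of the first DDx entry judged to match the final diagnosis, or None when nothing matched. The function name `top_n_accuracy` and the sample ranks are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def top_n_accuracy(match_ranks: Sequence[Optional[int]], n: int) -> float:
    """Fraction of cases whose final diagnosis appears in the first n DDx entries.

    match_ranks[i] is the 1-based rank of the first matching entry for case i,
    or None if no entry in that case's DDx list matched (an assumed encoding).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not match_ranks:
        raise ValueError("at least one case is required")
    hits = sum(1 for rank in match_ranks if rank is not None and rank <= n)
    return hits / len(match_ranks)


# Hypothetical ranks for five cases; not data from the study.
example_ranks = [1, 3, None, 10, 12]
top1 = top_n_accuracy(example_ranks, 1)
top10 = top_n_accuracy(example_ranks, 10)
print(f"top-1: {top1:.1%}, top-10: {top10:.1%}")
```

Keeping the matching judgment separate from the metric means the same function can score the standalone LLM, assisted clinicians, and unassisted clinicians, as long as all three are rated under the same matching rule.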

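Experimental Results

Research Questions

RQ1: Can a medical-domain LLM generate accurate differential diagnoses on challenging real-world cases?
RQ2: Does LLM assistance improve the quality, comprehensiveness, and agreement with the final diagnosis of clinicians' DDx, compared with conventional search tools?
RQ3: How does the LLM perform against GPT-4 under automated evaluation on the same DDx benchmark?
RQ4: What are clinicians' views on the safety, utility, and potential role of LLMs in differential diagnosis?
RQ5: What practical considerations arise when integrating LLM-based DDx tools into clinical education and care delivery?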

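Key Findings

Standalone, the LLM achieved 59.1% top-10 DDx accuracy on the 302 NEJM CPC cases, exceeding the 33.6% of unassisted clinicians.
In the assisted arms, clinicians with LLM assistance reached 51.7% top-10 accuracy, versus 36.1% for clinicians without its assistance (McNemar's test, p < 0.01) and 44.4% for clinicians assisted by search (p = 0.03).
The LLM's DDx lists scored higher on quality (median 5) and were also more comprehensive and appropriate than those of unassisted clinicians, with significant differences (p < 0.01 to p < 0.001).
The LLM-assisted condition produced longer, more comprehensive DDx lists (median length 8) than the unassisted (6) and search-assisted (7) conditions.
Qualitative interviews showed that clinicians saw educational value and potential to widen access to specialist-level reasoning, while noting the risk of inaccuracies and the need for human oversight.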
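
McNemar's test, cited in the findings above, compares two conditions on the same cases by looking only at the discordant pairs, that is, cases where one condition was correct and the other was not. The sketch below is a minimal Python illustration under stated assumptions: per-case correctness is given as paired booleans, the p-value uses the chi-square approximation with one degree of freedom, and the continuity correction is optional because the summary does not say which variant the authors used. The function name `mcnemar_test` and the example data are hypothetical and do not reproduce the study's numbers.

```python
import math
from typing import Sequence, Tuple


def mcnemar_test(
    cond_a: Sequence[bool], cond_b: Sequence[bool], correction: bool = False
) -> Tuple[float, float]:
    """Return (statistic, p_value) for paired binary outcomes.

    cond_a[i] and cond_b[i] say whether case i was correct under each condition.
    The p-value is the chi-square (1 df) upper tail: erfc(sqrt(statistic / 2)).
    """
    if len(cond_a) != len(cond_b):
        raise ValueError("both conditions must cover the same cases")
    # Discordant pairs: correct under exactly one of the two conditions.
    only_a = sum(1 for a, b in zip(cond_a, cond_b) if a and not b)
    only_b = sum(1 for a, b in zip(cond_a, cond_b) if b and not a)
    discordant = only_a + only_b
    if discordant == 0:
        return 0.0, 1.0  # no disagreement, so no evidence of a difference
    diff = abs(only_a - only_b)
    if correction:
        diff = max(diff - 1, 0)  # optional continuity correction
    statistic = diff ** 2 / discordant
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical paired outcomes for 100 cases; not data from the study.
assisted = [True] * 30 + [False] * 10 + [True] * 40 + [False] * 20
baseline = [False] * 30 + [True] * 10 + [True] * 40 + [False] * 20
stat, p = mcnemar_test(assisted, baseline)
print(f"statistic = {stat:.2f}, p = {p:.4f}")
```

Cases where both conditions agree (both correct or both wrong) do not affect the statistic, which is why the test suits a design in which the same case is read under two conditions.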

This review was generated by AI and reviewed by a human editor.