QUICK REVIEW

[Paper Review] Superhuman performance of a large language model on the reasoning tasks of a physician

Peter G. Brodeur, Thomas A. Buckley|arXiv (Cornell University)|Dec 14, 2024

Clinical Reasoning and Diagnostic Skills | Cited by 23

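One-Sentence Summary

This paper evaluates a large language model on challenging medical reasoning tasks and on second opinions in the emergency room. Compared with physicians, the model shows superhuman performance across multiple diagnostic and management reasoning tasks.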

ABSTRACT

A seminal paper published by Ledley and Lusted in 1959 introduced complex clinical diagnostic reasoning cases as the gold standard for the evaluation of expert medical computing systems, a standard that has held ever since. Here, we report the results of a physician evaluation of a large language model (LLM) on challenging clinical cases against a baseline of hundreds of physicians. We conduct five experiments to measure clinical reasoning across differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning, all adjudicated by physician experts with validated psychometrics. We then report a real-world study comparing human expert and AI second opinions in randomly-selected patients in the emergency room of a major tertiary academic medical center in Boston, MA. We compared LLMs and board-certified physicians at three predefined diagnostic touchpoints: triage in the emergency room, initial evaluation by a physician, and admission to the hospital or intensive care unit. In all experiments--both vignettes and emergency room second opinions--the LLM displayed superhuman diagnostic and reasoning abilities, as well as continued improvement from prior generations of AI clinical decision support. Our study suggests that LLMs have achieved superhuman performance on general medical diagnostic and management reasoning, fulfilling the vision put forth by Ledley and Lusted, and motivating the urgent need for prospective trials.

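Research Motivation and Objectives

Evaluate the LLM's ability in differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning.
Compare the LLM's performance on clinical cases against a baseline of hundreds of physicians, scored with validated psychometric instruments.
Assess real-world applicability through an emergency department study that compares AI and human expert second opinions at key diagnostic touchpoints.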

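Proposed Method

Conduct five experiments that evaluate core clinical reasoning tasks against physician baselines.
Have physician experts adjudicate the results using validated psychometric instruments.
Run a real-world emergency department study comparing AI and physician second opinions at three touchpoints: triage, initial evaluation, and admission to the hospital or intensive care unit.
Use the large language model to generate differential diagnoses and diagnostic reasoning on controlled clinical vignettes.
Analyze the agreement between LLM outputs and standard clinical reasoning processes.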

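Experimental Results

Research Questions

RQ1: Can a large language model generate high-quality differential diagnoses for challenging clinical cases?
RQ2: Compared with physicians, how does the LLM display and justify its diagnostic reasoning?
RQ3: Does the LLM improve probabilistic reasoning and management reasoning in clinical scenarios?
RQ4: In the emergency department, are AI second opinions at least as accurate as human second opinions at the predefined touchpoints?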

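Key Findings

The LLM displayed superhuman diagnostic and reasoning abilities in the vignette-based evaluations.
The LLM showed continued improvement over prior generations of AI on clinical decision support tasks.
In the real-world emergency department setting, AI second opinions matched or exceeded the physician baseline at triage, initial evaluation, and admission.
Across the five experiments, the LLM outperformed physicians on core reasoning tasks adjudicated by experts.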
The study motivates the urgent need for prospective trials of LLMs in medical decision-making.

This review was generated by AI and checked by a human editor.