QUICK REVIEW

[Paper Review] Superhuman performance of a large language model on the reasoning tasks of a physician

Peter G. Brodeur, Thomas A. Buckley|arXiv (Cornell University)|Dec 14, 2024

Clinical Reasoning and Diagnostic Skills|Cited by 23

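One-Line Summary

This paper evaluates a large language model (LLM) on difficult medical reasoning tasks and on emergency room (ER) second opinions, and reports superhuman performance relative to physicians across multiple diagnostic and management reasoning tasks.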

ABSTRACT

A seminal paper published by Ledley and Lusted in 1959 introduced complex clinical diagnostic reasoning cases as the gold standard for the evaluation of expert medical computing systems, a standard that has held ever since. Here, we report the results of a physician evaluation of a large language model (LLM) on challenging clinical cases against a baseline of hundreds of physicians. We conduct five experiments to measure clinical reasoning across differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning, all adjudicated by physician experts with validated psychometrics. We then report a real-world study comparing human expert and AI second opinions in randomly-selected patients in the emergency room of a major tertiary academic medical center in Boston, MA. We compared LLMs and board-certified physicians at three predefined diagnostic touchpoints: triage in the emergency room, initial evaluation by a physician, and admission to the hospital or intensive care unit. In all experiments--both vignettes and emergency room second opinions--the LLM displayed superhuman diagnostic and reasoning abilities, as well as continued improvement from prior generations of AI clinical decision support. Our study suggests that LLMs have achieved superhuman performance on general medical diagnostic and management reasoning, fulfilling the vision put forth by Ledley and Lusted, and motivating the urgent need for prospective trials.

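Motivation and Objectives

Evaluate the LLM's ability in differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning.
Compare the LLM's performance on clinical vignettes against a baseline of hundreds of physicians, using validated psychometrics.
Assess real-world applicability in the emergency department by comparing AI second opinions with those of human experts at key diagnostic touchpoints.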

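Proposed Method

Conduct five experiments that evaluate core clinical reasoning tasks against physician benchmarks.
Have physician experts adjudicate the results using validated psychometrics.
Conduct a real-world ER study comparing AI and physician second opinions at triage, initial evaluation, and the admission decision.
Use a large language model to generate differential diagnoses and diagnostic reasoning on controlled vignettes.
Analyze how well the LLM's outputs align with standard clinical reasoning processes.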

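Experimental Results

Research Questions

RQ1: Can a large language model generate high-quality differential diagnoses for difficult clinical cases?
RQ2: How does the LLM present and justify its diagnostic reasoning compared with physicians?
RQ3: Does the LLM improve probabilistic reasoning and management reasoning in clinical scenarios?
RQ4: Are AI second opinions in the emergency department at least as accurate as human second opinions across the predefined touchpoints?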

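Key Findings

The LLM showed superhuman diagnostic and reasoning abilities in the vignette-based evaluations.
The LLM showed continued improvement over prior generations of AI on clinical decision support tasks.
In the real-world ER setting, AI second opinions at triage, initial evaluation, and the admission decision matched or exceeded the physician benchmark.
Across the five experiments, the LLM outperformed physicians on the core reasoning tasks, as rated by experts.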
The authors conclude that these results motivate an urgent need for prospective trials of LLMs in medical decision-making.

This review was generated by AI and checked by a human editor.