QUICK REVIEW

[Paper Review] Superhuman performance of a large language model on the reasoning tasks of a physician

Peter G. Brodeur, Thomas A. Buckley|arXiv (Cornell University)|Dec 14, 2024

Clinical Reasoning and Diagnostic Skills|23 citations

TL;DR

The paper evaluates a large language model on challenging clinical reasoning vignettes and on second opinions for real emergency room patients, and reports that it outperformed physicians across diagnostic and management reasoning tasks.

ABSTRACT

A seminal paper published by Ledley and Lusted in 1959 introduced complex clinical diagnostic reasoning cases as the gold standard for the evaluation of expert medical computing systems, a standard that has held ever since. Here, we report the results of a physician evaluation of a large language model (LLM) on challenging clinical cases against a baseline of hundreds of physicians. We conduct five experiments to measure clinical reasoning across differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning, all adjudicated by physician experts with validated psychometrics. We then report a real-world study comparing human expert and AI second opinions in randomly-selected patients in the emergency room of a major tertiary academic medical center in Boston, MA. We compared LLMs and board-certified physicians at three predefined diagnostic touchpoints: triage in the emergency room, initial evaluation by a physician, and admission to the hospital or intensive care unit. In all experiments--both vignettes and emergency room second opinions--the LLM displayed superhuman diagnostic and reasoning abilities, as well as continued improvement from prior generations of AI clinical decision support. Our study suggests that LLMs have achieved superhuman performance on general medical diagnostic and management reasoning, fulfilling the vision put forth by Ledley and Lusted, and motivating the urgent need for prospective trials.

Motivation & Objective

Assess the LLM's capability in differential diagnosis generation, diagnostic reasoning display, triage differential diagnosis, probabilistic reasoning, and management reasoning.
Compare LLM performance to hundreds of physicians using validated psychometrics on clinical vignettes.
Evaluate real-world applicability through an emergency department study comparing AI second opinions with human experts at key diagnostic touchpoints.

Proposed method

Conduct five experiments assessing core clinical reasoning tasks against physician benchmarks.
Adjudicate outcomes with physician experts and validated psychometrics.
Perform a real-world ER study comparing AI and physician second opinions at triage, initial evaluation, and admission decisions.
Utilize a large language model to generate differential diagnoses and diagnostic reasoning under controlled vignettes.
Analyze alignment between LLM outputs and standard clinical reasoning processes.

Experimental results

Research questions

RQ1: Can a large language model generate high-quality differential diagnoses for challenging clinical cases?
RQ2: How does the LLM display and justify diagnostic reasoning compared with physicians?
RQ3: Does the LLM improve probabilistic and management reasoning in clinical scenarios?
RQ4: Are AI second opinions in an emergency department at least as accurate as human second opinions across predefined touchpoints?

Key findings

The LLM demonstrated superhuman diagnostic and reasoning abilities in vignette-based evaluations.
The LLM showed continued improvement over prior AI generations in clinical decision support tasks.
In a real-world ER setting, AI second opinions at triage, initial evaluation, and admission decisions matched or exceeded physician benchmarks.
Across five experiments, the LLM outperformed physicians on core reasoning tasks adjudicated by experts.
The authors conclude that these results motivate an urgent need for prospective trials of LLMs in medical decision-making.


This review was created by AI and reviewed by human editors.