[Paper Review] Clinical knowledge in LLMs does not translate to human interactions
Tested alone, LLMs perform well on medical tasks, but members of the public who use them do no better than a control group relying on sources of their own choice. Benchmark and simulated-interaction tests fail to predict these real-world failures.
Global healthcare providers are exploring use of large language models (LLMs) to provide medical advice to the public. LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings. We tested if LLMs can assist members of the public in identifying underlying conditions and choosing a course of action (disposition) in ten medical scenarios in a controlled study with 1,298 participants. Participants were randomly assigned to receive assistance from an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control). Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in less than 34.5% of cases and disposition in less than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice. Standard benchmarks for medical knowledge and simulated patient interactions do not predict the failures we find with human participants. Moving forward, we recommend systematic human user testing to evaluate interactive capabilities prior to public deployments in healthcare.
Research Motivation and Objectives
- Assess whether LLMs can help the general public identify potential medical conditions and dispositions across realistic scenarios.
- Compare three LLMs (GPT-4o, Llama 3, Command R+) and a control condition in a large randomized trial.
- Analyze how human-LLM interactions affect diagnostic and disposition accuracy.
- Evaluate whether standard medical benchmarks and simulated interactions predict real-world performance.
Proposed Method
- Randomized controlled trial with 1,298 UK participants assigned to four arms across ten medical scenarios.
- Three treatment arms used an LLM (GPT-4o, Llama 3, Command R+) to assist in identifying conditions and dispositions; control used participants’ usual methods.
- Scenarios were drafted by doctors who reached unanimous consensus on the correct disposition; gold-standard condition lists were created by additional physicians.
- Participants produced dispositions on a five-point scale and listed relevant conditions; model prompts and human interactions were analyzed to identify transmission failures.
Experimental Results
Research Questions
- RQ1: Can members of the public accurately identify relevant medical conditions and disposition decisions when assisted by LLMs?
- RQ2: Do LLMs improve disposition accuracy or condition identification compared with traditional at-home resources?
- RQ3: Do standard medical knowledge benchmarks predict performance in interactive, real-world settings?
- RQ4: Do simulated patient interactions reflect human-LLM performance and provide scalable benchmarking?
Key Findings
- LLMs alone identified relevant conditions in 94.9% (GPT-4o), 99.2% (Llama 3), and 90.8% (Command R+) of cases; disposition accuracy was 64.7%, 48.8%, and 55.5% respectively.
- Participants using any LLM identified relevant conditions in at most 34.5% of cases and the correct disposition in at most 44.2%, no better than the control group.
- Interactions between users and LLMs showed transmission failures: LLMs suggested relevant conditions in 65.7–73.2% of conversations, but users often omitted information or did not act on it.
- Benchmarks (MedQA) often overestimate interactive performance: in 26 of 30 cases, model accuracy on QA was higher than human-LLM interaction accuracy. Simulated participants also poorly reflected human variability.
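The gap between standalone and human-LLM performance can be checked with simple arithmetic on the figures listed above. The following is a minimal sketch, not the authors' analysis code: the dictionary layout and the helper names `mean_accuracy` and `min_gap` are assumptions made for illustration, and only the percentages come from the review. Note that an unweighted mean of the three per-model condition scores comes out near 95.0%, slightly above the 94.9% average quoted in the abstract; the difference may reflect rounding or a different pooling of cases.

```python
# Standalone accuracy per model, in percent, as listed in the findings above.
standalone = {
    "GPT-4o": {"condition": 94.9, "disposition": 64.7},
    "Llama 3": {"condition": 99.2, "disposition": 48.8},
    "Command R+": {"condition": 90.8, "disposition": 55.5},
}

# Upper bounds reported for participants using any LLM, in percent.
with_users_max = {"condition": 34.5, "disposition": 44.2}


def mean_accuracy(task):
    """Unweighted mean of the standalone scores across the three models."""
    values = [scores[task] for scores in standalone.values()]
    return sum(values) / len(values)


def min_gap(task):
    """Smallest possible gap (percentage points) between the standalone
    mean and the best reported human-LLM result for the same task."""
    return mean_accuracy(task) - with_users_max[task]


if __name__ == "__main__":
    for task in ("condition", "disposition"):
        print(f"{task}: standalone mean {mean_accuracy(task):.1f}%, "
              f"gap to users at least {min_gap(task):.1f} points")
```

Under these assumptions, the condition-identification gap is at least about 60 percentage points, while the disposition gap is much smaller, at roughly 12 points.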
This review was generated by AI and checked by a human editor.