QUICK REVIEW

[Paper Review] Large Language Models Perform Diagnostic Reasoning

Cheng-Kuang Wu, Weilin Chen | arXiv (Cornell University) | Jul 18, 2023

Topic Modeling | Cited by 9

One-Sentence Summary

DR-CoT prompting improves the accuracy of LLM-based automatic diagnosis by about 15% over standard prompting, and the gap widens to 18% in out-domain settings.

ABSTRACT

We explore the extension of chain-of-thought (CoT) prompting to medical reasoning for the task of automatic diagnosis. Motivated by doctors' underlying reasoning process, we present Diagnostic-Reasoning CoT (DR-CoT). Empirical results demonstrate that by simply prompting large language models trained only on general text corpus with two DR-CoT exemplars, the diagnostic accuracy improves by 15% comparing to standard prompting. Moreover, the gap reaches a pronounced 18% in out-domain settings. Our findings suggest expert-knowledge reasoning in large language models can be elicited through proper promptings.

Research Motivation and Objectives

Extend chain-of-thought prompting to medical reasoning for automatic diagnosis.
Propose Diagnostic-Reasoning CoT (DR-CoT) to elicit expert-like reasoning in LLMs.
Develop a few-shot LLM-based dialogue system for automatic diagnosis.
Introduce a language-model-role-playing evaluation framework to simulate patient-doctor interactions.

Proposed Method

Prompt LLMs with a two-shot DR-CoT template to guide evidence gathering and differential diagnosis generation.
Augment the instruction to summarize evidence and formulate a differential diagnosis before formulating the next question.
Replace standard prompts with a DR-CoT-driven prompt that ties evidence to a ranked differential and next query.
Use a non-pipelined, few-shot dialogue setup where the model generates questions and a final diagnosis.
Evaluate using a language-model-role-playing framework where the LLM acts as both doctor and patient in self-chat dialogues.
Conduct experiments on the DDXPlus dataset with in-domain and out-domain splits.

Figure 3: The initial prompt includes the instruction $I$, the shots $S$, and the input $D$. The question $q_{i}$ generated by the prompted model (i.e., the DSAD, the dialogue system for automatic diagnosis) and the answer $a_{i}$ from the patient bot are shown in the remaining text in black.
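The prompt layout described in the caption can be sketched as follows. This is a minimal illustration, not the authors' code: the names `Dialogue` and `build_prompt`, the "Doctor"/"Patient" speaker labels, and the instruction and exemplar strings are all placeholders assumed for the example. Only the ordering (instruction, shots, input, then alternating questions and answers) comes from the caption.

```python
from dataclasses import dataclass, field


@dataclass
class Dialogue:
    """Running state of one diagnostic conversation (hypothetical structure)."""
    initial_evidence: str  # the input D
    turns: list[tuple[str, str]] = field(default_factory=list)  # (q_i, a_i) pairs


def build_prompt(instruction: str, shots: list[str], dialogue: Dialogue) -> str:
    """Concatenate I, S, D and the question/answer history into one prompt."""
    parts = [instruction, *shots, f"Patient: {dialogue.initial_evidence}"]
    for question, answer in dialogue.turns:
        parts.append(f"Doctor: {question}")
        parts.append(f"Patient: {answer}")
    # Leave the doctor's turn open so the model generates the next question.
    parts.append("Doctor:")
    return "\n\n".join(parts)


# Placeholder wording: NOT the paper's actual instruction or exemplars.
INSTRUCTION = (
    "Summarize the evidence so far, list a ranked differential diagnosis, "
    "then ask the next question."
)
SHOTS = ["<DR-CoT exemplar dialogue 1>", "<DR-CoT exemplar dialogue 2>"]

dialogue = Dialogue(initial_evidence="I have had a fever since yesterday.")
dialogue.turns.append(("Do you have a cough?", "Yes, a dry cough."))
prompt = build_prompt(INSTRUCTION, SHOTS, dialogue)
print(prompt)
```

Rebuilding the full prompt on every turn keeps the sketch stateless apart from the `Dialogue` object; a real system might instead append to a chat-style message list.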

Experimental Results

Research Questions

RQ1: Can DR-CoT prompting improve the diagnostic accuracy of LLM-based automatic diagnosis compared to standard prompting?
RQ2: Does DR-CoT generalize to out-domain initial evidence beyond the exemplars?
RQ3: Does the DR-CoT approach lead to more informative questioning that supports correct diagnoses?
RQ4: Is a language-model-role-playing evaluation framework a viable proxy for realistic DSAD assessment?

Key Findings

DR-CoT prompting yields a 15% improvement in diagnostic accuracy over standard prompting.
The accuracy improvement with DR-CoT is 18% in out-domain settings.
Two-shot exemplars with DR-CoT significantly enhance convergence speed and diagnostic performance.
A physician-evaluated human study supports that DR-CoT prompts help the model ask more critical questions.
The evaluation framework using role-playing between doctor and patient enables automated, end-to-end assessment.
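A role-playing evaluation of this kind can be sketched as a loop in which a doctor model and a patient model take turns until a diagnosis is produced, with accuracy computed over a set of cases. Everything below is an assumption made for illustration: the function names, the `DIAGNOSIS:` marker, the turn limit, the case fields, and the toy rule-based stand-ins for the two LLM roles. None of it is taken from the paper, and the toy symptom-to-disease rule carries no medical meaning.

```python
from typing import Callable, Optional

History = list[tuple[str, str]]               # (question, answer) pairs
DoctorFn = Callable[[str, History], str]      # stand-in for the doctor LLM
PatientFn = Callable[[dict, str], str]        # stand-in for the patient LLM

FINAL_PREFIX = "DIAGNOSIS:"  # hypothetical marker for the final answer


def run_self_chat(case: dict, doctor: DoctorFn, patient: PatientFn,
                  max_turns: int = 5) -> Optional[str]:
    """Alternate doctor questions and patient answers until a diagnosis."""
    history: History = []
    for _ in range(max_turns):
        reply = doctor(case["initial_evidence"], history)
        if reply.startswith(FINAL_PREFIX):
            return reply[len(FINAL_PREFIX):].strip()
        history.append((reply, patient(case, reply)))
    return None  # no diagnosis within the turn budget


def accuracy(cases: list[dict], doctor: DoctorFn, patient: PatientFn) -> float:
    """Fraction of cases whose predicted diagnosis matches the label."""
    correct = sum(run_self_chat(c, doctor, patient) == c["diagnosis"] for c in cases)
    return correct / len(cases)


def toy_doctor(initial_evidence: str, history: History) -> str:
    # Toy rule: ask one question, then decide from the answer.
    if not history:
        return "Do you have a cough?"
    _, answer = history[-1]
    return f"{FINAL_PREFIX} " + ("Pneumonia" if answer == "yes" else "Influenza")


def toy_patient(case: dict, question: str) -> str:
    # Toy rule: answer "yes" only if the asked symptom is in the profile.
    return "yes" if "cough" in question and "cough" in case["symptoms"] else "no"


cases = [
    {"initial_evidence": "fever", "symptoms": {"fever", "cough"}, "diagnosis": "Pneumonia"},
    {"initial_evidence": "fever", "symptoms": {"fever"}, "diagnosis": "Influenza"},
    {"initial_evidence": "fever", "symptoms": {"fever"}, "diagnosis": "Bronchitis"},
]
acc = accuracy(cases, toy_doctor, toy_patient)
print(f"accuracy = {acc:.2f}")
```

Passing the two roles in as callables means the same loop could compare a standard prompt against a DR-CoT prompt by swapping only the doctor function.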


This review was generated by AI and reviewed by a human editor.