QUICK REVIEW

[Paper Review] Large Language Models Perform Diagnostic Reasoning

Cheng-Kuang Wu, Weilin Chen|arXiv (Cornell University)|Jul 18, 2023
Topic Modeling | Cited by 9
One-Sentence Summary

DR-CoT prompting improves diagnostic accuracy of LLMs for automatic diagnosis by about 15% over standard prompting, with an 18% gain in out-domain settings.

ABSTRACT

We explore the extension of chain-of-thought (CoT) prompting to medical reasoning for the task of automatic diagnosis. Motivated by doctors' underlying reasoning process, we present Diagnostic-Reasoning CoT (DR-CoT). Empirical results demonstrate that by simply prompting large language models trained only on general text corpus with two DR-CoT exemplars, the diagnostic accuracy improves by 15% comparing to standard prompting. Moreover, the gap reaches a pronounced 18% in out-domain settings. Our findings suggest expert-knowledge reasoning in large language models can be elicited through proper promptings.

Motivation and Objectives

  • Motivate extending chain-of-thought prompting to medical reasoning for automatic diagnosis.
  • Propose Diagnostic-Reasoning CoT (DR-CoT) to elicit expert-like reasoning in LLMs.
  • Develop a few-shot LLM-based dialogue system for automatic diagnosis.
  • Introduce a language-model-role-playing evaluation framework to simulate patient-doctor interactions.

Proposed Method

  • Prompt LLMs with a two-shot DR-CoT template to guide evidence gathering and differential diagnosis generation.
  • Augment the instruction to summarize evidence and formulate a differential diagnosis before formulating the next question.
  • Replace standard prompts with a DR-CoT-driven prompt that ties evidence to a ranked differential and next query.
  • Use a non-pipelined, few-shot dialogue setup where the model generates questions and a final diagnosis.
  • Evaluate using a language-model-role-playing framework where the LLM acts as both doctor and patient in self-chat dialogues.
  • Conduct experiments on the DDXPlus dataset with in-domain and out-domain splits.
Figure 3: The initial prompt includes the instruction $I$, the shots $S$, and the input $D$. The generated questions $q_{i}$ of the prompted model (i.e., the DSAD) and the answers $a_{i}$ from the patient bot are presented in the remaining text in black.
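
As a rough illustration of the prompt layout described in the Figure 3 caption, the sketch below concatenates an instruction, the exemplars, the patient's initial evidence, and the running question/answer history. The instruction wording, the `Turn` type, the `build_prompt` helper, and the "Doctor:"/"Patient:" speaker labels are all assumptions made for this example; the paper's actual prompt text is not reproduced here.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical instruction wording (an assumption, not the paper's prompt):
# ask the model to summarize evidence and rank a differential before asking.
INSTRUCTION = (
    "You are a doctor. Before each question, summarize the evidence so far, "
    "list a ranked differential diagnosis, then ask the next question."
)


@dataclass
class Turn:
    """One dialogue round: a generated question q_i and the patient's answer a_i."""
    question: str
    answer: str


def build_prompt(instruction: str, shots: List[str],
                 initial_evidence: str, turns: List[Turn]) -> str:
    """Concatenate I, S, D, then the (q_i, a_i) history, ending on an open doctor turn."""
    parts = [instruction, ""]
    for k, shot in enumerate(shots, start=1):
        parts.append(f"Example {k}:\n{shot}\n")
    parts.append(f"Patient: {initial_evidence}")
    for turn in turns:
        parts.append(f"Doctor: {turn.question}")
        parts.append(f"Patient: {turn.answer}")
    parts.append("Doctor:")  # the model continues from here
    return "\n".join(parts)
```

With two exemplar strings in `shots`, this mirrors the two-shot setup: the same instruction and exemplars are resent each round, and only the dialogue history grows.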

Experimental Results

Research Questions

  • RQ1: Can DR-CoT prompting improve diagnostic accuracy of LLM-based automatic diagnosis compared to standard prompting?
  • RQ2: Does DR-CoT generalize to out-domain initial evidence beyond the exemplars?
  • RQ3: Does the DR-CoT approach lead to more informative questioning that supports correct diagnoses?
  • RQ4: Is a language-model-role-playing evaluation framework a viable proxy for realistic DSAD assessment?

Key Findings

  • DR-CoT prompting yields a 15% improvement in diagnostic accuracy over standard prompting.
  • The accuracy improvement with DR-CoT is 18% in out-domain settings.
  • Two-shot exemplars with DR-CoT significantly enhance convergence speed and diagnostic performance.
  • A physician-evaluated human study supports that DR-CoT prompts help the model ask more critical questions.
  • The evaluation framework using role-playing between doctor and patient enables automated, end-to-end assessment.
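
The role-playing evaluation mentioned above can be sketched as a self-chat loop between two text-in, text-out callables. This is a minimal illustration under stated assumptions, not the authors' implementation: the `self_chat` and `accuracy` helpers, the `Diagnosis:` marker, the turn budget, and the exact-match scoring rule are all hypothetical, and the toy doctor and patient functions stand in for real LLM calls.

```python
from typing import Callable, List, Tuple

DIAGNOSIS_PREFIX = "Diagnosis:"  # hypothetical marker for the final answer


def self_chat(doctor: Callable[[str], str], patient: Callable[[str], str],
              initial_evidence: str, max_turns: int = 5) -> Tuple[str, List[Tuple[str, str]]]:
    """Alternate doctor questions and patient answers until a diagnosis is emitted."""
    history: List[Tuple[str, str]] = []
    context = f"Patient: {initial_evidence}"
    for _ in range(max_turns):
        reply = doctor(context).strip()
        if reply.startswith(DIAGNOSIS_PREFIX):
            return reply[len(DIAGNOSIS_PREFIX):].strip(), history
        answer = patient(reply).strip()
        history.append((reply, answer))
        context += f"\nDoctor: {reply}\nPatient: {answer}"
    return "", history  # no diagnosis within the turn budget


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Case-insensitive exact-match accuracy (an assumed scoring rule)."""
    if not gold:
        return 0.0
    correct = sum(p.strip().lower() == g.strip().lower()
                  for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy stand-ins for the two model roles, for illustration only.
def toy_doctor(context: str) -> str:
    if "Doctor:" in context and "fever" in context.lower():
        return "Diagnosis: influenza"
    return "Do you have a fever?"


def toy_patient(question: str) -> str:
    return "Yes, since yesterday."


diagnosis, dialogue = self_chat(toy_doctor, toy_patient, "I have a cough.")
```

Because both roles are plain callables, the same loop could in principle be run once with a standard prompt and once with a DR-CoT prompt, then scored with `accuracy` against the dataset's gold diagnoses.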


This review was generated by AI and reviewed by a human editor.