QUICK REVIEW

[Paper Review] ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge

Yunxiang Li, Li Zihan|arXiv (Cornell University)|Mar 24, 2023

Machine Learning in Healthcare | Cited by 33

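One-Sentence Summary

ChatDoctor fine-tunes LLaMA on 100k real patient-doctor conversations and equips it with self-directed retrieval of external knowledge from online and offline sources, with the aim of giving more accurate medical responses than ChatGPT.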

ABSTRACT

The primary aim of this research was to address the limitations observed in the medical knowledge of prevalent large language models (LLMs) such as ChatGPT, by creating a specialized language model with enhanced accuracy in medical advice. We achieved this by adapting and refining the large language model meta-AI (LLaMA) using a large dataset of 100,000 patient-doctor dialogues sourced from a widely used online medical consultation platform. These conversations were cleaned and anonymized to respect privacy concerns. In addition to the model refinement, we incorporated a self-directed information retrieval mechanism, allowing the model to access and utilize real-time information from online sources like Wikipedia and data from curated offline medical databases. The fine-tuning of the model with real-world patient-doctor interactions significantly improved the model's ability to understand patient needs and provide informed advice. By equipping the model with self-directed information retrieval from reliable online and offline sources, we observed substantial improvements in the accuracy of its responses. Our proposed ChatDoctor represents a significant advancement in medical LLMs, demonstrating a significant improvement in understanding patient inquiries and providing accurate advice. Given the high stakes and low error tolerance in the medical field, such enhancements in providing accurate and reliable information are not only beneficial but essential.

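Motivation and Objectives

Improve the accuracy of medical dialogue by fine-tuning a language model on real-world patient-doctor conversations.
Improve reliability by integrating an external knowledge brain (online/offline) that supports real-time information retrieval.
Demonstrate improvements over a general-domain model (ChatGPT) in precision, recall, and F1 on medical queries.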

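Proposed Method

Fine-tune LLaMA-7B on the HealthCareMagic-100k patient-doctor conversations, following Alpaca-style instruction tuning.
Build an external knowledge brain from MedlinePlus-derived disease data and supplementary sources such as Wikipedia.
Develop a self-directed keyword-mining prompt that extracts keywords from the query for knowledge retrieval.
Implement a keyword-driven retrieval system that returns the top-ranked matches, splitting retrieved content into sections and applying token limits.
Prompt the model to read the retrieved knowledge sections and generate a final, informed answer.
Evaluate on questions from iCliniq, using human physicians' responses as ground truth and computing BERTScore precision, recall, and F1.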
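
The retrieval steps listed above (keyword extraction, keyword-driven ranking, token limiting, and a final reading prompt) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: in the paper the keywords come from prompting the model itself, whereas the sketch substitutes a simple stopword filter. The function names, the `DEMO_SECTIONS` placeholder texts, the whitespace "tokens", and the prompt wording are all hypothetical.

```python
import re

# Assumption: a tiny stopword list stands in for the model-driven keyword prompt.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "how", "do", "i",
             "for", "of", "to", "and", "my", "can", "with"}

# Hypothetical offline knowledge sections; the texts are placeholders, not medical content.
DEMO_SECTIONS = {
    "mpox": "placeholder notes about mpox symptoms and care",
    "influenza": "placeholder notes about influenza symptoms",
    "daybue": "placeholder notes about daybue dosing",
}


def extract_keywords(query: str) -> list[str]:
    """Keep unique non-stopword tokens, in order of first appearance."""
    keywords: list[str] = []
    for token in re.findall(r"[a-z0-9]+", query.lower()):
        if token not in STOPWORDS and token not in keywords:
            keywords.append(token)
    return keywords


def rank_sections(keywords: list[str], sections: dict[str, str],
                  top_k: int = 2) -> list[tuple[str, str]]:
    """Score each section by keyword occurrences and return the top_k matches."""
    scored = []
    for title, text in sections.items():
        words = re.findall(r"[a-z0-9]+", f"{title} {text}".lower())
        score = sum(words.count(keyword) for keyword in keywords)
        if score > 0:
            scored.append((score, title, text))
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(title, text) for _, title, text in scored[:top_k]]


def truncate_tokens(text: str, max_tokens: int) -> str:
    """Apply a token limit, using whitespace-separated words as stand-in tokens."""
    return " ".join(text.split()[:max_tokens])


def build_prompt(query: str, sections: dict[str, str],
                 max_tokens_per_section: int = 40) -> str:
    """Assemble a prompt asking the model to read retrieved sections, then answer."""
    hits = rank_sections(extract_keywords(query), sections)
    context = "\n".join(
        f"[{title}] {truncate_tokens(text, max_tokens_per_section)}"
        for title, text in hits
    )
    return (
        "Read the retrieved knowledge and answer the patient's question.\n"
        f"Knowledge:\n{context}\n"
        f"Question: {query}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_prompt("What are the symptoms of mpox?", DEMO_SECTIONS))
```

Keeping extraction, ranking, and truncation as separate functions means the stopword filter could later be swapped for a model call without touching the rest of the pipeline.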

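Experimental Results

Research Questions

RQ1: Can a large language model fine-tuned on the medical domain outperform a general-domain model (ChatGPT) on medical dialogue tasks?
RQ2: Does adding a self-directed external knowledge retrieval mechanism improve the accuracy and timeliness of answers to medical questions?
RQ3: How does ChatDoctor perform on queries involving relatively new terms or diseases that are absent from the training set (for example, Mpox and Daybue)?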

Key Findings

Model	Precision	Recall	F1	P-value
ChatGPT	0.837±0.0188	0.8445±0.0164	0.8406±0.0143
ChatDoctor	0.8444±0.0185	0.8451±0.0157	0.8446±0.0138

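In the reported evaluation, ChatDoctor scores higher than ChatGPT on BERTScore precision, recall, and F1, though the margins are small (for example, an F1 of 0.8446 versus 0.8406).
Self-directed knowledge retrieval enabled correct answers about recently renamed terms (such as Mpox) and newly approved drugs (such as Daybue).
Qualitative examples show that, in several scenarios, ChatDoctor gives more professional and better-evidenced medical guidance than ChatGPT.
The model was fine-tuned on HealthCareMagic-100k and tested on iCliniq data, and it shows improvements on the presented metrics.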
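
The precision, recall, and F1 values in the table are BERTScore metrics, which compare a generated answer with a reference answer by greedily matching token embeddings on cosine similarity. The sketch below shows that matching step only, under loud assumptions: toy 2-D vectors replace the contextual embeddings a BERT-style encoder would produce, the function names are hypothetical, and the optional idf weighting and baseline rescaling of the full metric are omitted. It is not the evaluation script used in the paper.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bertscore_from_embeddings(candidate: list[list[float]],
                              reference: list[list[float]]) -> tuple[float, float, float]:
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy embeddings (assumption): two candidate tokens, one reference token.
    candidate_tokens = [[1.0, 0.0], [0.0, 1.0]]
    reference_tokens = [[1.0, 0.0]]
    print(bertscore_from_embeddings(candidate_tokens, reference_tokens))
```

In the toy case, one of the two candidate tokens has no similar reference token, so precision is 0.5 while recall is 1.0, since the single reference token is fully covered.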


This review was generated by AI and checked by a human editor.