QUICK REVIEW

[Paper Review] Almanac: Retrieval-Augmented Language Models for Clinical Medicine

Cyril Zakka, Akash Chaurasia|arXiv (Cornell University)|Mar 1, 2023

Topic Modeling | Cited by 25

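One-Sentence Summary

Almanac combines a large language model with retrieval from medical sources to answer clinical questions, improving factuality and safety over the baseline while providing source-grounded responses.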

ABSTRACT

Large-language models have recently demonstrated impressive zero-shot capabilities in a variety of natural language tasks such as summarization, dialogue generation, and question-answering. Despite many promising applications in clinical medicine, adoption of these models in real-world settings has been largely limited by their tendency to generate incorrect and sometimes even toxic statements. In this study, we develop Almanac, a large language model framework augmented with retrieval capabilities for medical guideline and treatment recommendations. Performance on a novel dataset of clinical scenarios (n = 130) evaluated by a panel of 5 board-certified and resident physicians demonstrates significant increases in factuality (mean of 18% at p-value < 0.05) across all specialties, with improvements in completeness and safety. Our results demonstrate the potential for large language models to be effective tools in the clinical decision-making process, while also emphasizing the importance of careful testing and deployment to mitigate their shortcomings.

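Research Motivation and Objectives

Address the factuality, completeness, and safety challenges that large language models face in clinical workflows.
Assess Almanac's ability to ground its answers in retrieved sources and in-text citations.
Evaluate performance across multiple specialties using a physician-led evaluation rubric.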

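Proposed Method

Store medical content semantically in a vector database and run approximate nearest-neighbor search (HNSW) over 1,536-dimensional embeddings from text-embedding-ada-002.
Retrieve articles and split them into 1,000-token passages, score the documents against the query, and pass the top matches to the instruction-tuned LLM (text-davinci-003) to generate an answer with citations.
Use a retrieval-augmented generation pipeline that combines in-context prompting with chain-of-thought reasoning and declines to answer when information is insufficient.
Have a panel of 5 board-certified and resident physicians rate factuality, completeness, and safety on the ClinicalQA dataset (n = 130) and on adversarial prompts.
Introduce ClinicalQA, a new benchmark covering cardiothoracic surgery, cardiology, neurology, infectious diseases, and pediatrics, to reflect real-world clinical questions.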
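
The retrieve-then-generate flow described in this section can be sketched in a few lines. This is a minimal illustration under loud assumptions, not the authors' implementation: whitespace splitting stands in for a real tokenizer, a hashed bag-of-words vector stands in for text-embedding-ada-002 (only the 1,536 dimensionality is mirrored), exact brute-force cosine search stands in for HNSW, and the 0.3 score threshold and all function names are hypothetical.

```python
import math
import zlib
from collections import Counter


def chunk_tokens(text, max_tokens=1000):
    """Split text into passages of at most max_tokens whitespace tokens.

    Assumption: whitespace tokens approximate model tokens.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]


def toy_embed(text, dim=1536):
    """Hashed bag-of-words vector, L2-normalised.

    Stand-in for a learned embedding model; it only matches exact words.
    """
    vec = [0.0] * dim
    for word, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Dot product of two unit-length vectors, i.e. cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, passages, top_k=3, min_score=0.3):
    """Return up to top_k (score, index, passage) tuples scoring >= min_score.

    Exact search over every passage; a vector database would use an
    approximate index such as HNSW instead.
    """
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(p)), i, p) for i, p in enumerate(passages)]
    scored = [s for s in scored if s[0] >= min_score]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]


def build_prompt(query, hits):
    """Build a citation-style prompt, or return None to abstain."""
    if not hits:
        return None  # nothing cleared the threshold: decline to answer
    sources = "\n".join(f"[{rank}] {passage}"
                        for rank, (_, _, passage) in enumerate(hits, 1))
    return ("Answer using only the numbered sources and cite them.\n\n"
            f"Sources:\n{sources}\n\nQuestion: {query}")


if __name__ == "__main__":
    passages = [
        "aspirin dose for adults with chest pain",
        "pediatric fever management guidance",
    ]
    question = "aspirin dose chest pain"
    print(build_prompt(question, retrieve(question, passages)))
```

Returning `None` from `build_prompt` is one simple way to model abstention: the caller can skip the LLM call entirely when no passage clears the threshold, rather than relying on the model to refuse.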

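Experimental Results

Research Questions

RQ1: Does Almanac improve the factuality of clinical answers compared with a baseline without retrieval?
RQ2: Does Almanac achieve higher safety and completeness on open-ended clinical queries?
RQ3: Can a retrieval-grounded language model provide reliable, citable guidance across multiple medical specialties?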

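Key Findings

Across specialties, Almanac improves factuality over ChatGPT by an average of 18 percentage points (p < 0.05).
The largest factuality gap appears in cardiology (Almanac 91% vs. ChatGPT 69%).
Almanac handles clinical calculations correctly with its built-in calculator and answers every calculation scenario; ChatGPT, which has no calculator, gets all 5 scenarios wrong.
Against adversarial prompts, Almanac shows strong safety (95% vs. 0%); when the threshold is not met, a prompt can lead the model to abstain.
Physicians still preferred ChatGPT's output 57% of the time, which highlights user-experience considerations despite Almanac's grounding and safety advantages.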
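
Because the factuality results mix percentages and percentage points, it may help to keep the two readings apart. The sketch below uses only the cardiology figures quoted above (91% vs. 69%); the function names are hypothetical and introduced for illustration.

```python
def percentage_point_gap(a_pct, b_pct):
    """Absolute difference between two percentages, in percentage points."""
    return a_pct - b_pct


def relative_gain_pct(a_pct, b_pct):
    """Gain of a over b, expressed as a percent of b."""
    return (a_pct - b_pct) / b_pct * 100.0


almanac_cardiology = 91.0
chatgpt_cardiology = 69.0

gap = percentage_point_gap(almanac_cardiology, chatgpt_cardiology)
rel = relative_gain_pct(almanac_cardiology, chatgpt_cardiology)
print(f"gap: {gap:.0f} percentage points, relative gain: {rel:.1f}%")
```

For these figures the gap is 22 percentage points, which corresponds to a relative gain of roughly 31.9% over the ChatGPT score.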

This review was generated by AI and reviewed by a human editor.