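[Paper Explained] Rationale-Guided Retrieval Augmented Generation for Medical Question Answering
RAG$^2$ introduces rationale-guided filtering, rationale-based query formulation, and balanced retrieval to improve the accuracy of medical question answering; it consistently improves LLM accuracy across baselines and model sizes.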
Large language models (LLMs) hold significant potential for applications in biomedicine, but they struggle with hallucinations and outdated knowledge. While retrieval-augmented generation (RAG) is generally employed to address these issues, it also has its own set of challenges: (1) LLMs are vulnerable to irrelevant or incorrect context, (2) medical queries are often not well-targeted for helpful information, and (3) retrievers are prone to bias toward the specific source corpus they were trained on. In this study, we present RAG$^2$ (RAtionale-Guided RAG), a new framework for enhancing the reliability of RAG in biomedical contexts. RAG$^2$ incorporates three key innovations: a small filtering model trained on perplexity-based labels of rationales, which selectively augments informative snippets of documents while filtering out distractors; LLM-generated rationales as queries to improve the utility of retrieved snippets; and a structure designed to retrieve snippets evenly from a comprehensive set of four biomedical corpora, effectively mitigating retriever bias. Our experiments demonstrate that RAG$^2$ improves state-of-the-art LLMs of varying sizes, with improvements of up to 6.1\%, and it outperforms the previous best medical RAG model by up to 5.6\% across three medical question-answering benchmarks. Our code is available at https://github.com/dmis-lab/RAG2.
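Research Motivation and Objectives
- Address hallucinations and outdated knowledge in biomedical LLMs by integrating retrieval with generation.
- Reduce the impact of irrelevant or incorrect retrieved context (distractors) by training a small filtering model on differences in rationale perplexity.
- Improve the utility of retrieved snippets for question answering by using LLM-generated rationales as queries.
- Promote balanced evidence sourcing across four biomedical corpora to reduce corpus bias.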
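The perplexity-difference signal behind the filtering objective can be illustrated with a minimal sketch. It assumes the per-token log-probabilities of a rationale have already been obtained from some LLM, once given only the question and once given the question plus a retrieved document. The function names, the `margin` parameter, and the toy numbers are hypothetical and are not taken from the paper or its code.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean log-probability) over the rationale tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def helpfulness_label(
    logprobs_without_doc: Sequence[float],
    logprobs_with_doc: Sequence[float],
    margin: float = 0.0,  # hypothetical threshold, not taken from the paper
) -> int:
    """Return 1 if adding the document lowers rationale perplexity
    by more than `margin`, otherwise 0."""
    ppl_without = perplexity(logprobs_without_doc)
    ppl_with = perplexity(logprobs_with_doc)
    return int(ppl_without - ppl_with > margin)


# Toy natural-log probabilities for the same rationale tokens, scored
# with the question only and with question + document (made-up values).
without_doc = [-2.0, -1.5, -2.5]
helpful_doc = [-0.5, -0.4, -0.6]
distractor_doc = [-2.4, -2.0, -2.6]

label_helpful = helpfulness_label(without_doc, helpful_doc)
label_distractor = helpfulness_label(without_doc, distractor_doc)
print(label_helpful, label_distractor)
```

Labels of this kind could then serve as training targets for a small classifier that decides whether a snippet is passed to the LLM; the exact labeling rule used by the authors may differ.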
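Proposed Method
- Train a small Flan-T5-based filter on perplexity-based labels that compare rationales with and without the retrieved document.
- Use LLM-generated rationales as the queries for retrieving evidence (rationale-based query formulation).
- Retrieve an equal number of snippets from each of four corpora (PubMed, PMC, textbooks, clinical guidelines) to balance sources.
- Apply a reranker (MedCPT) after balanced retrieval to refine snippet relevance.
- Use single-pass generation for evaluation, avoiding costly iterative procedures.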
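The retrieval steps above (rationale as query, an equal quota per corpus, then reranking) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `balanced_retrieve`, `make_retriever`, and `overlap` are hypothetical names, the corpora are four toy lists, and a simple word-overlap score stands in for a neural reranker such as MedCPT.

```python
from typing import Callable, Dict, List, Tuple

Snippet = Tuple[str, str]                    # (corpus name, snippet text)
Retriever = Callable[[str, int], List[str]]  # (query, k) -> top-k snippets
Scorer = Callable[[str, str], float]         # (query, snippet) -> relevance


def balanced_retrieve(
    query: str,
    retrievers: Dict[str, Retriever],
    k_per_corpus: int,
    rerank_score: Scorer,
    top_n: int,
) -> List[Snippet]:
    """Take the same number of snippets from every corpus, then rerank the pool."""
    pool: List[Snippet] = []
    for corpus, retrieve in retrievers.items():
        pool.extend((corpus, text) for text in retrieve(query, k_per_corpus))
    pool.sort(key=lambda item: rerank_score(query, item[1]), reverse=True)
    return pool[:top_n]


def overlap(query: str, text: str) -> float:
    """Toy relevance: count of shared lowercase words (stand-in for a reranker)."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def make_retriever(docs: List[str]) -> Retriever:
    """Toy per-corpus retriever that ranks documents by word overlap."""
    def retrieve(query: str, k: int) -> List[str]:
        return sorted(docs, key=lambda d: overlap(query, d), reverse=True)[:k]
    return retrieve


# Made-up miniature corpora, one per source type named in the text.
corpora = {
    "pubmed": ["metformin lowers glucose in type 2 diabetes", "aspirin and platelet function"],
    "pmc": ["metformin is first-line therapy for type 2 diabetes", "imaging of knee injuries"],
    "textbooks": ["type 2 diabetes involves insulin resistance", "anatomy of the femur"],
    "guidelines": ["guidelines recommend metformin for type 2 diabetes", "hand hygiene protocol"],
}
retrievers = {name: make_retriever(docs) for name, docs in corpora.items()}

# The query is an LLM-style rationale rather than the raw question.
rationale = "metformin is the first-line drug for type 2 diabetes"
candidates = balanced_retrieve(rationale, retrievers, k_per_corpus=1,
                               rerank_score=overlap, top_n=4)
for corpus, text in candidates:
    print(corpus, "|", text)
```

Because every corpus contributes the same quota before reranking, no single source can fill the candidate pool on its own; the reranker then orders the mixed pool by relevance to the rationale.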
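Experimental Results
Research Questions
- RQ1: Does rationale-based filtering make retrieved snippets more useful to the base LLM?
- RQ2: Does rationale-based query formulation improve evidence quality and question-answering performance across medical benchmarks?
- RQ3: Does balanced retrieval reduce retriever bias and improve cross-corpus coverage?
- RQ4: How does RAG$^2$ affect different backbone LLMs and medical question-answering datasets?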
Key Findings
| Model | MedQA | MedMCQA | MMLU-Med | Average |
|---|---|---|---|---|
| Llama-3-8B-Instruct + RAG$^2$ | 64.6 | 59.4 | 74.8 | 66.3 |
| Meerkat-7B + RAG$^2$ | 75.6 | 63.0 | 78.7 | 72.4 |
| GPT-4o + RAG$^2$ | 91.1 | 77.2 | 92.5 | 86.9 |
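- RAG$^2$ improves the average accuracy of backbone LLMs by up to 6.1%.
- Across three medical question-answering benchmarks, RAG$^2$ outperforms the previous best medical RAG model by up to 5.6%.
- RAG$^2$ yields notable gains for open-source, medical, and commercial LLMs alike (for example, GPT-4o improves notably).
- Balanced retrieval consistently outperforms MedRAG on the main benchmarks.
- Ablation studies indicate that rationale-based filtering and rationale-based query formulation both contribute substantially to the performance gains.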
This explainer was generated by AI and reviewed by a human editor.