QUICK REVIEW

[Paper Review] JMLR: Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability

Junda Wang, Zhichao Yang|arXiv (Cornell University)|Feb 27, 2024

Artificial Intelligence in Healthcare and Education | Cited by 11

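One-Sentence Summary

This paper introduces Joint Medical LLM and Retrieval Training (JMLR), a framework that trains the retrieval model and the large language model in sync, improving medical question answering and reasoning while shortening training time. On 7B and 13B Llama-based models, it achieves state-of-the-art results among open-source models on medical benchmarks.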

ABSTRACT

Large Language Models (LLMs) have demonstrated a remarkable potential in medical knowledge acquisition and question-answering. However, LLMs can potentially hallucinate and yield factually incorrect outcomes, even with domain-specific pretraining. Previously, retrieval augmented generation (RAG) has limited success in addressing hallucinations. Unlike previous methods in RAG where the retrieval model was trained separately from the LLM, we introduce JMLR (for Jointly trains LLM and information Retrieval) during the fine-tuning phase. The synchronized training mechanism enhances JMLR's ability to retrieve clinical guidelines and leverage medical knowledge to reason and answer questions and reduces the demand for computational resources. We evaluated JMLR on the important medical question-answering application. Our experimental results demonstrate that JMLR-13B (70.5%) outperforms a previous state-of-the-art open-source model using conventional pre-training and fine-tuning Meditron-70B (68.9%) and Llama2-13B with RAG (67.7%) on a medical question-answering dataset. Comprehensive evaluations reveal JMLR-13B enhances reasoning quality and reduces hallucinations better than Claude3-Opus. Additionally, JMLR-13B (148 GPU hours) also trains much faster than Meditron-70B (42630 GPU hours). Through this work, we provide a new and efficient knowledge enhancement method for healthcare, demonstrating the potential of integrating retrieval and LLM training for medical question-answering systems.

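Research Motivation and Objectives

Improve medical question answering and reasoning with domain-specific knowledge.
Address hallucinations by grounding the LLM in retrieved medical guidelines and texts.
Develop a joint training paradigm that updates the retriever and the LLM together for better alignment.
Evaluate efficiency gains and performance against the conventional pre-training plus fine-tuning pipeline.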

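Proposed Method

Uses Shifted Sparse Attention (S2-Attn) to handle long input contexts.
Adopts a ColBERT-based retriever with a joint LLM-retriever training objective (the LLM-Rank loss).
Trains on question-answer pairs from AMBOSS and USMLE, feeding the most relevant retrieved documents to the LLM.
Computes an LLM-driven loss and updates the retriever's parameters through a ranking-based signal that reflects the LLM's improvement.
At each iteration, retrieves the top 30 documents and feeds the top 7 to the LLM to generate an answer and its reasoning.
Compares the integrated JMLR against decoupled RAG and against baseline pre-training/fine-tuning on 7B and 13B Llama models.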
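
The retrieval steps listed above can be illustrated with a minimal sketch. It assumes ColBERT-style late-interaction (MaxSim) scoring for the retriever and uses a generic pairwise hinge loss as a stand-in for the LLM-Rank loss, whose exact form is not given in this review. All function names, the toy embeddings, and the `margin` parameter are hypothetical; only the 30/7 cut-offs come from the text.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late interaction: each query token takes its best-matching
    document token, and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def retrieve_then_select(
    query_tokens: List[Vector],
    corpus: Dict[str, List[Vector]],
    n_retrieve: int = 30,  # candidates retrieved per iteration (from the text)
    n_context: int = 7,    # documents passed to the LLM (from the text)
) -> Tuple[List[str], List[str]]:
    """Rank every document, keep the top candidates, and return both the
    candidate list and the shorter list that would go into the prompt."""
    ranked = sorted(
        corpus,
        key=lambda doc_id: maxsim_score(query_tokens, corpus[doc_id]),
        reverse=True,
    )
    candidates = ranked[:n_retrieve]
    return candidates, candidates[:n_context]


def pairwise_rank_loss(
    retriever_scores: List[float],
    llm_losses: List[float],
    margin: float = 1.0,  # hypothetical hyperparameter
) -> float:
    """ASSUMPTION: a generic pairwise hinge loss, not the paper's formula.
    If document i gives the LLM a lower answer loss than document j, the
    retriever is penalized unless it scores i above j by at least `margin`."""
    total, pairs = 0.0, 0
    for i, loss_i in enumerate(llm_losses):
        for j, loss_j in enumerate(llm_losses):
            if loss_i < loss_j:
                gap = retriever_scores[i] - retriever_scores[j]
                total += max(0.0, margin - gap)
                pairs += 1
    return total / pairs if pairs else 0.0


# Toy 2-dimensional token embeddings, purely illustrative.
query = [[1.0, 0.0], [0.0, 1.0]]
corpus = {
    "guideline_a": [[1.0, 0.0], [0.0, 1.0]],
    "guideline_b": [[0.5, 0.5]],
    "guideline_c": [[-1.0, 0.0]],
}
candidates, context = retrieve_then_select(query, corpus, n_retrieve=2, n_context=1)
print(candidates, context)
```

In this sketch the retriever is the only component that receives the ranking signal; in a joint setup the LLM's own answer loss would be optimized in the same training loop, which is the coupling the method list describes.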

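Experimental Results

Research Questions

RQ1: Does synchronizing the training of the retriever and the LLM outperform conventional pre-training/fine-tuning and RAG baselines in medical question-answering accuracy and reasoning?
RQ2: Can JMLR reduce training time and resource consumption while matching or exceeding state-of-the-art performance on medical benchmarks?
RQ3: How does JMLR affect hallucination tendency and interpretability in medical question answering?
RQ4: How does model size (7B versus 13B) affect JMLR's performance and efficiency?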

Key Findings

Dataset (accuracy, %)	PMC-Llama-7B	Llama-2-7B	Meditron-7B	JMLR-7B
MMLU-Medical	59.7	56.3	55.6	57.2
MedMCQA	57.6	54.4	59.2	61.3
MedQA	42.4	44.0	47.9	51.7
AMBOSS	43.7	46.5	50.1	68.7

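JMLR-13B reaches 81.2% on AMBOSS and 61.3% on MedQA, surpassing Meditron-70B and ChatGPT on these datasets.
JMLR-7B reaches 68.7% on AMBOSS and 51.7% on MedQA, surpassing several publicly available baselines.
Training with JMLR takes 37 hours, significantly less than conventional pre-training (127 h) followed by fine-tuning (17 h).
Both JMLR-7B and JMLR-13B show strong results on the MMLU-Medical, MedMCQA, MedQA, and AMBOSS benchmarks, indicating improved medical reasoning and question-answering ability.
In independent evaluations, GPT-4 and three physicians each judged JMLR-13B's reasoning to be better in the majority of cases (win rate 0.63 by GPT-4; 0.60 by the experts).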
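
The training-time comparison in the findings above comes down to simple arithmetic, sketched here. The figures are the ones quoted in this review and in the abstract; the variable names are illustrative. The two measurements are kept apart because the source reports them in different units (hours versus GPU hours).

```python
# Training hours quoted in the key findings.
jmlr_hours = 37
pretrain_hours = 127
finetune_hours = 17

# Conventional pipeline: pre-training followed by fine-tuning.
baseline_hours = pretrain_hours + finetune_hours
training_hour_ratio = baseline_hours / jmlr_hours

# GPU hours quoted in the abstract.
jmlr_13b_gpu_hours = 148
meditron_70b_gpu_hours = 42630
gpu_hour_ratio = meditron_70b_gpu_hours / jmlr_13b_gpu_hours

print(f"Conventional pipeline: {baseline_hours} h vs JMLR: {jmlr_hours} h "
      f"(ratio {training_hour_ratio:.2f})")
print(f"Meditron-70B vs JMLR-13B GPU hours: ratio {gpu_hour_ratio:.0f}")
```

On these numbers the conventional pipeline takes about 3.9 times as long as JMLR, and Meditron-70B uses roughly 288 times the GPU hours of JMLR-13B. The second ratio also reflects the difference in model size, not only the training method.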

This review was generated by AI and reviewed by a human editor.