QUICK REVIEW

[论文解读] MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation

Seon‐Ok Kim|ArXiv.org|Feb 5, 2025

Biomedical Text Mining and Ontologies被引用 5

一句话总结

MedBioLM 将领域特定微调与检索增强生成（RAG）结合，以提升在封闭式、长篇和短篇任务的生物医学问答的准确性，在关键基准上超越基础模型，并显示RAG在基于检索的查询中的事实性增强。

ABSTRACT

Large Language Models (LLMs) have demonstrated impressive capabilities across natural language processing tasks. However, their application to specialized domains such as medicine and biology requires further optimization to ensure factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a domain-adapted biomedical question-answering model designed to enhance both short-form and long-form queries. By integrating fine-tuning and retrieval-augmented generation (RAG), MedBioLM dynamically incorporates domain-specific knowledge, improving reasoning abilities and factual accuracy. To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA datasets, covering structured multiple-choice assessments and complex clinical reasoning tasks. Fine-tuning significantly improves accuracy on benchmark datasets, while RAG enhances factual consistency. These results highlight the potential of domain-optimized LLMs in advancing biomedical research, medical education, and clinical decision support.

研究动机与目标

在利用领域自适应LLMs的情况下，提升生物医学QA的事实准确性、可靠性与情境深度。
评估微调、RAG与提示工程在多种QA格式（封闭式、长篇、短篇）中的影响。
在多样的生物医学QA数据集上评估性能，并确定在何种条件下各优化策略最有帮助。

提出的方法

在多样化的QA数据集上对生物医学LLM进行微调，以提升领域特定推理与事实准确性。
将检索增强生成（RAG）与基于关键词的结构化索引结合，以实现精确的外部知识检索。
应用提示工程，根据QA格式（封闭式、长篇、短篇）定制系统提示和解码参数。
使用Azure-based基础设施实现可扩展的微调和推理优化。
使用封闭式准确性与文本生成指标（ROUGE、BLEU、BERTScore、BLEURT）在不同数据集上进行评估。
与包括GPT-4o、GPT-4和GPT-3.5在内的基础模型进行对比，以量化微调和RAG带来的收益。

Figure 1: Comparative performance of MedBioLM and base models on closed-ended and short-form biomedical QA tasks, highlighting the benefits of fine-tuning.

实验结果

研究问题

RQ1领域特定微调如何影响封闭式生物医学QA数据集（MedQA、PubMedQA、BioASQ）上的准确性？
RQ2检索增强生成（RAG）对生物医学QA的事实性准确性和词汇相似性有何影响？
RQ3提示工程与解码参数如何影响短篇和长篇生物医学答案的质量？
RQ4微调模型是否在多种QA格式和数据集上优于基础模型，在何种条件下RAG能带来额外价值？
RQ5在生物医学QA中，相较于GPT-4和GPT-3.5，GPT-4o 是否从领域适配中受益？

主要发现

数据集	MedBioLM	GPT-4o	GPT-4o-mini	GPT-4	GPT-3.5
MedQA	88.0	87.0	70.4	81.71	50.51
PubMedQA	78.9	44.74	77.55	70.0	19.30
BioASQ	96.0	92.0	92.0	96.0	88.0

微调后的 MedBioLM 在 MedQA 上达到 88.0% 的准确率，在 PubMedQA 上为 78.9%，在 BioASQ 上为 96.0%，在 MedQA 和 PubMedQA 上优于 GPT-4o 与 GPT-3.5，BioASQ 的表现接近完美。
RAG 提升短篇QA的指标，增加 ROUGE-1 以及其他词汇相似性指标，尽管整体而言微调对短篇与长篇输出的影响更大。
在 MedicationQA 的长篇QA 中，微调带来显著提升（ROUGE-1: 24.69；BLEU: 2.49；BERTScore: 8.98），而 LiveQA 的结果在某些情况下提示潜在的过拟合。
短篇QA 的结果显示微调后的 GPT-4o 显著优于基础模型（ROUGE-1: 43.17 对比 4.35；BLEU: 11.55 对比 0.28），在应用微调时，RAG 提供的额外好处有限。
对比性成对评估显示 GPT-4o 常常具有更高的整体准确性，而 MedBioLM 在某些情况下在连贯性与简练性方面表现出色，显示出互补优势。
BLEURT 分数在长篇生成中对各模型总体仍偏负，表明在生成类人类长篇回答方面仍存在挑战。

Figure 2: Overview of our approach for optimizing large language models (LLMs) in biomedical question answering, integrating fine-tuning, retrieval-augmented generation (RAG), and prompt engineering to enhance performance across different QA formats.

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。