QUICK REVIEW

[Paper Review] PMC-LLaMA: Towards Building Open-source Language Models for Medicine

Chaoyi Wu, Weixiong Lin | arXiv (Cornell University) | Apr 27, 2023

Topic Modeling | Cited by 56

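One-Sentence Summary

PMC-LLaMA is an open-source, 13B-parameter, medicine-focused language model derived from LLaMA. It is built through data-centric knowledge injection and medical instruction tuning, and it outperforms ChatGPT on several medical QA benchmarks while remaining lightweight.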

ABSTRACT

Recently, Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding. While demonstrating proficiency in everyday conversations and question-answering situations, these models frequently struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge. In this paper, we describe the procedure for building a powerful, open-source language model specifically designed for medicine applications, termed as PMC-LLaMA. Our contributions are threefold: (i) we systematically investigate the process of adapting a general-purpose foundation language model towards medical domain, this involves data-centric knowledge injection through the integration of 4.8M biomedical academic papers and 30K medical textbooks, as well as comprehensive fine-tuning for alignment with domain-specific instructions; (ii) we contribute a large-scale, comprehensive dataset for instruction tuning. This dataset encompasses medical question-answering (QA), rationale for reasoning, and conversational dialogues, comprising a total of 202M tokens; (iii) we conduct thorough ablation studies to demonstrate the effectiveness of each proposed component. While evaluating on various public medical question-answering benchmarks, our lightweight PMC-LLaMA, which consists of only 13 billion parameters, exhibits superior performance, even surpassing ChatGPT. All models, codes, datasets can be found in https://github.com/chaoyi-wu/PMC-LLaMA.

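Motivation and Objectives

Investigate how to adapt a general-purpose large language model to the medical domain through data-centric knowledge injection.
Collect a large medical corpus (MedC-K) and a medical instruction-tuning dataset (MedC-I) to achieve domain alignment.
Evaluate PMC-LLaMA on standard medical QA benchmarks and use ablations to identify the components that contribute most.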

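Proposed Method

Inject medical knowledge by training on MedC-K, a corpus of 4.8M biomedical papers and 30K textbooks.
Perform medicine-specific instruction tuning on MedC-I (202M tokens) to align the model with clinical use cases.
Train in two stages: knowledge injection with an autoregressive loss, followed by instruction tuning (instruction-response format, rationales, and knowledge-graph prompts).
Combine three data sources for instruction tuning: medical conversation data, rationale QA, and knowledge-graph prompts built from UMLS.
Evaluate on three public medical QA benchmarks (PubMedQA, MedMCQA, and USMLE, i.e. MedQA) and run ablations over model size, knowledge injection, and instruction tuning.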
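The autoregressive loss used in the first stage can be illustrated with a small, dependency-free sketch. It computes the mean negative log-likelihood of each token given the scores predicted from its prefix. The function name `next_token_nll` and the optional `loss_mask` argument are illustrative choices, not taken from the paper; in particular, masking out prompt tokens during instruction tuning is a common convention that this summary does not confirm for PMC-LLaMA.

```python
import math


def next_token_nll(logits, token_ids, loss_mask=None):
    """Mean negative log-likelihood of next-token prediction.

    logits[t] holds unnormalized vocabulary scores used to predict
    token_ids[t + 1]. loss_mask (hypothetical, optional) marks which
    target positions count toward the loss.
    """
    total, count = 0.0, 0
    for t in range(len(token_ids) - 1):
        target_pos = t + 1
        if loss_mask is not None and not loss_mask[target_pos]:
            continue  # skip targets excluded from the loss
        row = logits[t]
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[token_ids[target_pos]]
        count += 1
    return total / count if count else 0.0


# Uniform scores over a 4-token vocabulary give a loss of log(4), about 1.386.
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = next_token_nll(uniform, [0, 1, 2, 3])
```

With `loss_mask=None` every position contributes, which matches plain language-model training on raw text; passing a mask restricts the loss to selected positions such as the answer span.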

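Experimental Results

Research Questions

RQ1: After data-centric knowledge injection and medicine-specific instruction tuning, can a 13B open-source LLM match or even surpass larger closed-source models such as ChatGPT on medical QA?
RQ2: How much does each component (knowledge injection from papers and books, rationale QA, conversation data, and knowledge-graph prompts) contribute to medical QA performance?
RQ3: How do model size and training recipe affect results on medical QA benchmarks?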

Key Findings

Model (size)	Knowledge injection (papers)	Knowledge injection (books)	Instruction tuning (rationale)	MedQA	MedMCQA	PubMedQA	Average
Baseline LLaMA 7B	✗	✗	✗	44.54	48.51	73.40	-
Baseline LLaMA 13B	✗	✗	✗	45.48	51.42	76.40	-
PMC-LLaMA_K 7B	✓	✗	✗	44.70	50.54	69.50	-
PMC-LLaMA_K 7B	✓	✓	✗	45.56	51.45	74.60	-
PMC-LLaMA_K 13B	✓	✓	✗	48.15	54.15	77.10	-
PMC-LLaMA 13B (initial)	✓	✓	✓	49.32	54.56	77.20	-
PMC-LLaMA 13B (full setup)	✓	✓	✓	56.36	56.04	77.90	-
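To read the ablation rows as gains rather than raw scores, the table can be loaded into a small lookup and differenced row against row. This is only a convenience sketch: the scores are copied from the table above, while the row labels and the helper name `gain` are shorthand introduced here.

```python
BENCHMARKS = ("MedQA", "MedMCQA", "PubMedQA")

# Scores copied from the ablation table; labels are shorthand for its rows.
ABLATION = {
    "LLaMA 7B": (44.54, 48.51, 73.40),
    "LLaMA 13B": (45.48, 51.42, 76.40),
    "PMC-LLaMA_K 7B (papers)": (44.70, 50.54, 69.50),
    "PMC-LLaMA_K 7B (papers + books)": (45.56, 51.45, 74.60),
    "PMC-LLaMA_K 13B": (48.15, 54.15, 77.10),
    "PMC-LLaMA 13B (initial)": (49.32, 54.56, 77.20),
    "PMC-LLaMA 13B (full setup)": (56.36, 56.04, 77.90),
}


def gain(after, before):
    """Per-benchmark score difference between two table rows, in points."""
    return {
        name: round(a - b, 2)
        for name, a, b in zip(BENCHMARKS, ABLATION[after], ABLATION[before])
    }


full_vs_base = gain("PMC-LLaMA 13B (full setup)", "LLaMA 13B")
injection_7b = gain("PMC-LLaMA_K 7B (papers + books)", "LLaMA 7B")
```

Here `full_vs_base` compares the full 13B setup with the 13B LLaMA baseline, and `injection_7b` isolates the effect of knowledge injection at 7B.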

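With the full knowledge injection and instruction tuning described above, PMC-LLaMA-13B reaches a reported average accuracy of 64.43 across MedQA, MedMCQA, and PubMedQA, ahead of several baselines.
In the ablations, knowledge injection helps: the 7B model trained on papers and books scores about 1.0 to 2.9 points above the 7B LLaMA baseline (MedQA 44.54 to 45.56, MedMCQA 48.51 to 51.45, PubMedQA 73.40 to 74.60). Papers alone help on MedQA and MedMCQA but not on PubMedQA (73.40 to 69.50), and the 13B model improves on the 7B model.
Adding rationale QA, conversation data, and knowledge-graph prompts during instruction tuning improves performance. On MedQA, accuracy rises from 49.32% with rationale QA to 54.43% once conversation data is added, and knowledge-graph prompts add about 1.93 points more, reaching 56.36%. Conversation data combined with rationale data gives a marked boost to zero-shot QA.
On these benchmarks, the full-setup PMC-LLaMA-13B has a higher reported average accuracy than ChatGPT (64.43 versus 54.97 over MedQA, MedMCQA, and PubMedQA).
The model is open source; the authors have released the code and data on GitHub.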
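The averages quoted above are means over the three benchmarks, which can be reproduced from the per-benchmark scores in the table. The sketch below assumes an unweighted mean; the name `macro_average` is illustrative. Under that assumption the full-setup scores average to about 63.43 rather than the quoted 64.43, and nothing in this summary explains the one-point gap, so the computed value should be read as a sanity check and not as a correction of the paper.

```python
def macro_average(scores):
    """Unweighted mean of per-benchmark accuracies (an assumed definition)."""
    values = list(scores)
    return sum(values) / len(values)


# Full-setup PMC-LLaMA 13B scores from the ablation table.
full_setup = {"MedQA": 56.36, "MedMCQA": 56.04, "PubMedQA": 77.90}

computed = round(macro_average(full_setup.values()), 2)  # 63.43
reported_pmc_llama = 64.43  # average as quoted in this review
reported_chatgpt = 54.97    # ChatGPT average as quoted in this review

# The margin over ChatGPT stays positive with either figure.
margin_from_table = round(computed - reported_chatgpt, 2)
margin_as_reported = round(reported_pmc_llama - reported_chatgpt, 2)
```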


This review was generated by AI and checked by a human editor.