QUICK REVIEW

[Paper Review] Me LLaMA: Foundation Large Language Models for Medical Applications

Qianqian Xie, Qingyu Chen|arXiv (Cornell University)|Feb 20, 2024

Machine Learning in Healthcare | Cited by 7

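One-Sentence Summary

Me-LLaMA is a family of medical language models built on the open-source LLaMA models and optimized through domain-specific continued pre-training and instruction tuning to improve medical text analysis and diagnosis. It outperforms other open-source models in zero-shot and supervised settings, and is competitive with ChatGPT and GPT-4 on several datasets and on complex clinical cases.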

ABSTRACT

Recent advancements in large language models (LLMs) like ChatGPT and LLaMA show promise in medical applications, yet challenges remain in medical language comprehension. This study presents Me-LLaMA, a new medical LLM family based on open-source LLaMA models, optimized for medical text analysis and diagnosis by leveraging large-scale, domain-specific datasets. The Me-LLaMA family, including foundation models Me-LLaMA 13/70B and their chat-enhanced versions, was developed through continued pre-training and instruction tuning with 129B tokens and 214K samples from biomedical and clinical sources. Training the 70B models required over 100,000 A100 GPU hours. Me-LLaMA's performance was evaluated across six medical text analysis tasks using 12 benchmark datasets and complex clinical case diagnosis, with automatic and human evaluations. Results indicate Me-LLaMA outperforms LLaMA and other open-source medical LLMs in zero-shot and supervised settings. Task-specific tuning further boosts performance, surpassing ChatGPT on 7 of 8 datasets and GPT-4 on 5 of 8. For complex clinical cases, Me-LLaMA achieves performance comparable to ChatGPT and GPT-4. This work underscores the importance of domain-specific data in developing medical LLMs and addresses the high computational costs involved in training, highlighting a balance between pre-training and fine-tuning strategies. Me-LLaMA models are now accessible under user agreements, providing a valuable resource for advancing medical AI.

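Motivation and Objectives

Address the need for domain-specific LLMs in medicine to improve medical language comprehension and diagnostic support.
Develop a family of medical LLMs (Me-LLaMA 13B/70B) through continued pre-training and instruction tuning on biomedical and clinical data.
Evaluate performance on multiple medical text analysis tasks and on complex clinical case diagnosis, using both automatic and human evaluation.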

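Proposed Method

Continue pre-training the LLaMA-based Me-LLaMA foundation models (13B and 70B) on domain-specific data totaling 129B tokens.
Create chat-enhanced versions through instruction tuning on 214K biomedical and clinical samples.
Commit substantial compute to training (over 100,000 A100 GPU hours for the 70B models).
Evaluate on six medical text analysis tasks using 12 benchmark datasets, plus complex clinical case diagnosis.
Compare zero-shot and supervised performance against LLaMA and other open-source medical LLMs, and compare against ChatGPT and GPT-4 after task-specific tuning.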

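Experimental Results

Research Questions

RQ1: On core medical text analysis tasks, can an LLM adapted to the domain with biomedical and clinical data outperform LLaMA and other open-source medical LLMs?
RQ2: How does task-specific tuning affect performance on medical benchmarks compared with the zero-shot setting?
RQ3: Can Me-LLaMA models match or exceed state-of-the-art closed models (ChatGPT, GPT-4) across multiple medical datasets and in complex clinical scenarios?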

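Key Findings

Me-LLaMA outperforms LLaMA and other open-source medical LLMs in both zero-shot and supervised settings across the six medical text analysis tasks.
Task-specific tuning further improves performance, surpassing ChatGPT on 7 of 8 datasets.
After tuning, Me-LLaMA surpasses GPT-4 on 5 of 8 datasets.
On complex clinical cases, Me-LLaMA performs comparably to ChatGPT and GPT-4.
The work underscores the value of domain-specific data and discusses the high computational cost of training and the balance between pre-training and fine-tuning strategies.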
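
Head-to-head tallies of the form "better on 7 of 8 datasets" can be reproduced from per-dataset scores with a few lines of code. The following is a minimal sketch, not the authors' evaluation code: the function name count_wins, the dataset names, and all score values are hypothetical placeholders (none are taken from the paper), and it assumes a single metric per dataset where higher is better.

```python
from typing import Dict, Tuple


def count_wins(ours: Dict[str, float], baseline: Dict[str, float]) -> Tuple[int, int]:
    """Return (wins, total) over the datasets present in both score tables.

    A "win" means a strictly higher score; ties do not count.
    Assumes higher scores are better for every dataset.
    """
    shared = sorted(set(ours) & set(baseline))
    wins = sum(1 for name in shared if ours[name] > baseline[name])
    return wins, len(shared)


# Placeholder scores for illustration only; NOT results from the paper.
tuned = {"dataset_a": 0.81, "dataset_b": 0.64, "dataset_c": 0.70}
reference = {"dataset_a": 0.78, "dataset_b": 0.69, "dataset_c": 0.66}

wins, total = count_wins(tuned, reference)
print(f"better on {wins} of {total} datasets")
```

Counting only strict wins is a design choice; a study could equally report ties separately or apply a significance test per dataset before counting.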

This review was generated by AI and reviewed by a human editor.