QUICK REVIEW

[Paper Review] BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text

Elliot Bolton, Abhinav Venigalla | arXiv (Cornell University) | Mar 27, 2024
Biomedical Text Mining and Ontologies | Cited by 31
One-Sentence Summary

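BioMedLM is a 2.7B-parameter GPT-style model trained exclusively on PubMed abstracts and full articles. After fine-tuning, it achieves competitive biomedical question-answering performance while supporting on-device inference and open data provenance.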

ABSTRACT

Models such as GPT-4 and Med-PaLM 2 have demonstrated impressive performance on a wide variety of biomedical NLP tasks. However, these models have hundreds of billions of parameters, are computationally expensive to run, require users to send their input data over the internet, and are trained on unknown data sources. Can smaller, more targeted models compete? To address this question, we build and release BioMedLM, a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles. When fine-tuned, BioMedLM can produce strong multiple-choice biomedical question-answering results competitive with much larger models, such as achieving a score of 57.3% on MedMCQA (dev) and 69.0% on the MMLU Medical Genetics exam. BioMedLM can also be fine-tuned to produce useful answers to patient questions on medical topics. This demonstrates that smaller models can potentially serve as transparent, privacy-preserving, economical and environmentally friendly foundations for particular NLP applications, such as in biomedicine. The model is available on the Hugging Face Hub: https://huggingface.co/stanford-crfm/BioMedLM.
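As a rough illustration of the Hub release mentioned in the abstract, the sketch below loads the checkpoint with the Hugging Face `transformers` library. That library choice is an assumption (the abstract only gives the model URL), and the helper names `load_biomedlm` and `complete`, the default generation length, and greedy decoding are hypothetical choices made for this example. Running it downloads several gigabytes of weights.

```python
MODEL_ID = "stanford-crfm/BioMedLM"  # repository name taken from the URL in the abstract


def load_biomedlm(model_id: str = MODEL_ID):
    """Download (or load from the local cache) the tokenizer and causal-LM weights."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def complete(tokenizer, model, prompt: str, max_new_tokens: int = 40) -> str:
    """Greedy continuation of a prompt (hypothetical helper, not from the paper)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `complete(*load_biomedlm(), "Photosynthesis is")` would return the prompt followed by the model's continuation. Because the base checkpoint is not instruction-tuned, the fine-tuned setups described in the abstract are what the reported scores refer to.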

Motivation and Objectives

  • Motivate the development of a domain-specific, smaller LLM to address privacy, cost, and transparency concerns of large models in biomedicine.

Proposed Method

  • Autoregressive decoder-only Transformer (GPT-2 style) with 2.7B parameters.
  • Domain-specific Byte-Pair Encoding tokenizer trained on PubMed abstracts to improve tokenization of biomedical terms.
  • Pre-training on PubMed abstracts and full-text articles (34.6B tokens per pass; 8.67 passes, i.e., roughly 300B tokens seen in total) using mixed precision (bf16 for the final training run) and the Decoupled AdamW optimizer.
  • Fine-tuning for downstream biomedical QA tasks with architecture specialized for multiple-choice prompts (per-task prompt shaping and a final linear classifier over answer scores).
  • Generation-style fine-tuning for consumer-health question answering (long-form responses) using web-derived QA pairs.
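The multiple-choice setup in the list above can be pictured roughly as follows: each answer option gets a feature vector, a linear layer maps that vector to a single score, and a softmax over the options selects the answer. This is a minimal pure-Python sketch under stated assumptions. The toy feature vectors stand in for the language model's hidden state for each (question, option) prompt, the weights are arbitrary, and all function names are hypothetical; the head used in the paper may differ in detail.

```python
import math
from typing import List, Sequence, Tuple


def linear_score(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Single-output linear layer: one scalar score per answer option."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the per-option scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(option_features: Sequence[Sequence[float]],
            weights: Sequence[float], bias: float) -> Tuple[int, List[float]]:
    """Score every option and return (index of best option, option probabilities)."""
    scores = [linear_score(f, weights, bias) for f in option_features]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


def cross_entropy(probs: Sequence[float], gold_index: int) -> float:
    """Fine-tuning loss for one question: negative log-probability of the gold option."""
    return -math.log(probs[gold_index])


# Toy stand-ins for the encoder output of four (question, option) prompts.
OPTION_FEATURES = [[0.1, 0.3], [0.9, 0.2], [0.4, 0.4], [0.0, 0.1]]
WEIGHTS, BIAS = [1.0, -0.5], 0.0

best, probs = predict(OPTION_FEATURES, WEIGHTS, BIAS)
print("predicted option:", best)
print("loss if option 1 is gold:", cross_entropy(probs, 1))
```

In a real fine-tuning run the weights and the underlying model would be updated to reduce the cross-entropy loss over the training questions.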
Figure 1: Train and Validation Loss after 100k Batches

Experimental Results

Research Questions

  • RQ1: Can a compact, domain-specialized model (2.7B parameters) match or approach the performance of larger models on biomedical QA tasks?
  • RQ2: Does training exclusively on PubMed data and using a biomedical tokenizer improve downstream task performance relative to general-domain baselines?
  • RQ3: What are the trade-offs in privacy, cost, and accessibility when deploying a small, open biomedical LLM compared to closed, large models?

Key Findings

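| Dataset | Model | Parameters | Method | Accuracy (%) |
| --- | --- | --- | --- | --- |
| MedMCQA | GPT-4 | - | few-shot | 72.4 |
| MedMCQA | Flan-PaLM | 540B | few-shot | 57.6 |
| MedMCQA | BioMedLM | 2.7B | fine-tune | 57.3 |
| MedMCQA | Galactica | 120B | zero-shot | 52.9 |
| MedMCQA | GPT-3.5 | 175B | few-shot | 51.0 |
| MedQA | Med-PaLM 2 | - | closed, few-shot | 85.4 |
| MedQA | GPT-4 | - | closed, few-shot | 81.4 |
| MedQA | Flan-PaLM | 540B | closed, few-shot | 67.2 |
| MedQA | BioMedLM (MedMCQA data + classifier) | 2.7B | fully open, fine-tune | 54.7 |
| MedQA | GPT-3.5 | 175B | closed, few-shot | 53.6 |
| MedQA | BioMedLM (classifier) | 2.7B | fully open, fine-tune | 50.3 |
| MedQA | DRAGON | 360M | fully open, fine-tune | 47.5 |
| MedQA | BioLinkBERT | 340M | fully open, fine-tune | 45.1 |
| MedQA | Galactica | 120B | open weights, zero-shot | 44.4 |
| MedQA | GPT-Neo 2.7B | 2.7B | fully open, fine-tune | 37.7 |
| BioASQ | BioMedLM | 2.7B | fine-tune | 95.7 |
| BioASQ | DRAGON | 360M | fine-tune | 96.4 |
| BioASQ | BioLinkBERT | 340M | fine-tune | 94.9 |
| BioASQ | Galactica | 120B | zero-shot | 94.3 |
| BioASQ | GPT-Neo 2.7B | 2.7B | fine-tune | 67.1 |
| PubMedQA | BioMedLM | 2.7B | fine-tune | 74.4 |
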
  • BioMedLM achieves competitive results on multiple biomedical QA benchmarks after fine-tuning, approaching or matching larger models in several tasks (e.g., MedMCQA 57.3%, MMLU Medical Genetics 69.0%).
  • Domain-specific pre-training on PubMed with a specialized tokenizer yields a measurable gain over the GPT-2 tokenizer baseline (e.g., MedQA accuracy rises from 33.05 to 34.98 at the 125M scale).
  • Compared to GPT-Neo 2.7B, which was trained on general English data, BioMedLM performs substantially better on select QA tasks (e.g., 95.7 vs. 67.1 on BioASQ, a gap of about 28.6 percentage points).
  • BioMedLM supports on-device inference and can be fine-tuned on modest hardware while providing transparency about training data and architecture.
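The on-device claim can be sanity-checked with a back-of-envelope estimate of weight storage: parameter count times bytes per parameter. The sketch below is an approximation under stated assumptions. It counts weights only (no activations, KV cache, or optimizer state), uses decimal gigabytes, and assumes every model is stored in the same dtype purely for comparison. Parameter counts are the ones listed in the results table; the function name is hypothetical.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}

# Parameter counts as listed in the results table.
MODELS = {
    "BioMedLM": 2.7e9,
    "Galactica": 120e9,
    "GPT-3.5": 175e9,
    "Flan-PaLM": 540e9,
}


def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Approximate weight storage in decimal GB (weights only)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n_params in MODELS.items():
    print(f"{name:10s} {weight_memory_gb(n_params):8.1f} GB in bf16")
```

Under these assumptions BioMedLM's weights take about 5.4 GB in bf16, while the 540B-parameter model in the table would need roughly 1,080 GB for weights alone.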
Figure 2: Comparison of GPT-Neo 2.7B and BioMedLM on Select QA Tasks
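Part of this comparison can be recomputed from the results table, which lists both models on MedQA and BioASQ. The short sketch below copies those four accuracies and prints the gap in percentage points. For MedQA it uses the classifier-only BioMedLM row (50.3), which is a choice made for this example; the names `RESULTS` and `gap` are hypothetical.

```python
# Accuracies (%) copied from the results table for tasks where both models appear.
RESULTS = {
    "MedQA": {"BioMedLM": 50.3, "GPT-Neo 2.7B": 37.7},
    "BioASQ": {"BioMedLM": 95.7, "GPT-Neo 2.7B": 67.1},
}


def gap(task: str) -> float:
    """BioMedLM accuracy minus GPT-Neo 2.7B accuracy, in percentage points."""
    scores = RESULTS[task]
    return round(scores["BioMedLM"] - scores["GPT-Neo 2.7B"], 1)


for task in RESULTS:
    print(f"{task}: BioMedLM leads by {gap(task)} points")
```

Both models have 2.7B parameters, so the difference isolates the effect of the training data and tokenizer rather than model size.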

This review was generated by AI and reviewed by human editors.