QUICK REVIEW

[Paper Review] BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine

Yizhen Luo, Jiahuan Zhang | arXiv (Cornell University) | 2023-08-18

Machine Learning in Bioinformatics | Citations: 27

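One-Line Summary

BioMedGPT is an open multimodal generative transformer for biomedicine that aligns molecules, proteins, and natural language in a unified feature space. BioMedGPT-10B achieves strong biomedical QA performance, and the model and its multimodal datasets are released as open source.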

ABSTRACT

Foundation models (FMs) have exhibited remarkable performance across a wide range of downstream tasks in many domains. Nevertheless, general-purpose FMs often face challenges when confronted with domain-specific problems, due to their limited access to the proprietary training data in a particular domain. In biomedicine, there are various biological modalities, such as molecules, proteins, and cells, which are encoded by the language of life and exhibit significant modality gaps with human natural language. In this paper, we introduce BioMedGPT, an open multimodal generative pre-trained transformer (GPT) for biomedicine, to bridge the gap between the language of life and human natural language. BioMedGPT allows users to easily "communicate" with diverse biological modalities through free text, which is the first of its kind. BioMedGPT aligns different biological modalities with natural language via a large generative language model, namely, BioMedGPT-LM. We publish BioMedGPT-10B, which unifies the feature spaces of molecules, proteins, and natural language via encoding and alignment. Through fine-tuning, BioMedGPT-10B outperforms or is on par with human and significantly larger general-purpose foundation models on the biomedical QA task. It also demonstrates promising performance in the molecule QA and protein QA tasks, which could greatly accelerate the discovery of new drugs and therapeutic targets. In addition, BioMedGPT-LM-7B is the first large generative language model based on Llama2 in the biomedical domain, therefore is commercial friendly. Both BioMedGPT-10B and BioMedGPT-LM-7B are open-sourced to the research community. In addition, we publish the datasets that are meticulously curated for the alignment of multi-modalities, i.e., PubChemQA and UniProtQA. All the models, codes, and datasets are available at https://github.com/PharMolix/OpenBioMed.

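Motivation and Objectives

Bridge the language of life and human natural language using a large language model fine-tuned on biomedical data.
Integrate the text, molecule, and protein modalities through independent encoders and align them in a shared feature space.
Demonstrate BioMedGPT-10B on biomedical QA, molecule QA, and protein QA tasks, and release the datasets used for modality alignment.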

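Proposed Method

Fine-tune Llama2-Chat-7B on a large-scale biomedical corpus to produce BioMedGPT-LM-7B.
Build BioMedGPT-10B by aligning 2D molecular graphs and protein sequences with the natural language space through modality adapters.
Use GraphMVP as the molecule encoder and ESM-2 as the protein encoder, each with its own independent modality adapter.
Perform multimodal fine-tuning on two curated datasets, PubChemQA and UniProtQA, guiding the model with role-based prompts.
Freeze the BioMedGPT-LM parameters and train only the molecule/protein encoders and adapters (similar to mPLUG-owl), which reduces compute cost and avoids forgetting.
Evaluate on biomedical QA benchmarks (MedMCQA, PubMedQA, USMLE), molecule QA (ChEBI-20), and protein QA (UniProtQA), reporting BLEU, ROUGE, and METEOR scores for the molecule and protein QA tasks.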
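
As a rough illustration of the alignment step summarized above, the sketch below projects a fixed-size modality feature vector (standing in for a GraphMVP or ESM-2 output) into the language model's embedding dimension and splices it into a sequence of text token embeddings. The single linear layer, the toy dimensions, and the names `LinearAdapter` and `splice_modality` are assumptions made for illustration; the review does not specify the adapter architecture, and plain Python lists are used here in place of a deep learning framework.

```python
import random


class LinearAdapter:
    """Hypothetical adapter: one linear map from encoder dim to LM embedding dim."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Weight matrix of shape (out_dim, in_dim) with small random values.
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, features):
        if len(features) != len(self.weight[0]):
            raise ValueError("feature size does not match adapter input dim")
        # Standard affine projection: W @ x + b.
        return [sum(w * x for w, x in zip(row, features)) + b
                for row, b in zip(self.weight, self.bias)]


def splice_modality(text_embeddings, modality_embedding, position):
    """Insert one projected modality vector into a list of token embeddings."""
    return (text_embeddings[:position]
            + [modality_embedding]
            + text_embeddings[position:])


# Toy sizes (assumed): the encoder emits 6-dim features, the LM uses 4-dim embeddings.
ENCODER_DIM, LM_DIM = 6, 4
adapter = LinearAdapter(ENCODER_DIM, LM_DIM)

molecule_features = [0.5, -1.0, 0.25, 0.0, 1.5, -0.75]  # stand-in for encoder output
text_embeddings = [[0.0] * LM_DIM for _ in range(3)]     # stand-in for 3 text tokens

projected = adapter(molecule_features)
sequence = splice_modality(text_embeddings, projected, position=1)

# Training setup as summarized in this review: LM frozen, encoders and adapters trained.
trainable = {
    "language_model": False,
    "molecule_encoder": True,
    "protein_encoder": True,
    "adapters": True,
}
```

The design point is that only the small adapter (and, per this summary, the encoders) needs gradients, so the language model's weights stay untouched while it receives molecule or protein information as extra positions in its input sequence.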

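Experimental Results

Research Questions

RQ1: Can a single large biomedical language model effectively align and reason over multiple modalities (molecules, proteins, text)?
RQ2: Do fine-tuning and dedicated multimodal alignment improve biomedical QA performance beyond general-purpose LLMs?
RQ3: How do molecule QA and protein QA capabilities compare with the base language model when the data is presented in a unified multimodal space?
RQ4: Which datasets and prompting strategies best support multimodal alignment in biomedicine?

Key Results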

Method	Setting	MedMCQA (ID)	PubMedQA (ID)	USMLE (OOD)
Llama2-Chat	Fine-tuning	48.3	75.5	45.3
PMC-Llama	0	50.5	69.5	44.7
BioMedGPT-10B (ours)	Fine-tuning	51.4	76.1	50.4
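
To make the size of the gaps in the table concrete, the short sketch below computes BioMedGPT-10B's margin over the best listed baseline on each benchmark. The scores are copied from the table; the helper name `margins_over_best_baseline` is hypothetical, and the calculation assumes the rows are directly comparable even though the Setting entry for PMC-Llama is unclear in this review.

```python
# Scores copied from the results table in this review.
scores = {
    "BioMedGPT-10B": {"MedMCQA": 51.4, "PubMedQA": 76.1, "USMLE": 50.4},
    "Llama2-Chat": {"MedMCQA": 48.3, "PubMedQA": 75.5, "USMLE": 45.3},
    "PMC-Llama": {"MedMCQA": 50.5, "PubMedQA": 69.5, "USMLE": 44.7},
}


def margins_over_best_baseline(scores, target):
    """Return target score minus the best other model's score, per benchmark."""
    result = {}
    for bench in scores[target]:
        best = max(row[bench] for name, row in scores.items() if name != target)
        result[bench] = round(scores[target][bench] - best, 1)
    return result


margins = margins_over_best_baseline(scores, "BioMedGPT-10B")
print(margins)  # {'MedMCQA': 0.9, 'PubMedQA': 0.6, 'USMLE': 5.1}
```

The margin is under one point on the two in-domain benchmarks and about five points on the out-of-domain USMLE set.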

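BioMedGPT-10B matches or exceeds much larger models on the biomedical QA benchmarks (MedMCQA, PubMedQA) and outperforms the baselines on the out-of-domain USMLE set.
On PubMedQA, BioMedGPT-10B performs on par with human experts.
In molecule QA, BioMedGPT-10B with alignment substantially outperforms ChatGPT and Llama2-7B-Chat on BLEU-2, BLEU-4, and ROUGE.
In protein QA, BioMedGPT-10B with alignment achieves strong BLEU/ROUGE scores and clearly outperforms the base model, demonstrating effective integration of protein sequence data with natural language.
BioMedGPT-LM-7B is the first Llama2-based large generative language model for the biomedical domain and is open source; BioMedGPT-10B is open source as well.
The authors release the PubChemQA and UniProtQA datasets to facilitate research on multimodal alignment.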
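
For readers unfamiliar with the BLEU-n scores cited for molecule and protein QA, here is a minimal sketch of a simplified sentence-level BLEU: the geometric mean of clipped n-gram precisions up to `max_n`, multiplied by a brevity penalty. It uses whitespace tokenization, a single reference, and no smoothing, all of which are simplifying assumptions; it is not the evaluation script used by the authors, and the example sentences are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return overlap / total


def bleu(candidate_text, reference_text, max_n=2):
    """Simplified single-reference BLEU-max_n without smoothing."""
    candidate = candidate_text.lower().split()
    reference = reference_text.lower().split()
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if not candidate or min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty discourages candidates shorter than the reference.
    if len(candidate) >= len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(log_mean)


# Invented example sentences, for illustration only.
reference = "the molecule is a weak organic acid"
candidate = "the molecule is an organic acid"
score = bleu(candidate, reference, max_n=2)
```

Setting `max_n=4` gives the BLEU-4 variant; because no smoothing is applied, any candidate with no matching 4-gram scores zero, which is one reason production evaluations usually rely on an established library instead.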

This review was generated by AI and reviewed by a human editor.