QUICK REVIEW

[Paper Review] BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine

Yizhen Luo, Jiahuan Zhang|arXiv (Cornell University)|Aug 18, 2023

Machine Learning in Bioinformatics|Cited by 27

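One-Sentence Summary

BioMedGPT introduces an open multimodal generative transformer for biomedicine that aligns molecules, proteins, and natural language in a unified feature space. BioMedGPT-10B achieves strong biomedical QA performance, and the models and multimodal datasets are open-sourced.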

ABSTRACT

Foundation models (FMs) have exhibited remarkable performance across a wide range of downstream tasks in many domains. Nevertheless, general-purpose FMs often face challenges when confronted with domain-specific problems, due to their limited access to the proprietary training data in a particular domain. In biomedicine, there are various biological modalities, such as molecules, proteins, and cells, which are encoded by the language of life and exhibit significant modality gaps with human natural language. In this paper, we introduce BioMedGPT, an open multimodal generative pre-trained transformer (GPT) for biomedicine, to bridge the gap between the language of life and human natural language. BioMedGPT allows users to easily "communicate" with diverse biological modalities through free text, which is the first of its kind. BioMedGPT aligns different biological modalities with natural language via a large generative language model, namely, BioMedGPT-LM. We publish BioMedGPT-10B, which unifies the feature spaces of molecules, proteins, and natural language via encoding and alignment. Through fine-tuning, BioMedGPT-10B outperforms or is on par with human and significantly larger general-purpose foundation models on the biomedical QA task. It also demonstrates promising performance in the molecule QA and protein QA tasks, which could greatly accelerate the discovery of new drugs and therapeutic targets. In addition, BioMedGPT-LM-7B is the first large generative language model based on Llama2 in the biomedical domain, therefore is commercial friendly. Both BioMedGPT-10B and BioMedGPT-LM-7B are open-sourced to the research community. In addition, we publish the datasets that are meticulously curated for the alignment of multi-modalities, i.e., PubChemQA and UniProtQA. All the models, codes, and datasets are available at https://github.com/PharMolix/OpenBioMed.

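Motivation and Objectives

Bridge the language of life and human natural language using a large language model fine-tuned on biomedical data.
Integrate the text, molecule, and protein modalities through independent encoders and align them in a shared feature space.
Demonstrate BioMedGPT-10B on biomedical QA, molecule QA, and protein QA tasks, and release datasets for modality alignment.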

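Proposed Method

Fine-tune Llama2-Chat-7B on a large biomedical corpus to create BioMedGPT-LM-7B.
Build BioMedGPT-10B by aligning 2D molecular graphs and protein sequences with the natural language space through modality adapters.
Use GraphMVP as the molecule encoder and ESM-2 as the protein encoder, each with its own independent modality adapter.
Perform multimodal fine-tuning on two curated datasets, PubChemQA and UniProtQA, using role-based prompts.
To save compute and avoid forgetting, freeze the parameters of BioMedGPT-LM and train the molecule/protein encoders and adapters (an approach similar to mPLUG-owl).
Evaluate on biomedical QA benchmarks (MedMCQA, PubMedQA, USMLE), molecule QA (ChEBI-20), and protein QA (UniProtQA), reporting BLEU, ROUGE, and METEOR metrics.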
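
The adapter-based alignment described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the toy dimensions, the helper names (`make_linear`, `apply_linear`, `build_llm_inputs`), the use of a single linear layer per adapter, and the choice to prepend projected features to the text embeddings are all assumptions made for illustration. The `TRAINABLE` table only restates which parts the summary says are frozen or trained.

```python
import random

def make_linear(in_dim, out_dim, seed):
    """Create a small random weight matrix (out_dim x in_dim) and a zero bias."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def apply_linear(layer, vector):
    """Compute weight @ vector + bias for one feature vector."""
    weight, bias = layer
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weight, bias)]

# Hypothetical toy sizes; the real encoders and language model are far larger.
MOL_DIM, PROT_DIM, LLM_DIM = 4, 6, 8

# One independent adapter per modality (assumed here to be a single linear map).
mol_adapter = make_linear(MOL_DIM, LLM_DIM, seed=0)
prot_adapter = make_linear(PROT_DIM, LLM_DIM, seed=1)

def build_llm_inputs(text_embeddings, mol_features=(), prot_features=()):
    """Project encoder outputs into the LLM embedding space and join them
    with the text token embeddings (prepending is an assumption)."""
    projected = [apply_linear(mol_adapter, f) for f in mol_features]
    projected += [apply_linear(prot_adapter, f) for f in prot_features]
    return projected + list(text_embeddings)

# Which parameter groups receive gradient updates during multimodal fine-tuning,
# following the summary: the language model is frozen, the rest is trained.
TRAINABLE = {
    "language_model": False,
    "molecule_encoder": True,
    "protein_encoder": True,
    "molecule_adapter": True,
    "protein_adapter": True,
}
```

With three text token embeddings, two molecule feature vectors, and one protein feature vector, `build_llm_inputs` returns a sequence of six vectors that all live in the same `LLM_DIM`-dimensional space, which is the sense in which the modalities share one feature space.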

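Experimental Results

Research Questions

RQ1: Can a single large biomedical language model effectively align and reason over multiple modalities (molecules, proteins, text)?
RQ2: Do fine-tuning and dedicated multimodal alignment improve performance on biomedical QA tasks beyond that of general-purpose LLMs?
RQ3: When data are presented in a unified multimodal space, how do molecule QA and protein QA capabilities compare with baseline language models?
RQ4: Which datasets and prompting strategies most effectively support multimodal alignment in biomedicine?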

Key Findings

Method	Setting	MedMCQA (ID)	PubMedQA (ID)	USMLE (OOD)
Llama2-Chat	Fine-tuned	48.3	75.5	45.3
PMC-Llama	0	50.5	69.5	44.7
BioMedGPT-10B (ours)	Fine-tuned	51.4	76.1	50.4
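
To make the margins in the table easier to read, the following short sketch computes the difference between BioMedGPT-10B and each baseline on every benchmark. The scores are copied from the table; the `scores` dictionary and the `margin_over` helper are hypothetical names introduced only for this illustration.

```python
# Scores copied from the table above.
scores = {
    "BioMedGPT-10B": {"MedMCQA": 51.4, "PubMedQA": 76.1, "USMLE": 50.4},
    "Llama2-Chat": {"MedMCQA": 48.3, "PubMedQA": 75.5, "USMLE": 45.3},
    "PMC-Llama": {"MedMCQA": 50.5, "PubMedQA": 69.5, "USMLE": 44.7},
}

def margin_over(model, baseline):
    """Per-benchmark score difference (model minus baseline), rounded to one decimal."""
    return {
        bench: round(scores[model][bench] - scores[baseline][bench], 1)
        for bench in scores[model]
    }

for baseline in ("Llama2-Chat", "PMC-Llama"):
    print(baseline, margin_over("BioMedGPT-10B", baseline))
```

The largest gaps over both baselines appear in the USMLE (OOD) column and in the PubMedQA gap over PMC-Llama, while the MedMCQA gap over PMC-Llama is under one point.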

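BioMedGPT-10B matches or exceeds much larger models on the biomedical QA benchmarks (MedMCQA, PubMedQA) and outperforms the baselines on the out-of-domain USMLE benchmark.
On PubMedQA, BioMedGPT-10B achieves performance close to that of human experts.
On molecule QA, BioMedGPT-10B with alignment substantially outperforms ChatGPT and Llama2-7B-Chat on the BLEU-2, BLEU-4, and ROUGE metrics.
On protein QA, BioMedGPT-10B with alignment achieves strong BLEU/ROUGE results and substantially outperforms the baselines, indicating that integrating protein sequence data with natural language is effective.
BioMedGPT-LM-7B is the first biomedical generative language model based on Llama2 and is open-sourced, as is BioMedGPT-10B.
The authors release the PubChemQA and UniProtQA datasets to promote research on multimodal alignment.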
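
For readers unfamiliar with the BLEU-2 and BLEU-4 metrics mentioned in the findings, here is a simplified sentence-level sketch: clipped n-gram precision, a geometric mean over n-gram orders, and a brevity penalty. It assumes a single reference, whitespace tokenization, and no smoothing, so it will not reproduce the scores of any particular evaluation toolkit, and it is not claimed to be the implementation used in the paper. The example sentences are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def bleu(candidate, reference, max_n=2):
    """Unsmoothed BLEU-max_n for one candidate against one reference."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity * math.exp(log_mean)

# Invented example: a generated description versus a reference description.
candidate = "the molecule is an organic acid".split()
reference = "the molecule is a carboxylic acid".split()
print(round(bleu(candidate, reference, max_n=2), 3))
```

Setting `max_n=4` gives the BLEU-4 variant; on short answers it is often zero without smoothing, which is one reason reports usually list BLEU-2 alongside it.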

This review was generated by AI and checked by a human editor.