QUICK REVIEW

[Paper Review] Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers

Hadi Abdine, Michail Chatzianastasis|arXiv (Cornell University)|Jul 25, 2023

Machine Learning in Bioinformatics | References: 45 | Citations: 13

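One-Line Summary

Prot2Text generates free-text descriptions of protein function by fusing a graph-based structural representation with a sequence model in an encoder-decoder GNN+LLM framework, and is evaluated on a multimodal dataset derived from SwissProt.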

ABSTRACT

In recent years, significant progress has been made in the field of protein function prediction with the development of various machine-learning approaches. However, most existing methods formulate the task as a multi-classification problem, i.e. assigning predefined labels to proteins. In this work, we propose a novel approach, Prot2Text, which predicts a protein's function in a free text style, moving beyond the conventional binary or categorical classifications. By combining Graph Neural Networks(GNNs) and Large Language Models(LLMs), in an encoder-decoder framework, our model effectively integrates diverse data types including protein sequence, structure, and textual annotation and description. This multimodal approach allows for a holistic representation of proteins' functions, enabling the generation of detailed and accurate functional descriptions. To evaluate our model, we extracted a multimodal protein dataset from SwissProt, and demonstrate empirically the effectiveness of Prot2Text. These results highlight the transformative impact of multimodal models, specifically the fusion of GNNs and LLMs, empowering researchers with powerful tools for more accurate function prediction of existing as well as first-to-see proteins.

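Motivation and Objectives

Reformulate protein function prediction as free-text generation rather than assignment of fixed labels.
Integrate sequence, structure, and textual annotation into a unified multimodal encoder.
Demonstrate that fusing graph and sequence information improves functional descriptions.
Provide a large, publicly available multimodal protein dataset for benchmarking.
Evaluate the trade-offs among model size, performance, and inference cost in Prot2Text.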

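Proposed Method

Build a heterogeneous protein graph from the AlphaFold structure with three edge types: sequential, spatial, and hydrogen-bond.
Encode the graph with Relational Graph Convolutional Networks (RGCN) to produce the graph representation h_G.
Encode the amino-acid sequence with the pretrained ESM2-35M model and project it to a shared dimension.
Fuse the sequence and graph representations in a fusion block that adds the projected graph features to each residue embedding, followed by a projection and normalization.
Decode a free-text protein description with a GPT-2-based Transformer decoder that cross-attends to the fused protein representation.
Train with causal language modeling (CLM) to generate descriptions of up to 256 tokens, using the GPT-2 tokenizer with two additional tokens that mark the sequence boundaries.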
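
The fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fuse`, `matvec`, `layer_norm`), the toy dimensions, the random weights, and the choice of layer normalization without learned scale or bias are all assumptions made for illustration. It only shows the shape of the computation: project the graph vector h_G to the residue embedding dimension, add it to every residue embedding, then project and normalize.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (rows x cols, list of lists) by vector x (length cols)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance.

    Assumption: the source only says "normalization"; layer norm without
    learned scale/bias is used here as a stand-in.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def fuse(residue_embs, h_G, W_graph, W_out):
    """Add the projected graph vector to each residue embedding, then project and normalize."""
    g = matvec(W_graph, h_G)  # graph vector mapped to the residue embedding dimension
    fused = []
    for h in residue_embs:
        summed = [a + b for a, b in zip(h, g)]  # same graph vector broadcast to every residue
        fused.append(layer_norm(matvec(W_out, summed)))
    return fused


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumptions): 5 residues, 8-dim sequence embeddings,
# 6-dim graph vector, 4-dim decoder input.
rng = random.Random(0)
n_res, d_seq, d_graph, d_model = 5, 8, 6, 4
residue_embs = [[rng.gauss(0.0, 1.0) for _ in range(d_seq)] for _ in range(n_res)]
h_G = [rng.gauss(0.0, 1.0) for _ in range(d_graph)]
W_graph = random_matrix(d_seq, d_graph, rng)
W_out = random_matrix(d_model, d_seq, rng)

fused = fuse(residue_embs, h_G, W_graph, W_out)
print(len(fused), len(fused[0]))  # one fused vector per residue
```

In this sketch the output keeps one vector per residue, which is the form a cross-attending decoder would consume; a real model would use learned weights and batched tensor operations instead of Python lists.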

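Experimental Results

Research Questions

RQ1: Does multimodal fusion of protein structure and sequence enable the generation of detailed free-text descriptions of protein function?
RQ2: How does integrating GNN-based structure encoding with a protein language model affect text generation quality?
RQ3: Which datasets and evaluation metrics best show the improvement over unimodal baselines?
RQ4: How much does model size in Prot2Text affect generation quality and inference time?
RQ5: Does a dedicated fusion mechanism outperform simple concatenation of modalities for protein-to-text generation?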

Key Findings

Model	# Params	BLEU Score	Rouge-1	Rouge-2	Rouge-L	BERT Score
vanilla-Transformer	225M	15.75	27.80	19.44	26.07	75.58
ESM2-35M	225M	32.11	47.46	39.18	45.31	83.21
RGCN	220M	21.63	36.20	28.01	34.40	78.91
RGCN + ESM2-35M	255M	30.39	45.75	37.38	43.63	82.51
RGCN × vanilla-Transformer	283M	27.97	42.43	34.91	40.72	81.12
Prot2Text BASE	283M	35.11	50.59	42.71	48.49	84.30
Prot2Text SMALL	256M	30.01	45.78	38.08	43.97	82.60
Prot2Text MEDIUM	398M	36.51	52.13	44.17	50.04	84.83
Prot2Text LARGE	898M	36.29	53.68	45.60	51.40	85.20

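Prot2Text BASE outperforms every baseline in the table on all five metrics (BLEU 35.11, Rouge-1 50.59, Rouge-2 42.71, Rouge-L 48.49, BERT Score 84.30). Among all evaluated models, Prot2Text MEDIUM has the highest BLEU (36.51) and Prot2Text LARGE has the highest Rouge-1 (53.68), Rouge-2 (45.60), Rouge-L (51.40), and BERT Score (85.20).
The multimodal encoder that combines RGCN with the ESM2-35M sequence encoder outperforms the unimodal baselines (vanilla-Transformer, ESM2-35M) and the simple fusion approaches.
Larger Prot2Text variants improve most metrics, and Prot2Text MEDIUM (398M) offers a good trade-off between accuracy and inference time.
RGCN alone outperforms the vanilla-Transformer, and RGCN + ESM2-35M substantially outperforms the vanilla configuration, underscoring the value of combining structural information with the sequence.
The design of the fusion block matters: simple concatenation (RGCN + ESM2-35M) underperforms the chosen fusion approach (BLEU 30.39 vs. 35.11 for Prot2Text BASE), suggesting a benefit from a cross-modal interaction mechanism.
A publicly available multimodal dataset (structure, sequence, and description) of 256,690 proteins supports benchmarking and future research.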
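
To make the comparisons in the results table easier to check, here is a small sketch that re-derives the per-metric leaders and the gain of Prot2Text BASE over the ESM2-35M baseline. The numbers are copied from the table; the helper names (`best_per_metric`, `gain`) are hypothetical and exist only for this illustration.

```python
METRICS = ["BLEU", "Rouge-1", "Rouge-2", "Rouge-L", "BERT Score"]

# Scores copied from the results table, in the order given by METRICS.
RESULTS = {
    "vanilla-Transformer": (15.75, 27.80, 19.44, 26.07, 75.58),
    "ESM2-35M": (32.11, 47.46, 39.18, 45.31, 83.21),
    "RGCN": (21.63, 36.20, 28.01, 34.40, 78.91),
    "RGCN + ESM2-35M": (30.39, 45.75, 37.38, 43.63, 82.51),
    "RGCN × vanilla-Transformer": (27.97, 42.43, 34.91, 40.72, 81.12),
    "Prot2Text BASE": (35.11, 50.59, 42.71, 48.49, 84.30),
    "Prot2Text SMALL": (30.01, 45.78, 38.08, 43.97, 82.60),
    "Prot2Text MEDIUM": (36.51, 52.13, 44.17, 50.04, 84.83),
    "Prot2Text LARGE": (36.29, 53.68, 45.60, 51.40, 85.20),
}


def best_per_metric(results):
    """Return {metric: (model, score)} for the top-scoring model on each metric."""
    best = {}
    for i, metric in enumerate(METRICS):
        name = max(results, key=lambda m: results[m][i])
        best[metric] = (name, results[name][i])
    return best


def gain(results, model, baseline):
    """Return the absolute score difference (model minus baseline) per metric."""
    return {
        metric: round(results[model][i] - results[baseline][i], 2)
        for i, metric in enumerate(METRICS)
    }


best = best_per_metric(RESULTS)
base_vs_esm = gain(RESULTS, "Prot2Text BASE", "ESM2-35M")

for metric in METRICS:
    model, score = best[metric]
    print(f"{metric}: best = {model} ({score}), BASE vs ESM2-35M = {base_vs_esm[metric]:+.2f}")
```

By these numbers, Prot2Text MEDIUM leads on BLEU and Prot2Text LARGE leads on the remaining four metrics, while Prot2Text BASE improves on the ESM2-35M baseline by roughly 3 points on BLEU and the Rouge metrics and by about 1 point on BERT Score.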

This review was generated by AI and checked by a human editor.