QUICK REVIEW

[Paper Review] Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers

Hadi Abdine, Michail Chatzianastasis|arXiv (Cornell University)|Jul 25, 2023

Machine Learning in Bioinformatics | References: 45 | Cited by: 13

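ONE-SENTENCE SUMMARY

Prot2Text generates free-text descriptions of protein function by fusing a graph-based structural representation with a sequence model in an encoder-decoder GNN+LLM framework, and is evaluated on a multimodal dataset derived from SwissProt.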

ABSTRACT

In recent years, significant progress has been made in the field of protein function prediction with the development of various machine-learning approaches. However, most existing methods formulate the task as a multi-classification problem, i.e. assigning predefined labels to proteins. In this work, we propose a novel approach, Prot2Text, which predicts a protein's function in a free text style, moving beyond the conventional binary or categorical classifications. By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs), in an encoder-decoder framework, our model effectively integrates diverse data types including protein sequence, structure, and textual annotation and description. This multimodal approach allows for a holistic representation of proteins' functions, enabling the generation of detailed and accurate functional descriptions. To evaluate our model, we extracted a multimodal protein dataset from SwissProt, and demonstrate empirically the effectiveness of Prot2Text. These results highlight the transformative impact of multimodal models, specifically the fusion of GNNs and LLMs, empowering researchers with powerful tools for more accurate function prediction of existing as well as first-to-see proteins.

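MOTIVATION AND OBJECTIVES

Reframe protein function prediction as free-text generation rather than assignment of fixed labels.
Integrate sequence, structure, and textual annotation in a single multimodal encoder.
Show that fusing graph-structure and sequence information improves the generated function descriptions.
Provide a large-scale, publicly available multimodal protein dataset for benchmarking.
Evaluate the trade-offs among model size, performance, and inference cost.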

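PROPOSED METHOD

Build a heterogeneous protein graph from AlphaFold structures, with sequence, spatial, and hydrogen-bond edge types.
Encode the graph with a relational graph convolutional network (RGCN), producing h_G.
Encode the sequence with the pretrained ESM2-35M model and project it to a common dimension.
Fuse the sequence and graph representations in a fusion block that adds the projected graph feature to every residue embedding, followed by projection and normalization.
Decode the free-text protein description with a GPT-2-based Transformer decoder that cross-attends to the fused protein representation.
Train with causal language modeling (CLM) to generate descriptions of up to 256 tokens; use the GPT-2 tokenizer with two added tokens that mark sequence boundaries.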
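
The fusion step described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (fuse, matvec, layer_norm, random_matrix), the toy dimensions, and the random weights are all assumptions made for illustration, and the normalization here has no learned scale or shift. A real model would use learned layers in a deep learning framework.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def fuse(residue_embeddings, graph_vector, graph_proj, out_proj):
    """Add the projected graph vector to every residue embedding,
    then apply an output projection and normalization."""
    g = matvec(graph_proj, graph_vector)  # graph vector mapped to the residue dimension
    fused = []
    for h in residue_embeddings:
        summed = [a + b for a, b in zip(h, g)]  # element-wise addition
        fused.append(layer_norm(matvec(out_proj, summed)))
    return fused


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or encoder outputs."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_residues, d_graph, d_model = 5, 3, 4  # toy sizes, not the paper's

residues = random_matrix(n_residues, d_model, rng)      # stand-in for projected ESM2 outputs
h_G = [rng.uniform(-1.0, 1.0) for _ in range(d_graph)]  # stand-in for the RGCN graph vector

fused = fuse(
    residues,
    h_G,
    random_matrix(d_model, d_graph, rng),  # graph projection (assumed shape)
    random_matrix(d_model, d_model, rng),  # output projection (assumed shape)
)
print(len(fused), len(fused[0]))  # one fused vector per residue
```

Because the same graph vector is added to every residue, each position in the fused sequence carries global structural context, and the decoder can then cross-attend over the per-residue outputs.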

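EXPERIMENTAL RESULTS

Research Questions

RQ1 Can multimodal fusion of protein structure and sequence enable detailed free-text generation of protein function?
RQ2 How does integrating GNN-based structure encoding with a protein language model affect text generation quality?
RQ3 Which datasets and evaluation metrics best capture the improvement over unimodal baselines?
RQ4 How does model size in Prot2Text affect generation quality and inference time?
RQ5 Is a dedicated fusion mechanism better than simple concatenation for protein-to-text generation?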

Key Findings

Model	# Params	BLEU	Rouge-1	Rouge-2	Rouge-L	BERTScore
vanilla-Transformer	225M	15.75	27.80	19.44	26.07	75.58
ESM2-35M	225M	32.11	47.46	39.18	45.31	83.21
RGCN	220M	21.63	36.20	28.01	34.40	78.91
RGCN + ESM2-35M	255M	30.39	45.75	37.38	43.63	82.51
RGCN × vanilla-Transformer	283M	27.97	42.43	34.91	40.72	81.12
Prot2Text BASE	283M	35.11	50.59	42.71	48.49	84.30
Prot2Text SMALL	256M	30.01	45.78	38.08	43.97	82.60
Prot2Text MEDIUM	398M	36.51	52.13	44.17	50.04	84.83
Prot2Text LARGE	898M	36.29	53.68	45.60	51.40	85.20
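
The figures in the table above can be compared programmatically. The following is a small sketch in plain Python: the scores are copied from the table, while the names RESULTS, METRICS, ENCODER_COMPARISON, best_per_metric, and gain are hypothetical helpers introduced here. Treating the first six rows as the same-scale encoder comparison is an assumption based on their similar parameter counts.

```python
# Scores copied from the table: (params in millions, BLEU, Rouge-1, Rouge-2, Rouge-L, BERTScore)
RESULTS = {
    "vanilla-Transformer": (225, 15.75, 27.80, 19.44, 26.07, 75.58),
    "ESM2-35M": (225, 32.11, 47.46, 39.18, 45.31, 83.21),
    "RGCN": (220, 21.63, 36.20, 28.01, 34.40, 78.91),
    "RGCN + ESM2-35M": (255, 30.39, 45.75, 37.38, 43.63, 82.51),
    "RGCN × vanilla-Transformer": (283, 27.97, 42.43, 34.91, 40.72, 81.12),
    "Prot2Text BASE": (283, 35.11, 50.59, 42.71, 48.49, 84.30),
    "Prot2Text SMALL": (256, 30.01, 45.78, 38.08, 43.97, 82.60),
    "Prot2Text MEDIUM": (398, 36.51, 52.13, 44.17, 50.04, 84.83),
    "Prot2Text LARGE": (898, 36.29, 53.68, 45.60, 51.40, 85.20),
}
METRICS = ("BLEU", "Rouge-1", "Rouge-2", "Rouge-L", "BERTScore")

# Assumption: the first six rows form the encoder comparison at similar scale.
ENCODER_COMPARISON = list(RESULTS)[:6]


def best_per_metric(results, names=None):
    """Return the top-scoring model name for each metric."""
    names = list(results) if names is None else names
    return {
        metric: max(names, key=lambda name: results[name][i])
        for i, metric in enumerate(METRICS, start=1)
    }


def gain(results, model, baseline):
    """Absolute score difference (model minus baseline) for each metric."""
    return {
        metric: round(results[model][i] - results[baseline][i], 2)
        for i, metric in enumerate(METRICS, start=1)
    }


print(best_per_metric(RESULTS))                      # best over all nine rows
print(best_per_metric(RESULTS, ENCODER_COMPARISON))  # best among the encoder variants
print(gain(RESULTS, "Prot2Text BASE", "ESM2-35M"))   # fused encoder vs. sequence-only
```

Separating the encoder comparison from the size variants makes it easier to see which claims concern the fusion design and which concern model scale.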

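Among the encoder variants compared at a similar scale, Prot2Text BASE achieves the highest BLEU (35.11), Rouge-1 (50.59), Rouge-2 (42.71), Rouge-L (48.49), and BERTScore (84.30). Across all listed models, Prot2Text MEDIUM has the best BLEU (36.51) and Prot2Text LARGE the best Rouge-1 (53.68), Rouge-2 (45.60), Rouge-L (51.40), and BERTScore (85.20).
The multimodal encoder, which combines RGCN with the ESM2-35M sequence encoder, outperforms the unimodal baselines (vanilla-Transformer, ESM2-35M) and the simpler fusion approaches.
Larger Prot2Text variants improve most metrics; Prot2Text MEDIUM (398M) offers a good trade-off between accuracy and inference time.
RGCN alone outperforms vanilla-Transformer, and RGCN + ESM2-35M substantially exceeds the vanilla-Transformer configurations, highlighting the value of structure-aware sequence integration.
The design of the fusion block matters: simple concatenation (RGCN + ESM2-35M) underperforms the chosen fusion method, suggesting a benefit from the cross-modal interaction mechanism.
A publicly released multimodal dataset of 256,690 proteins (structure, sequence, description) supports benchmarking and future research.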

This review was generated by AI and reviewed by human editors.