QUICK REVIEW

[Paper Review] Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information

Seonwoo Min, Seunghyun Park|arXiv (Cornell University)|Nov 25, 2019

Genomics and Phylogenetic Studies|References: 59|Cited by: 26

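One-Sentence Summary

This paper proposes PLUS, a novel pre-training scheme for deep bidirectional protein sequence representations that combines masked language modeling with a protein-specific same-family prediction task. By exploiting structural and evolutionary information in unlabeled sequences, PLUS-RNN outperforms similarly sized models pre-trained with language modeling alone on six of seven widely used protein biology tasks, showing better generalization and robustness, particularly on long sequences and complex structure prediction.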

ABSTRACT

Bridging the exponentially growing gap between the numbers of unlabeled and labeled protein sequences, several studies adopted semi-supervised learning for protein sequence modeling. In these studies, models were pre-trained with a substantial amount of unlabeled data, and the representations were transferred to various downstream tasks. Most pre-training methods solely rely on language modeling and often exhibit limited performance. In this paper, we introduce a novel pre-training scheme called PLUS, which stands for Protein sequence representations Learned Using Structural information. PLUS consists of masked language modeling and a complementary protein-specific pre-training task, namely same-family prediction. PLUS can be used to pre-train various model architectures. In this work, we use PLUS to pre-train a bidirectional recurrent neural network and refer to the resulting model as PLUS-RNN. Our experiment results demonstrate that PLUS-RNN outperforms other models of similar size solely pre-trained with the language modeling in six out of seven widely used protein biology tasks. Furthermore, we present the results from our qualitative interpretation analyses to illustrate the strengths of PLUS-RNN. PLUS provides a novel way to exploit evolutionary relationships among unlabeled proteins and is broadly applicable across a variety of protein biology tasks. We expect that the gap between the numbers of unlabeled and labeled proteins will continue to grow exponentially, and the proposed pre-training method will play a larger role.

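Research Motivation and Objectives

Develop a semi-supervised pre-training method that addresses the growing imbalance between unlabeled and labeled protein sequences.
Improve protein representation learning by incorporating evolutionary and structural relationships that go beyond standard language modeling.
Design a complementary pre-training task, same-family prediction, that captures functional and evolutionary similarity between proteins.
Evaluate PLUS on diverse downstream protein biology tasks, including function prediction, structure prediction, and transmembrane region detection.
Show that pre-training with structural information yields better generalization and performance than standard language modeling alone.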

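Proposed Method

The method introduces two pre-training objectives, masked language modeling (MLM) and same-family prediction (SFP), which are optimized jointly to learn protein representations.
PLUS-RNN is a bidirectional RNN pre-trained with the MLM and SFP tasks on large-scale unlabeled protein sequences.
The SFP task asks the model to predict whether two proteins belong to the same family, which pushes it to learn pairwise representations that exploit evolutionary relationships.
During pre-training, the model is optimized with a weighted combination of the MLM and SFP losses, where the hyperparameter λ_PT controls their relative importance.
Fine-tuning uses a joint loss of the MLM loss and a task-specific loss, where λ_FT controls the trade-off, to improve generalization.
The framework is evaluated on more than one architecture, including an RNN and a Transformer (PLUS-TFM), to show its applicability and robustness across model types.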
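
The weighted objectives described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the summary only states that λ_PT and λ_FT weight the two losses, so the convex form `lam * a + (1 - lam) * b` is an assumption, and the function names `masked_lm_loss`, `same_family_loss`, and `weighted_loss` are hypothetical.

```python
import math


def masked_lm_loss(predicted_probs, true_indices):
    """Mean cross-entropy over the masked positions.

    predicted_probs: one probability distribution (list of floats) per masked position.
    true_indices: index of the true amino acid at each masked position.
    """
    losses = [-math.log(probs[idx]) for probs, idx in zip(predicted_probs, true_indices)]
    return sum(losses) / len(losses)


def same_family_loss(prob_same, label):
    """Binary cross-entropy for one protein pair (label 1 = same family, 0 = different)."""
    return -(label * math.log(prob_same) + (1 - label) * math.log(1.0 - prob_same))


def weighted_loss(first, second, lam):
    """Assumed convex combination: lam weights `first`, (1 - lam) weights `second`."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * first + (1.0 - lam) * second


# Toy pre-training step: two masked positions over a 2-letter alphabet, one protein pair.
mlm = masked_lm_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
sfp = same_family_loss(0.8, 1)
pretraining_loss = weighted_loss(mlm, sfp, lam=0.7)  # lam plays the role of λ_PT

# Toy fine-tuning step: the same helper combines MLM with a task-specific loss.
task_loss = 0.4  # placeholder value for a downstream loss
finetuning_loss = weighted_loss(mlm, task_loss, lam=0.3)  # lam plays the role of λ_FT
```

Under this assumed form, setting `lam` to 1 recovers pure MLM training and setting it to 0 drops MLM entirely, which is the kind of contrast the ablations in the findings compare. The values 0.7 and 0.3 are illustrative only.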

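Experimental Results

Research Questions

RQ1: Can combining a protein-specific pre-training task with masked language modeling improve representation learning for downstream protein biology tasks?
RQ2: Does adding same-family prediction as a complementary pre-training objective improve model performance compared with language modeling alone?
RQ3: Does the PLUS framework generalize across protein sequence lengths, especially to long proteins that exceed the context window of attention-based models?
RQ4: To what extent does pre-training with structural and evolutionary information improve generalization and performance across diverse downstream tasks?
RQ5: How does joint optimization of the MLM and SFP tasks affect model robustness and fine-tuning performance?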

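Key Findings

PLUS-RNN outperforms all comparable models pre-trained with language modeling alone on six of seven benchmark protein biology tasks, showing better generalization.
The same-family prediction (SFP) task substantially improves performance, particularly when combined with MLM; the combination also outperforms the variant with MLM removed, indicating that the two tasks are complementary.
Fine-tuning jointly with the MLM loss and the task-specific loss consistently outperforms fine-tuning with the task-specific loss alone, indicating that MLM acts as a regularizer.
PLUS-RNN maintains strong performance across protein lengths, whereas PLUS-TFM degrades on long sequences (>512 amino acids), which highlights the limitation of fixed-context attention models.
Ablation experiments confirm that both pre-training tasks contribute positively; MLM has the stronger effect, while SFP supplies key evolutionary context.
The results indicate that exploiting evolutionary relationships through SFP improves the model's ability to capture functional and structural similarity between remotely related proteins.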

This review was generated by AI and reviewed by human editors.