QUICK REVIEW

[Paper Review] Deep transfer learning in the assessment of the quality of protein models

David Menéndez Hurtado, Karolis Uziela|arXiv (Cornell University)|Apr 17, 2018
Protein Structure and Dynamics | Cited by 36
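One-sentence summary

This paper proposes a deep transfer learning framework for protein model quality assessment that relies on a minimal set of structural features predicted from the sequence. Using pretrained convolutional networks and a three-headed (tricephalous) architecture that encodes comparative ranking, the method reaches state-of-the-art (SOTA) performance with reduced input complexity: despite a coarse description of the input structures, it outperforms existing models in global score prediction and target ranking.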

ABSTRACT

MOTIVATION: Proteins fold into complex structures that are crucial for their biological functions. Experimental determination of protein structures is costly and therefore limited to a small fraction of all known proteins. Hence, different computational structure prediction methods are necessary for the modelling of the vast majority of all proteins. In most structure prediction pipelines, the last step is to select the best available model and to estimate its accuracy. This model quality estimation problem has been growing in importance during the last decade, and progress is believed to be important for large scale modelling of proteins. The current generation of model quality estimation programs performs well at separating incorrect and good models, but fails to consistently identify the best possible model. State-of-the-art model quality assessment methods use a combination of features that describe a model and the agreement of the model with features predicted from the protein sequence. RESULTS: We first introduce a deep neural network architecture to predict model quality using significantly fewer input features than state-of-the-art methods. Thereafter, we propose a methodology to train the deep network that leverages the comparative structure of the problem. We also show the possibility of applying transfer learning on databases of known protein structures. We demonstrate its viability by reaching state-of-the-art performance using only a reduced set of input features and a coarse description of the models. AVAILABILITY: The code will be freely available for download at github.com/ElofssonLab/ProQ4.

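Motivation and objectives

  • To address the challenge of selecting the best protein model from multiple predictions in large-scale structural bioinformatics workflows.
  • To reduce reliance on complex structural features by using only sequence-predicted properties as input.
  • To improve model quality assessment performance through transfer learning and a structured deep learning architecture.
  • To achieve scalable, fast, and robust quality assessment that is independent of side-chain packing and external tools.
  • To mitigate the bias introduced by shared predictors by learning internal representations rather than raw outputs.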

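Proposed method

  • Designs a tricephalous (three-headed) deep neural network architecture that compares multiple models of the same protein and learns their relative quality ranking.
  • Pretrains on a large database of known protein structures to learn general structural features from sequence-derived inputs such as secondary structure and solvent accessibility.
  • Applies transfer learning by initializing the network with features from a model pretrained on a related but different dataset, to improve generalization.
  • Uses only coarse structural descriptions predicted from the sequence (secondary structure, solvent accessibility, and residue depth), avoiding detailed 3D coordinates.
  • Trains comparatively by feeding pairs of models into the network, which learns to predict which model is better, improving ranking accuracy.
  • Fine-tunes on CASP11 data with a loss function that jointly optimizes global and local score prediction and target ranking.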
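
The comparative training in the list above can be illustrated with a small loss function. This is a minimal sketch under assumptions, not the authors' implementation: it assumes the three heads output a score for model A, a score for model B, and their difference, and it uses a plain squared-error loss. The names `PairPrediction`, `tricephalous_loss`, `make_pairs`, and the `diff_weight` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PairPrediction:
    """Hypothetical outputs of a three-headed network for one pair of models."""
    score_a: float  # predicted quality of model A
    score_b: float  # predicted quality of model B
    diff_ab: float  # predicted difference (score_a - score_b), assumed third head


def tricephalous_loss(pred, true_a, true_b, diff_weight=1.0):
    """Squared-error loss over both scores plus the pairwise difference.

    The difference term is what rewards the network for ordering the
    two models correctly, in addition to scoring each one well.
    """
    err_a = (pred.score_a - true_a) ** 2
    err_b = (pred.score_b - true_b) ** 2
    err_diff = (pred.diff_ab - (true_a - true_b)) ** 2
    return err_a + err_b + diff_weight * err_diff


def make_pairs(scores):
    """Return all ordered pairs of distinct model names for one target."""
    names = sorted(scores)
    return [(a, b) for a in names for b in names if a != b]


if __name__ == "__main__":
    # Illustrative true scores for three models of the same target.
    true_scores = {"model_1": 0.8, "model_2": 0.5, "model_3": 0.3}
    for a, b in make_pairs(true_scores):
        # A deliberately naive prediction: same score for every model.
        pred = PairPrediction(score_a=0.5, score_b=0.5, diff_ab=0.0)
        print(a, b, tricephalous_loss(pred, true_scores[a], true_scores[b]))
```

Pairs are drawn only from models of the same target, since quality differences between models of different proteins carry no ranking information.
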
Figure 1: Detail of the 3D structure of the protein 3TDU. Highlighted in yellow are the residues that smoothly transition between helix and coil. Predictions are commonly wrong about the exact position of the boundary.

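Experimental results

Research questions

  • RQ1: Can a deep learning model that uses only sequence-derived features achieve state-of-the-art performance in protein model quality assessment?
  • RQ2: How much does transfer learning from known protein structures improve model quality prediction performance?
  • RQ3: To what extent does a ranking-based comparative training strategy improve prediction accuracy over standard regression?
  • RQ4: Can high performance still be achieved with a minimal input representation restricted to features such as secondary structure and solvent accessibility?
  • RQ5: Can the model effectively reduce the bias introduced by shared predictors by learning internal representations instead of using the outputs of external tools?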

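Main findings

  • The proposed ProQ4 method achieves state-of-the-art performance on CASP11; despite using only a reduced set of input features, it outperforms existing methods in global score prediction and target ranking.
  • Transfer learning significantly improves the performance of the convolutional architecture, whereas the multilayer perceptron gains nothing from pretraining and can even be harmed by it.
  • The tricephalous architecture learns to rank models effectively; its rankings agree closely with the true scores and correlate strongly with those of other top methods.
  • The method is robust to variations in side-chain packing because it relies only on coarse structural features that do not depend on explicit 3D coordinates.
  • The correlation matrix shows that ProQ4's predictions agree closely with those of other high-performing methods, indicating reliable and stable performance.
  • Even with very sparse input, the model performs well, indicating that deep learning can extract a meaningful quality signal from low-dimensional, sequence-based features.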
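
To make "global score prediction" and "target ranking" concrete, the sketch below computes two measures for a single target: the Pearson correlation between predicted and true scores, and a first-rank loss (how much true quality is lost by picking the top-predicted model). Both metrics are assumptions chosen for illustration; this summary does not state which measures the paper reports. The function names and the example scores are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def first_rank_loss(predicted, true):
    """True score of the best model minus true score of the top-predicted model."""
    top = max(predicted, key=predicted.get)
    return max(true.values()) - true[top]


def evaluate_target(predicted, true):
    """Evaluate one target; both dicts map model name -> score."""
    names = sorted(true)
    r = pearson([predicted[n] for n in names], [true[n] for n in names])
    return {"pearson": r, "first_rank_loss": first_rank_loss(predicted, true)}


if __name__ == "__main__":
    # Illustrative numbers only, not data from the paper.
    predicted = {"model_1": 0.9, "model_2": 0.4, "model_3": 0.7}
    true = {"model_1": 0.6, "model_2": 0.3, "model_3": 0.8}
    print(evaluate_target(predicted, true))
```

A method can correlate well with the true scores and still have a non-zero first-rank loss, which is the gap the abstract describes when it says current methods fail to consistently identify the best model.
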
Figure 2: The 1D ResNet module, the main building block of our convolutional nets
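
The idea behind a 1D residual module can be sketched in a few lines of plain Python: two convolutions along the sequence with a nonlinearity in between, added back onto the input through a skip connection. This is a simplified single-channel illustration under assumptions. The kernel sizes, the ReLU activation, and the absence of normalization are choices made here and are not taken from the paper; all function names are hypothetical.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; output length equals input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half  # input position covered by this kernel tap
            if 0 <= j < len(signal):
                acc += weight * signal[j]
        out.append(acc)
    return out


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def residual_block_1d(signal, kernel1, kernel2):
    """Compute signal + conv(relu(conv(signal))): a skip connection around two convolutions."""
    hidden = relu(conv1d_same(signal, kernel1))
    update = conv1d_same(hidden, kernel2)
    return [x + u for x, u in zip(signal, update)]


if __name__ == "__main__":
    # One per-residue feature along a short sequence (illustrative values).
    features = [0.2, 0.9, 0.4, 0.1, 0.7]
    smoothing = [0.25, 0.5, 0.25]
    print(residual_block_1d(features, smoothing, smoothing))
```

Because the output has the same length as the input, such blocks can be stacked on proteins of any length, and with all-zero kernels the block reduces to the identity.
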

This review was generated by AI and reviewed by human editors.