QUICK REVIEW

[Paper Review] Learning protein sequence embeddings using information from structure

Tristan Bepler, Bonnie Berger|arXiv (Cornell University)|Feb 22, 2019

Machine Learning in Bioinformatics|Cited by 191

ONE-SENTENCE SUMMARY

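The paper trains a bidirectional LSTM encoder that maps a protein sequence to one embedding per residue, using weak supervision from global structural similarity and local residue-residue contacts. The resulting embeddings outperform other sequence-based methods at predicting structural similarity and transfer to other tasks such as transmembrane prediction.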

ABSTRACT

Inferring the structural properties of a protein from its amino acid sequence is a challenging yet important problem in biology. Structures are not known for the vast majority of protein sequences, but structure is critical for understanding function. Existing approaches for detecting structural similarity between proteins from sequence are unable to recognize and exploit structural patterns when sequences have diverged too far, limiting our ability to transfer knowledge between structurally related proteins. We newly approach this problem through the lens of representation learning. We introduce a framework that maps any protein sequence to a sequence of vector embeddings --- one per amino acid position --- that encode structural information. We train bidirectional long short-term memory (LSTM) models on protein sequences with a two-part feedback mechanism that incorporates information from (i) global structural similarity between proteins and (ii) pairwise residue contact maps for individual proteins. To enable learning from structural similarity information, we define a novel similarity measure between arbitrary-length sequences of vector embeddings based on a soft symmetric alignment (SSA) between them. Our method is able to learn useful position-specific embeddings despite lacking direct observations of position-level correspondence between sequences. We show empirically that our multi-task framework outperforms other sequence-based methods and even a top-performing structure-based alignment method when predicting structural similarity, our goal. Finally, we demonstrate that our learned embeddings can be transferred to other protein sequence problems, improving the state-of-the-art in transmembrane domain prediction.

MOTIVATION AND OBJECTIVES

Motivate learning protein representations that encode structural context from sequences with weak supervision from global structural similarity.
Develop a differentiable soft symmetric alignment (SSA) mechanism to compare sequences of embeddings.
Incorporate residue-residue contact information as a position-level supervision signal to improve embeddings.
Demonstrate that learned embeddings improve structural similarity prediction and transfer to other tasks like transmembrane prediction.

PROPOSED METHOD

Use a 3-layer bidirectional LSTM (biLSTM) encoder to map sequences to sequences of 100-dimensional embeddings.
Optionally incorporate hidden states from a protein language model pretrained on Pfam sequences as inputs to the encoder.
Define a soft symmetric alignment (SSA) between two embedding sequences to compute a global similarity score.
Relate the alignment score to SCOP-based similarity levels via ordinal regression with monotonic constraints.
Augment the objective with a residue-residue contact prediction task using pairwise embedding features and a convolutional predictor.
Train end-to-end with a multitask loss combining similarity and contact prediction losses (weighted by lambda).
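
The alignment, ordinal-regression, and loss-weighting steps listed above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. It assumes an L1 distance between embeddings, softmax-normalized row and column weights, sigmoid cumulative probabilities with non-negative slopes (one reading of the monotonic constraint), and a convex combination for the lambda weighting. The names ssa_score, cumulative_probs, and multitask_loss are hypothetical.

```python
import math


def l1(u, v):
    """L1 distance between two embedding vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def ssa_score(z1, z2):
    """Soft symmetric alignment score between two embedding sequences.

    z1 has n vectors, z2 has m vectors; n and m may differ.
    Assumption: distance is L1 and weights are softmaxes of negated distances.
    """
    n, m = len(z1), len(z2)
    dist = [[l1(u, v) for v in z2] for u in z1]
    # alpha: each position in z1 spreads weight over positions in z2.
    alpha = [softmax([-d for d in row]) for row in dist]
    # beta: each position in z2 spreads weight over positions in z1.
    cols = [softmax([-dist[i][j] for i in range(n)]) for j in range(m)]
    num, den = 0.0, 0.0
    for i in range(n):
        for j in range(m):
            a, b = alpha[i][j], cols[j][i]
            w = a + b - a * b  # soft "either direction aligns" weight
            num += w * dist[i][j]
            den += w
    # Negated weighted-average distance: higher means more similar.
    return -num / den


def cumulative_probs(score, slopes, offsets):
    """P(similarity level >= t) for each threshold t.

    Assumption: slopes are clipped at zero so every probability is
    non-decreasing in the alignment score.
    """
    return [
        1.0 / (1.0 + math.exp(-(max(s, 0.0) * score + b)))
        for s, b in zip(slopes, offsets)
    ]


def multitask_loss(similarity_loss, contact_loss, lam):
    """Assumed convex combination of the two task losses."""
    return lam * similarity_loss + (1.0 - lam) * contact_loss


# Toy usage: two short sequences of 2-dimensional embeddings.
seq_a = [[0.0, 0.0], [5.0, 5.0]]
seq_b = [[3.0, 3.0], [8.0, 8.0]]
score = ssa_score(seq_a, seq_b)
print(score, cumulative_probs(score, [1.0, 2.0], [0.0, -1.0]))
```

Because the score is a weighted average of negated distances, it is never positive, and it is symmetric in its two arguments, so sequences of different lengths can be compared in either order.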

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can per-residue embeddings learned from sequence capture structural-context information without direct position-level alignment data?
RQ2: Does soft symmetric alignment outperform other alignment schemes for comparing embedding sequences?
RQ3: Does incorporating local contact information improve the learned embeddings and downstream predictions?
RQ4: Are the learned embeddings transferable to other protein prediction tasks such as transmembrane region prediction?

KEY FINDINGS

The Class, Fold, Superfamily, and Family columns report average precision for retrieval at that SCOP level.

SCOPe ASTRAL 2.06 test set
Model	Accuracy	r	ρ	Class	Fold	Superfamily	Family
NW-align	0.78462	0.18854	0.14046	0.30898	0.40875	0.58435	0.52703
phmmer [HMMER 3.2.1]	0.78454	0.21657	0.06857	0.26022	0.34655	0.53576	0.50316
HHalign [HHsuite 3.0.0]	0.78851	0.36759	0.23240	0.40347	0.62065	0.86444	0.52220
TMalign	0.80831	0.61687	0.37405	0.54866	0.85072	0.83340	0.57059
SSA (full)	0.95149	0.90954	0.69018	0.91458	0.90229	0.95262	0.64781

SCOPe 2.07 new test set
Model	Accuracy	r	ρ	Class	Fold	Superfamily	Family
NW-align	0.80842	0.37671	0.23101	0.43953	0.77081	0.86631	0.82442
phmmer [HMMER 3.2.1]	0.80907	0.65326	0.25063	0.38253	0.72475	0.82879	0.81116
HHalign [HHsuite 3.0.0]	0.80883	0.68831	0.27032	0.47761	0.83886	0.94122	0.82284
TMalign	0.81275	0.81354	0.39702	0.59277	0.91588	0.93936	0.82301
SSA (full)	0.93151	0.92900	0.66860	0.89444	0.93966	0.96266	0.86602

The SSA embedding model achieves state-of-the-art performance on predicting structural similarity from sequence, outperforming sequence-based methods and even a structure-based aligner (TMalign) on SCOP-based tasks.
On the SCOPe ASTRAL 2.06 test set, SSA (full) achieves accuracy 0.95149, Pearson r = 0.90954, Spearman ρ = 0.69018, and higher average precision for class/fold/superfamily/family retrieval compared with baselines.
On the SCOPe 2.07 new test set, SSA (full) achieves accuracy 0.93151, r = 0.92900, ρ = 0.66860, with strong fold/superfamily/family retrieval performance.
Ablation shows SSA alignment, language-model inputs, and including residue-contact supervision all contribute to improved structural similarity and secondary structure predictions.
Language-model pretraining on large unlabeled protein sequences substantially improves SCOP similarity classification compared to not using LM inputs.
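
The metrics reported in the table can be computed along these lines. This is a minimal sketch in plain Python, assuming accuracy is the fraction of sequence pairs whose predicted similarity level matches the true level, Spearman ρ uses average ranks for ties, and each class/fold/superfamily/family value is an average precision over pairs ranked by predicted score. The summary does not spell out these details, and the function names are hypothetical.

```python
import math


def accuracy(predicted_levels, true_levels):
    """Fraction of pairs whose predicted level equals the true level."""
    hits = sum(p == t for p, t in zip(predicted_levels, true_levels))
    return hits / len(true_levels)


def pearson(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def average_ranks(xs):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def average_precision(scores, labels):
    """Mean of precision@k over the ranks k at which a positive appears."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


# Toy usage: predicted scores against true similarity levels (0 to 4).
scores = [0.9, 0.2, 0.7, 0.4]
levels = [4, 0, 2, 2]
print(pearson(scores, levels), spearman(scores, levels))
# Retrieval at one level: label a pair positive if it shares that level.
print(average_precision(scores, [lv >= 2 for lv in levels]))
```

Tie handling matters here because the true similarity levels are a small set of discrete values, so many pairs share the same rank.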

This review was generated by AI and reviewed by a human editor.