QUICK REVIEW

[Paper Review] Learning protein sequence embeddings using information from structure

Tristan Bepler, Bonnie Berger|arXiv (Cornell University)|Feb 22, 2019

Machine Learning in Bioinformatics | Cited by 191

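One-sentence summary

This paper trains a bidirectional LSTM-based encoder that maps a protein sequence to per-residue embeddings, using global structural similarity and local residue contacts as weak supervision. The resulting model markedly improves sequence-based prediction of structural similarity, and its embeddings transfer to other tasks such as transmembrane prediction.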

ABSTRACT

Inferring the structural properties of a protein from its amino acid sequence is a challenging yet important problem in biology. Structures are not known for the vast majority of protein sequences, but structure is critical for understanding function. Existing approaches for detecting structural similarity between proteins from sequence are unable to recognize and exploit structural patterns when sequences have diverged too far, limiting our ability to transfer knowledge between structurally related proteins. We newly approach this problem through the lens of representation learning. We introduce a framework that maps any protein sequence to a sequence of vector embeddings --- one per amino acid position --- that encode structural information. We train bidirectional long short-term memory (LSTM) models on protein sequences with a two-part feedback mechanism that incorporates information from (i) global structural similarity between proteins and (ii) pairwise residue contact maps for individual proteins. To enable learning from structural similarity information, we define a novel similarity measure between arbitrary-length sequences of vector embeddings based on a soft symmetric alignment (SSA) between them. Our method is able to learn useful position-specific embeddings despite lacking direct observations of position-level correspondence between sequences. We show empirically that our multi-task framework outperforms other sequence-based methods and even a top-performing structure-based alignment method when predicting structural similarity, our goal. Finally, we demonstrate that our learned embeddings can be transferred to other protein sequence problems, improving the state-of-the-art in transmembrane domain prediction.

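Research motivation and objectives

Learn protein representations that encode structural context from sequence, using global structural similarity as a weak supervision signal.
Develop a differentiable soft symmetric alignment (SSA) mechanism for comparing sequences of embeddings.
Incorporate residue-residue contact information as position-level supervision to improve the embeddings.
Show that the learned embeddings improve structural similarity prediction and transfer to other protein prediction tasks such as transmembrane prediction.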

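Proposed method

Uses a three-layer bidirectional LSTM (biLSTM) encoder that maps a sequence to a series of 100-dimensional embeddings, one per residue.
Optionally feeds the hidden states of a protein language model pretrained on Pfam sequences into the encoder as input.
Defines a soft symmetric alignment (SSA) between two embedding sequences and uses it to compute a global similarity score.
Relates the alignment score to SCOP-based similarity levels through ordinal regression with a monotonicity constraint.
Adds local contact information to the objective as a residue-residue contact prediction task, using pairwise embedding features and a convolutional predictor.
Trains end to end with a joint multi-task loss in which the similarity loss and the contact prediction loss are weighted by λ.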
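
The SSA scoring and ordinal regression steps listed above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. The L1 distance between embeddings, the softmax weighting in both directions, the cumulative-logit form of the ordinal link, and the function names `ssa_score` and `threshold_probs` are all assumptions made for this sketch; the review only states that the alignment is soft and symmetric and that the regression is monotonic.

```python
import math


def l1_distance(u, v):
    """L1 distance between two embedding vectors (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def ssa_score(z1, z2):
    """Soft symmetric alignment score between two embedding sequences.

    z1 and z2 are lists of equal-dimension vectors and may differ in length.
    The score is the negated, alignment-weighted mean distance, so it is
    always <= 0 and higher means more similar.
    """
    n, m = len(z1), len(z2)
    dist = [[l1_distance(z1[i], z2[j]) for j in range(m)] for i in range(n)]
    # alpha[i][j]: how strongly position i of z1 attends to position j of z2.
    alpha = [softmax([-d for d in row]) for row in dist]
    # beta_cols[j][i]: how strongly position j of z2 attends to position i of z1.
    beta_cols = [softmax([-dist[i][j] for i in range(n)]) for j in range(m)]
    weighted, total_weight = 0.0, 0.0
    for i in range(n):
        for j in range(m):
            a, b = alpha[i][j], beta_cols[j][i]
            w = a + b - a * b  # soft "or" of the two directions
            weighted += w * dist[i][j]
            total_weight += w
    return -weighted / total_weight


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def threshold_probs(score, slopes, biases):
    """P(similarity level >= t) for each threshold t (hypothetical link).

    Requiring every slope to be non-negative is the monotonicity
    constraint: a higher score never lowers any threshold probability.
    """
    if any(s < 0 for s in slopes):
        raise ValueError("slopes must be non-negative")
    return [sigmoid(s * score + b) for s, b in zip(slopes, biases)]
```

Because the weights combine both attention directions, `ssa_score(z1, z2)` equals `ssa_score(z2, z1)` up to floating-point rounding, and neither sequence needs a position-level correspondence with the other.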

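Experimental results

Research questions

RQ1: Can per-residue embeddings learned from sequence capture structural context without direct position-level alignment data?
RQ2: Does SSA outperform other alignment schemes for comparing embedding sequences?
RQ3: Does incorporating local contact information improve the learned embeddings and downstream predictions?
RQ4: Do the learned embeddings transfer to other protein prediction tasks, such as transmembrane region prediction?

Key findings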

Accuracy, Pearson r, and Spearman ρ measure similarity-level prediction; the Class, Fold, Superfamily, and Family columns give average precision for retrieval at that SCOP level.

Model	Accuracy	r	ρ	Class	Fold	Superfamily	Family
SCOPe ASTRAL 2.06 test set
NW-align	0.78462	0.18854	0.14046	0.30898	0.40875	0.58435	0.52703
phmmer [HMMER 3.2.1]	0.78454	0.21657	0.06857	0.26022	0.34655	0.53576	0.50316
HHalign [HHsuite 3.0.0]	0.78851	0.36759	0.23240	0.40347	0.62065	0.86444	0.52220
TMalign	0.80831	0.61687	0.37405	0.54866	0.85072	0.83340	0.57059
SSA (full)	0.95149	0.90954	0.69018	0.91458	0.90229	0.95262	0.64781
SCOPe 2.07 new test set
NW-align	0.80842	0.37671	0.23101	0.43953	0.77081	0.86631	0.82442
phmmer [HMMER 3.2.1]	0.80907	0.65326	0.25063	0.38253	0.72475	0.82879	0.81116
HHalign [HHsuite 3.0.0]	0.80883	0.68831	0.27032	0.47761	0.83886	0.94122	0.82284
TMalign	0.81275	0.81354	0.39702	0.59277	0.91588	0.93936	0.82301
SSA (full)	0.93151	0.92900	0.66860	0.89444	0.93966	0.96266	0.86602

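The SSA embedding model achieves state-of-the-art performance in predicting structural similarity from sequence. On the SCOP-based tasks it outperforms the sequence-based methods and also scores higher than the structure-based aligner TMalign on every reported metric.
On the SCOPe ASTRAL 2.06 test set, SSA (full) reaches accuracy 0.95149, Pearson r = 0.90954, and Spearman ρ = 0.69018, and its average precision for class, fold, superfamily, and family retrieval is higher than that of every baseline.
On the SCOPe 2.07 new test set, SSA (full) reaches accuracy 0.93151, r = 0.92900, and ρ = 0.66860, and again exceeds every baseline in class, fold, superfamily, and family retrieval.
Ablation experiments indicate that the SSA alignment, the language model input, and residue contact supervision each contribute to improved structural similarity prediction and secondary structure prediction.
Pretraining the language model on a large set of unlabeled protein sequences substantially improves SCOP similarity classification compared with using no language model input.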
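
For readers who want to reproduce the kinds of metrics reported above, the following sketch shows how accuracy, Pearson r, Spearman ρ, and average precision are commonly computed. It is illustrative only: the function names are hypothetical, ties are handled with average ranks, and average precision is used as a step-wise summary of the precision-recall curve. The paper's exact evaluation protocol may differ.

```python
import math


def accuracy(predicted, actual):
    """Fraction of pairs whose predicted similarity level is exactly right."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def average_precision(scores, labels):
    """Average precision for retrieval: mean precision at each true hit.

    scores: predicted similarity for each candidate pair.
    labels: truthy if the pair really shares the SCOP level in question.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precision_sum = 0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / k
    return precision_sum / hits if hits else 0.0
```

Similarity levels take only a few discrete values, so ties are frequent and the tie handling in `average_ranks` matters for ρ.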

This review was generated by AI and checked by a human editor.