QUICK REVIEW

[Paper Review] WideDTA: prediction of drug-target binding affinity

Hakime Öztürk, Elif Özkırımlı|arXiv (Cornell University)|Feb 4, 2019

Computational Drug Discovery Methods | References: 9 | Citations: 152

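One-line summary

WideDTA predicts drug-target binding affinity using word-based representations derived from protein sequences, ligand SMILES, PROSITE domains/motifs, and ligand maximum common substructures (MCS), and it outperforms DeepDTA on benchmark datasets. Domain/motif information and MCS words bring only limited benefit on these kinase-centered datasets.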

ABSTRACT

Motivation: Prediction of the interaction affinity between proteins and compounds is a major challenge in the drug discovery process. WideDTA is a deep-learning based prediction model that employs chemical and biological textual sequence information to predict binding affinity. Results: WideDTA uses four text-based information sources, namely the protein sequence, ligand SMILES, protein domains and motifs, and maximum common substructure words to predict binding affinity. WideDTA outperformed one of the state of the art deep learning methods for drug-target binding affinity prediction, DeepDTA on the KIBA dataset with a statistical significance. This indicates that the word-based sequence representation adapted by WideDTA is a promising alternative to the character-based sequence representation approach in deep learning models for binding affinity prediction, such as the one used in DeepDTA. In addition, the results showed that, given the protein sequence and ligand SMILES, the inclusion of protein domain and motif information as well as ligand maximum common substructure words do not provide additional useful information for the deep learning model. Interestingly, however, using only domain and motif information to represent proteins achieved similar performance to using the full protein sequence, suggesting that important binding relevant information is contained within the protein motifs and domains.

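Motivation and objectives

Predict protein-ligand binding affinity using text-based representations of proteins and ligands.
Evaluate whether adding domain/motif information and maximum common substructure information improves prediction accuracy.
Compare the word-based WideDTA against the earlier character-based model and conventional methods on the Davis and KIBA datasets.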

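Proposed method

Represent the protein sequence as 3-residue words (PS).
Represent the ligand SMILES as 8-character words using a sliding window (LS).
Extract protein domains/motifs from PROSITE and represent them as 3-residue words (PDM).
Extract ligand maximum common substructures and represent them as words (LMCS).
For each information source, extract features with two 1D CNN layers followed by max pooling, concatenate the resulting features, and pass them through three fully connected layers (with dropout).
Train and evaluate on the Davis and KIBA datasets using the Concordance Index (CI), MSE, and Pearson correlation, and compare against KronRLS, SimBoost, and DeepDTA.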
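The word-based representations listed above (PS and LS) can be illustrated with a minimal Python sketch. The review states only the word lengths (3 residues, 8 SMILES characters) and that a sliding window is used for SMILES. The step size of one character, the helper names (`sliding_words`, `protein_words`, `ligand_words`, `build_vocabulary`, `encode`), and the zero-padding scheme are all assumptions made for illustration, not details confirmed by the paper.

```python
def sliding_words(sequence, word_len, step=1):
    """Split a string into fixed-length words using a sliding window.

    step=1 (fully overlapping windows) is an assumption of this sketch.
    """
    if word_len <= 0 or step <= 0:
        raise ValueError("word_len and step must be positive")
    if len(sequence) < word_len:
        return []
    last_start = len(sequence) - word_len
    return [sequence[i:i + word_len] for i in range(0, last_start + 1, step)]


def protein_words(protein_sequence):
    # PS: 3-residue words over the amino-acid sequence.
    return sliding_words(protein_sequence, word_len=3)


def ligand_words(smiles):
    # LS: 8-character words over the SMILES string.
    return sliding_words(smiles, word_len=8)


def build_vocabulary(word_lists):
    # Map each distinct word to an integer id; id 0 is reserved for padding.
    vocab = {}
    for words in word_lists:
        for word in words:
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(words, vocab, max_len):
    # Convert words to ids, truncate to max_len, then right-pad with 0.
    # Words missing from the vocabulary also map to 0 in this sketch.
    ids = [vocab.get(word, 0) for word in words][:max_len]
    return ids + [0] * (max_len - len(ids))
```

For example, `protein_words("MKKFFD")` yields `["MKK", "KKF", "KFF", "FFD"]`. The integer sequences returned by `encode` are the kind of input an embedding layer followed by 1D convolutions would consume.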

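Experimental results

Research questions

RQ1: Do word-based representations of proteins and ligands improve binding affinity prediction compared with character-based methods?
RQ2: Do domain/motif information (PDM) and LMCS words provide additional predictive value beyond what the full protein sequence (PS) and ligand SMILES (LS) already offer?
RQ3: How does WideDTA perform on the Davis and KIBA benchmarks compared with state-of-the-art methods?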

Key findings

Dataset	Model	Protein	Compound	CI	MSE	Pearson
Davis	WideDTA (best)	PS + PDM	LS + LMCS	0.886 (0.003)	0.262 (0.009)	0.814 (0.003)
Davis	KronRLS	S-W	PubChem Sim	0.871 (0.0008)	0.379	-
Davis	SimBoost	S-W	PubChem Sim	0.872 (0.002)	0.282	-
Davis	DeepDTA (char)	PS (char)	LS (char)	0.878 (0.004)	0.261	-
KIBA	WideDTA (best)	PS + PDM	LS + LMCS	0.875 (0.001)	0.179 (0.008)	0.856 (0.003)
KIBA	KronRLS	S-W	PubChem Sim	0.782 (0.0009)	0.411	-
KIBA	SimBoost	S-W	PubChem Sim	0.836 (0.001)	0.222	-
KIBA	DeepDTA (char)	PS (char)	LS (char)	0.863 (0.002)	0.194	-

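WideDTA with all four modules achieved the highest CI on the Davis dataset, 0.886, with an MSE of 0.262.
On KIBA, the best WideDTA configuration reached a CI of 0.875 and an MSE of 0.179.
Using PS + LS alone already outperforms DeepDTA on KIBA (CI 0.874 versus 0.863); on Davis, the PS + LS CI of 0.874 is close to, but not above, the 0.878 listed for DeepDTA in the table.
On Davis and KIBA, adding PDM produced no significant improvement; PDM alone in some cases performed on par with the full sequence.
LMCS provided a slight improvement on Davis but offered no advantage over LS on KIBA.
Compared with KronRLS and SimBoost, WideDTA variants that combine LS and PS (including those that add PDM and/or LMCS) consistently perform better. On Davis, the best WideDTA CI is 0.886 versus 0.871 for KronRLS and 0.872 for SimBoost; on KIBA, the best WideDTA CI is 0.875 versus 0.863 for DeepDTA.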
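To make the metrics in these comparisons concrete, the following Python sketch computes the Concordance Index, MSE, and Pearson correlation from scratch. It uses the common pairwise definition of CI: over all pairs with different true affinities, count the fraction whose predictions are ordered the same way, with prediction ties scored 0.5. That this matches the exact evaluation code behind the reported numbers is an assumption; the function names and toy values are illustrative only.

```python
import math


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order."""
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # pairs with equal true affinity are not comparable
            comparable += 1
            # hi is the index of the sample with the larger true affinity
            hi, lo = (i, j) if y_true[i] > y_true[j] else (j, i)
            diff = y_pred[hi] - y_pred[lo]
            if diff > 0:
                concordant += 1.0
            elif diff == 0:
                concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_correlation(y_true, y_pred):
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Toy values (not taken from the paper): one of six pairs is mis-ordered.
toy_true = [5.0, 6.0, 7.0, 8.0]
toy_pred = [5.1, 6.2, 6.1, 8.3]
toy_ci = concordance_index(toy_true, toy_pred)  # 5 of 6 pairs concordant
```

A CI of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, which is why differences in the second decimal place, such as 0.875 versus 0.863, are treated as meaningful in this benchmark. The pairwise loop is quadratic in the number of samples, so it is suitable for illustration rather than for datasets the size of KIBA.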

This review was generated by AI and checked by a human editor.