QUICK REVIEW

[Paper Review] WideDTA: prediction of drug-target binding affinity

Hakime Öztürk, Elif Özkırımlı | arXiv (Cornell University) | Feb 4, 2019

Computational Drug Discovery Methods | References: 9 | Cited by: 152

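One-Sentence Summary

WideDTA predicts drug-target binding affinity from word-based representations of protein sequences, ligand SMILES, PROSITE domains/motifs, and ligand maximum common substructures (MCS), and outperforms DeepDTA with statistical significance on KIBA; on these kinase-focused datasets, the domain/motif and MCS inputs add only limited gains.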

ABSTRACT

Motivation: Prediction of the interaction affinity between proteins and compounds is a major challenge in the drug discovery process. WideDTA is a deep-learning based prediction model that employs chemical and biological textual sequence information to predict binding affinity. Results: WideDTA uses four text-based information sources, namely the protein sequence, ligand SMILES, protein domains and motifs, and maximum common substructure words to predict binding affinity. WideDTA outperformed one of the state of the art deep learning methods for drug-target binding affinity prediction, DeepDTA on the KIBA dataset with a statistical significance. This indicates that the word-based sequence representation adapted by WideDTA is a promising alternative to the character-based sequence representation approach in deep learning models for binding affinity prediction, such as the one used in DeepDTA. In addition, the results showed that, given the protein sequence and ligand SMILES, the inclusion of protein domain and motif information as well as ligand maximum common substructure words do not provide additional useful information for the deep learning model. Interestingly, however, using only domain and motif information to represent proteins achieved similar performance to using the full protein sequence, suggesting that important binding relevant information is contained within the protein motifs and domains.

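Motivation and Objectives

Predict protein-ligand binding affinity from text-based representations.
Assess whether adding domain/motif and maximum common substructure information improves prediction.
Compare the word-based WideDTA with earlier character-based models and traditional methods on the Davis and KIBA datasets.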

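Proposed Method

Represent the protein sequence as 3-residue words (PS).
Represent the ligand SMILES as 8-character words extracted with a sliding window (LS).
Extract protein domains/motifs from PROSITE and represent them as 3-residue words (PDM).
Extract ligand maximum common substructures and represent them as words (LMCS).
For each information source, apply a two-layer 1D convolutional neural network followed by max pooling to obtain features, then concatenate the features and pass them through three fully connected layers with dropout.
Train and evaluate on the Davis and KIBA datasets using the Concordance Index (CI), MSE, and Pearson correlation; compare against KronRLS, SimBoost, and DeepDTA.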
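The word-based representations listed above can be illustrated with a minimal sketch. It is not the authors' code: the helper names (`extract_words`, `build_vocabulary`, `encode`), the padding id 0, and the toy inputs are assumptions made for illustration. The summary states that SMILES words come from a sliding window; how protein windows are stepped is not stated here, so `stride` is left as a parameter.

```python
def extract_words(text, size, stride=1):
    """Return every `size`-character window of `text`, stepping by `stride`.

    stride=1 gives overlapping (sliding-window) words;
    stride=size gives non-overlapping words.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    return [text[i:i + size] for i in range(0, len(text) - size + 1, stride)]


def build_vocabulary(word_lists):
    """Map each distinct word to a positive integer id (0 is kept for padding)."""
    vocab = {}
    for words in word_lists:
        for word in words:
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode(words, vocab, max_len):
    """Convert words to ids, skip unknown words, then truncate or zero-pad to max_len."""
    ids = [vocab[w] for w in words if w in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))


# Toy inputs chosen for illustration only; they are not taken from Davis or KIBA.
protein = "MKKFFDSRREQ"
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin

ps_words = extract_words(protein, size=3)  # 3-residue protein words (PS)
ls_words = extract_words(smiles, size=8)   # 8-character SMILES words (LS)

ps_vocab = build_vocabulary([ps_words])
ps_ids = encode(ps_words, ps_vocab, max_len=12)

print(ps_words)
print(ls_words[:3])
print(ps_ids)
```

The integer sequences produced by `encode` are the kind of input an embedding layer followed by 1D convolutions would consume; the same helpers could be applied to PDM and LMCS strings, assuming those are available as plain text.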

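Experimental Results

Research Questions

RQ1: Do word-based protein and ligand representations improve binding affinity prediction compared with character-based approaches?
RQ2: Do PDM (domain/motif information) and LMCS words provide additional predictive value beyond the full protein sequence (PS) and ligand SMILES (LS)?
RQ3: How does WideDTA perform on the Davis and KIBA benchmarks compared with state-of-the-art methods?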

Key Findings

Dataset	Model	Proteins	Compounds	CI	MSE	Pearson
Davis	WideDTA (best)	PS + PDM	LS + LMCS	0.886 (0.003)	0.262 (0.009)	0.814 (0.003)
Davis	KronRLS	S-W	PubChem Sim	0.871 (0.0008)	0.379	-
Davis	SimBoost	S-W	PubChem Sim	0.872 (0.002)	0.282	-
Davis	DeepDTA (char)	PS (char)	LS (char)	0.878 (0.004)	0.261	-
KIBA	WideDTA (best)	PS + PDM	LS + LMCS	0.875 (0.001)	0.179 (0.008)	0.856 (0.003)
KIBA	KronRLS	S-W	PubChem Sim	0.782 (0.0009)	0.411	-
KIBA	SimBoost	S-W	PubChem Sim	0.836 (0.001)	0.222	-
KIBA	DeepDTA (char)	PS (char)	LS (char)	0.863 (0.002)	0.194	-

With all four modules included, WideDTA achieves its best result on Davis: CI 0.886 and MSE 0.262.
On KIBA, the best WideDTA configuration reaches CI 0.875 and MSE 0.179.
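Using only PS + LS, WideDTA reaches CI 0.874 on both Davis and KIBA; by the CI values in the table, this exceeds DeepDTA on KIBA (0.863) but not on Davis (0.878).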
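On both Davis and KIBA, adding PDM did not significantly improve performance; in some cases, PDM alone performed similarly to the full sequence.
LMCS gave a marginal gain on Davis but was less helpful than LS on KIBA.
Compared with KronRLS and SimBoost, the WideDTA variants built on LS/PS (including the PDM/LMCS combinations) performed better on both datasets: on Davis, the best WideDTA CI is 0.886 versus 0.871 for KronRLS and 0.872 for SimBoost; on KIBA, the best WideDTA CI is 0.875 versus 0.782 for KronRLS, 0.836 for SimBoost, and 0.863 for DeepDTA.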
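To make the three reported metrics concrete, here is a small pure-Python sketch. It uses the common pairwise definition of the Concordance Index (the fraction of pairs with different true affinities whose predictions are ordered the same way, with prediction ties counted as 0.5); that this matches the paper's exact evaluation code is an assumption, and the function names and toy values are illustrative only.

```python
import math


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs (different true values) ranked correctly.

    Tied predictions contribute 0.5; the O(n^2) loop is meant for small inputs.
    """
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(n):
            if y_true[i] > y_true[j]:
                comparable += 1
                diff = y_pred[i] - y_pred[j]
                if diff > 0:
                    concordant += 1.0
                elif diff == 0:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no pairs with different true values")
    return concordant / comparable


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted affinities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted affinities."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Toy affinities for illustration; these are not values from Davis or KIBA.
y_true = [5.0, 6.2, 7.1, 8.4]
y_pred = [5.3, 6.0, 7.5, 7.4]

print(concordance_index(y_true, y_pred))
print(mean_squared_error(y_true, y_pred))
print(pearson(y_true, y_pred))
```

CI depends only on the ranking of predictions, while MSE penalizes the size of each error, which is one reason a model can lead on CI yet trail slightly on MSE, as in the Davis rows for WideDTA and DeepDTA.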

This review was generated by AI and reviewed by a human editor.