QUICK REVIEW

[Paper Review] Synergy Effect between Convolutional Neural Networks and the Multiplicity of SMILES for Improvement of Molecular Prediction

Talia B. Kimber, Sebastian Engelke|arXiv (Cornell University)|Dec 11, 2018

Machine Learning in Materials Science|References: 6|Cited by: 42

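One-Sentence Summary

This paper proposes the Convolutional Neural Fingerprint (CNF) model, which applies CNNs to SMILES representations and exploits the multiplicity of SMILES for data augmentation. It reaches accuracy competitive with traditional descriptors and often improves results on small data sets.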

ABSTRACT

In our study, we demonstrate the synergy effect between convolutional neural networks and the multiplicity of SMILES. The model we propose, the so-called Convolutional Neural Fingerprint (CNF) model, reaches the accuracy of traditional descriptors such as Dragon (Mauri et al. [22]), RDKit (Landrum [18]), CDK2 (Willighagen et al. [43]) and PyDescriptor (Masand and Rastija [20]). Moreover the CNF model generally performs better than highly fine-tuned traditional descriptors, especially on small data sets, which is of great interest for the chemical field where data sets are generally small due to experimental costs, the availability of molecules or accessibility to private databases. We evaluate the CNF model along with SMILES augmentation during both training and testing. To the best of our knowledge, this is the first time that such a methodology is presented. We show that using the multiplicity of SMILES during training acts as a regulariser and therefore avoids overfitting and can be seen as ensemble learning when considered for testing.

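Research Motivation and Objectives

Demonstrate the synergy between convolutional neural networks and multiple SMILES representations in molecular prediction.
Show that the multiplicity of SMILES acts as data augmentation and regularisation for the CNF model.
Compare CNF with traditional descriptors and other neural network models on regression and classification tasks.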

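Proposed Method

Represent SMILES as one-hot encoded strings that are processed by CNN layers to generate neural fingerprints.
Combine flat and hierarchical CNN architectures inspired by ResNet and the neural fingerprint concept.
Apply hashing after the convolutions to turn locality-sensitive embeddings into dense features.
Apply SMILES augmentation during both training and testing to obtain data augmentation and ensemble effects.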
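The first step above, turning a SMILES string into a one-hot matrix that a CNN can consume, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes character-level tokenisation (so multi-character atoms such as `Cl` would be split), zero-padding to a fixed length, and hypothetical helper names `build_vocab` and `one_hot`.

```python
from typing import Dict, List


def build_vocab(smiles_list: List[str]) -> Dict[str, int]:
    """Map every character seen in the SMILES strings to a column index."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i for i, ch in enumerate(chars)}


def one_hot(smiles: str, vocab: Dict[str, int], max_len: int) -> List[List[int]]:
    """Return a max_len x len(vocab) matrix; rows past the string end stay all-zero."""
    if len(smiles) > max_len:
        raise ValueError("SMILES string is longer than max_len")
    matrix = [[0] * len(vocab) for _ in range(max_len)]
    for pos, ch in enumerate(smiles):
        matrix[pos][vocab[ch]] = 1
    return matrix


# Two valid SMILES strings for the same molecule (ethanol).
variants = ["CCO", "OCC"]
vocab = build_vocab(variants)
encoded = [one_hot(s, vocab, max_len=5) for s in variants]
```

The two strings describe the same molecule but produce different matrices, which is why a model trained only on canonical SMILES may not recognise the alternative forms.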

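Experimental Results

Research Questions

RQ1: Can CNN-based feature extraction from SMILES compete with traditional molecular descriptors on QSAR/QSPR tasks?
RQ2: Does adding SMILES multiplicity during training and testing improve predictive performance over using canonical SMILES alone?
RQ3: How does CNF performance vary across regression and classification targets with different data set sizes?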

Key Findings

Target	Size	Augmentation (train/test)	RMSE/AUC
MP	9104	1/1	45.6
MP	9104	10/1	42.8
MP	9104	1/10	96.2
MP	9104	10/10	39.2
MP	9104	10/25	39.0
BP	1893	1/1	25.0
BP	1893	10/1	20.7
BP	1893	1/10	61.2
BP	1893	10/10	18.6
BP	1893	10/25	18.6
BCF	378	1/1	0.78
BCF	378	10/1	0.71
BCF	378	1/10	1.20
BCF	378	10/10	0.65
BCF	378	10/25	0.65
FreeSolv	642	1/1	1.42
FreeSolv	642	10/1	1.40
FreeSolv	642	1/10	2.30
FreeSolv	642	10/10	1.14
FreeSolv	642	10/25	1.11
LogS	311	1/1	0.78
LogS	311	10/1	0.67
LogS	311	1/10	2.16
LogS	311	10/10	0.62
LogS	311	10/25	0.62
Lipo	200	1/1	0.81
Lipo	200	10/1	0.76
Lipo	200	1/10	1.21
Lipo	200	10/10	0.67
Lipo	200	10/25	0.68
BACE	513	1/1	0.98
BACE	513	10/1	0.78
BACE	513	1/10	1.32
BACE	513	10/10	0.71
BACE	513	10/25	0.71
DHFR	739	1/1	0.78
DHFR	739	10/1	0.76
DHFR	739	1/10	1.32
DHFR	739	10/10	0.70
DHFR	739	10/25	0.71
LEL	483	1/1	1.0
LEL	483	10/1	1.0
LEL	483	1/10	1.1
LEL	483	10/10	1.0
LEL	483	10/25	1.0
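To make the size of the effect in the table above easier to read, the relative change between the 1/1 and 10/10 settings can be computed directly. The sketch below copies three rows from the table and assumes the reported values are RMSE (lower is better); the helper name `relative_reduction` is hypothetical.

```python
from typing import Dict, Tuple

# (1/1 value, 10/10 value) copied from the table; assumed to be RMSE.
table_rows: Dict[str, Tuple[float, float]] = {
    "MP": (45.6, 39.2),
    "BP": (25.0, 18.6),
    "FreeSolv": (1.42, 1.14),
}


def relative_reduction(baseline: float, augmented: float) -> float:
    """Percentage by which the augmented error is lower than the baseline."""
    return 100.0 * (baseline - augmented) / baseline


reductions = {
    target: relative_reduction(baseline, augmented)
    for target, (baseline, augmented) in table_rows.items()
}

for target, pct in reductions.items():
    print(f"{target}: {pct:.1f}% lower error with 10/10 augmentation")
```

For these three rows the reduction is roughly 14%, 26%, and 20% respectively.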
Target	Size	Model	AUC
HIV	1127	CNF	0.79
HIV	1127	KernelSVM	0.792
AMES	542	CNF	0.87
AMES	542	Deepchem	NA
BACE	513	CNF	0.88
BACE	513	RF	0.867
Clintox	478	CNF	0.73
Clintox	478	Weave	0.832
Tox21	831	CNF	0.84
Tox21	831	ConvGraph	0.829
BBBP	2039	CNF	0.92
BBBP	2039	KernelSVM	0.729
JAK3	886	CNF	0.78

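With SMILES augmentation, CNF often matches or exceeds traditional descriptors such as Dragon, RDKit, CDK2, and PyDescriptor.
Augmenting SMILES during training markedly improves predictive performance, consistent with the usual benefits of data augmentation.
Augmenting only at test time generally degrades performance, indicating that the model maps non-canonical SMILES poorly when it has not seen them during training.
Augmenting SMILES during both training and testing gives the best results, showing both data augmentation and ensemble effects.
On multiple targets, CNF performs on par with or better than state-of-the-art DeepChem models in regression and classification tasks.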
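The train-and-test augmentation scheme summarised above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the alternative SMILES strings are taken as given (in practice a cheminformatics toolkit such as RDKit would generate them), test-time predictions are combined by a plain average, and `augment_training_set`, `ensemble_predict`, `toy_model`, and all numeric values are hypothetical placeholders.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple


def augment_training_set(
    labelled: List[Tuple[str, float]],
    variants_of: Dict[str, List[str]],
) -> List[Tuple[str, float]]:
    """Pair every SMILES variant of a molecule with that molecule's label."""
    return [
        (variant, label)
        for mol_id, label in labelled
        for variant in variants_of[mol_id]
    ]


def ensemble_predict(predict_one: Callable[[str], float], variants: List[str]) -> float:
    """Average the model output over all SMILES variants of one molecule."""
    return mean(predict_one(s) for s in variants)


# Three valid SMILES strings for ethanol; the label 0.5 is a made-up placeholder.
variants_of = {"ethanol": ["CCO", "OCC", "C(O)C"]}
labelled = [("ethanol", 0.5)]
train_pairs = augment_training_set(labelled, variants_of)

# Stand-in for a trained model: a lookup of made-up per-string outputs.
placeholder_scores = {"CCO": 1.0, "OCC": 1.5, "C(O)C": 2.0}


def toy_model(smiles: str) -> float:
    return placeholder_scores[smiles]


prediction = ensemble_predict(toy_model, variants_of["ethanol"])
```

Training on `train_pairs` exposes the model to several spellings of each molecule, and averaging at test time treats those spellings as members of a small ensemble. For classification, the same averaging could be applied to predicted probabilities.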

This review was generated by AI and checked by a human editor.