QUICK REVIEW

[Paper Review] SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery

Shion Honda, Shoi Shi | arXiv (Cornell University) | Nov 12, 2019

Computational Drug Discovery Methods | References: 32 | Cited by: 163

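ONE-SENTENCE SUMMARY

Introduces SMILES Transformer, a Transformer-based pre-trained model for molecular fingerprints that enables data-efficient prediction on small datasets and achieves competitive results on the MoleculeNet benchmark.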

ABSTRACT

In drug-discovery-related tasks such as virtual screening, machine learning is emerging as a promising way to predict molecular properties. Conventionally, molecular fingerprints (numerical representations of molecules) are calculated through rule-based algorithms that map molecules to a sparse discrete space. However, these algorithms perform poorly for shallow prediction models or small datasets. To address this issue, we present SMILES Transformer. Inspired by Transformer and pre-trained language models from natural language processing, SMILES Transformer learns molecular fingerprints through unsupervised pre-training of the sequence-to-sequence language model using a huge corpus of SMILES, a text representation system for molecules. We performed benchmarks on 10 datasets against existing fingerprints and graph-based methods and demonstrated the superiority of the proposed algorithms in small-data settings where pre-training facilitated good generalization. Moreover, we define a novel metric to concurrently measure model accuracy and data efficiency.

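MOTIVATION AND OBJECTIVES

Motivate the need for data-efficient molecular representations in drug discovery, especially when labeled data are limited.
Propose a Transformer-based, text-derived fingerprint learned from a large corpus of unlabeled SMILES.
Show that SMILES Transformer (ST) fingerprints, paired with simple predictors, achieve strong data efficiency on MoleculeNet tasks.
Introduce a Data Efficiency Metric (DEM) to evaluate performance across different training set sizes.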

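PROPOSED METHOD

Build an encoder-decoder Transformer with four blocks and four-head attention to generate continuous molecular fingerprints from SMILES.
Pre-train on 861,000 unlabeled SMILES from ChEMBL24, using SMILES enumeration (alternative SMILES strings of the same molecule as augmentation) and a cross-entropy objective.
Extract molecule-level fingerprints by pooling symbol-level outputs (mean, max, and first/last-layer outputs), yielding a 1024-dimensional vector.
Compare ST fingerprints against ECFP4, RNNS2S, and GraphConv on 10 MoleculeNet datasets, using MLP-style predictors.
Define and compute the Data Efficiency Metric (DEM), which averages performance over exponentially increasing training set sizes.
Visualize the latent space with t-SNE to explore why ST fingerprints perform well on certain datasets.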
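
The pooling step in the method list can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the fingerprint concatenates four vectors: the mean-pooled last layer, the max-pooled last layer, the first-position output of the last layer, and the first-position output of the penultimate layer. That is one reading of the "mean, max, first/last-layer" pooling described here. With a 256-dimensional hidden size (also an assumption), the result has 1024 dimensions. All function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(states: Sequence[Vector]) -> Vector:
    """Average each hidden dimension over all token positions."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]


def max_pool(states: Sequence[Vector]) -> Vector:
    """Take the maximum of each hidden dimension over all token positions."""
    return [max(column) for column in zip(*states)]


def st_style_fingerprint(
    last_layer: Sequence[Vector],
    penultimate_layer: Sequence[Vector],
) -> Vector:
    """Concatenate four pooled views of the encoder outputs (assumed layout).

    Each argument holds one hidden vector per SMILES symbol. With hidden
    size d, the returned fingerprint has 4 * d entries.
    """
    if not last_layer or not penultimate_layer:
        raise ValueError("both layers need at least one token position")
    return (
        mean_pool(last_layer)          # mean over positions, last layer
        + max_pool(last_layer)         # max over positions, last layer
        + list(last_layer[0])          # first position, last layer
        + list(penultimate_layer[0])   # first position, penultimate layer
    )
```

Because the result is a fixed-length vector regardless of SMILES length, it can be fed directly to a simple predictor such as an MLP or ridge regression.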

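EXPERIMENTAL RESULTS

Research Questions

RQ1: In small-data settings, do ST fingerprints outperform conventional fingerprints and graph-based methods?
RQ2: How data-efficient is ST relative to the baselines when training data are scarce?
RQ3: Which properties of the ST latent space correlate with predictive performance across datasets?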

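Key Findings

ST achieves the best DEM on 5 of the 10 MoleculeNet datasets, particularly in small-data settings (ESOL, FreeSolv, BBBP, ClinTox).
Combined with simple predictors (MLP, ridge/logistic regression), ST fingerprints yield results competitive with or better than the baselines on several tasks.
Overall, ST is competitive with GraphConv and ECFP4 and can match or exceed them in data-limited settings.
Longer SMILES tend to improve ST performance, suggesting that longer sequences carry richer information.
The new Data Efficiency Metric (DEM) effectively captures performance as the training set size varies.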
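
A rough sketch of how a DEM-style score could be computed: the test score is averaged over training fractions that double at each step. The 1.25% to 80% range and the `evaluate` callable (train on the given fraction of the training set, return a test score) are assumptions made for illustration. They are not details given in this review.

```python
from typing import Callable, List


def doubling_fractions(start: float = 0.0125, stop: float = 0.8) -> List[float]:
    """Training-set fractions that double each step (assumed schedule)."""
    fractions = []
    fraction = start
    while fraction <= stop + 1e-12:  # small tolerance for float comparison
        fractions.append(fraction)
        fraction *= 2
    return fractions


def data_efficiency_metric(
    evaluate: Callable[[float], float],
    fractions: List[float],
) -> float:
    """Average the score returned by `evaluate` over all training fractions.

    `evaluate(fraction)` is a hypothetical user-supplied function that trains
    a model on that fraction of the training data and returns its test metric.
    """
    if not fractions:
        raise ValueError("at least one training fraction is required")
    scores = [evaluate(fraction) for fraction in fractions]
    return sum(scores) / len(scores)
```

A single averaged number of this kind rewards models that already perform well at the smallest fractions, which is the setting the review highlights. For error metrics such as RMSE, lower is better, so scores should only be compared within the same metric.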

This review was generated by AI and reviewed by human editors.