QUICK REVIEW

[Paper Review] SELFormer: Molecular Representation Learning via SELFIES Language Models

Atakan Yüksel, Erva Ulusoy|arXiv (Cornell University)|Apr 10, 2023
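Computational Drug Discovery Methods | Cited by 13
One-Sentence Summary

SELFormer is a transformer-based chemical language model that takes SELFIES strings as input. It is pre-trained on 2 million ChEMBL molecules and fine-tuned for molecular property prediction, where it outperforms SMILES-based and graph-based methods on key tasks.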

ABSTRACT

Automated computational analysis of the vast chemical space is critical for numerous fields of research such as drug discovery and material science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data. One approach to efficiently learn molecular representations is processing string-based notations of chemicals via natural language processing (NLP) algorithms. Majority of the methods proposed so far utilize SMILES notations for this purpose; however, SMILES is associated with numerous problems related to validity and robustness, which may prevent the model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model that utilizes a 100% valid, compact and expressive notation, SELFIES, as input, in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation has revealed that, SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based chemical language models, on predicting aqueous solubility of molecules and adverse drug reactions. We also visualized molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We shared SELFormer as a programmatic tool, together with its datasets and pre-trained models. Overall, our research demonstrates the benefit of using the SELFIES notations in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.

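Motivation and Objectives

  • Address the need for improved molecular representations that are fully valid and robust for learning
  • Combine SELFIES, a robust string-based molecular notation, with a transformer language model
  • Pre-train on a large corpus of drug-like molecules and fine-tune for diverse property prediction tasks
  • Compare performance against SMILES-based and graph-based models on multiple benchmarks
  • Give the research community open access to the code, datasets, and pre-trained models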

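Proposed Method

  • Convert SMILES to SELFIES and tokenize with a RoBERTa-style byte-level BPE tokenizer
  • Pre-train a RoBERTa-style transformer encoder with masked language modeling on 2M ChEMBL molecules
  • Run a hyperparameter search (number of attention heads, number of layers, learning rate, batch size, training epochs) to select SELFormer and SELFormer-Lite
  • Fine-tune the pre-trained models with a two-layer linear head on MoleculeNet classification and regression tasks
  • Evaluate classification with ROC-AUC and PRC and regression with RMSE, using scaffold and random splits
  • Release the pre-trained models and representations publicly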
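The masked language modeling step listed above can be illustrated with a minimal sketch. It is not the authors' pipeline: it splits a SELFIES string into its bracketed symbols with a regular expression instead of the byte-level BPE tokenizer the paper uses, and it assumes the conventional BERT/RoBERTa masking recipe (15% of positions selected; of those, 80% replaced by a mask token, 10% by a random symbol, 10% left unchanged). The names `split_selfies_symbols`, `mask_symbols`, and `MASK` are hypothetical.

```python
import random
import re

MASK = "[MASK]"  # hypothetical mask token name, not taken from the paper


def split_selfies_symbols(selfies_string):
    """Split a SELFIES string into its bracketed symbols."""
    return re.findall(r"\[[^\]]*\]", selfies_string)


def mask_symbols(symbols, vocab, mask_prob=0.15, seed=0):
    """Return (masked_input, labels).

    labels[i] holds the original symbol at positions the model must predict
    and None everywhere else. The 80/10/10 rule is an assumed convention.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for sym in symbols:
        if rng.random() < mask_prob:
            labels.append(sym)
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)               # replace with the mask token
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # replace with a random symbol
            else:
                masked.append(sym)                # keep the original symbol
        else:
            masked.append(sym)
            labels.append(None)
    return masked, labels


# Illustrative SELFIES-style input string.
symbols = split_selfies_symbols("[C][C][=Branch1][C][=O][O]")
vocab = sorted(set(symbols))
masked, labels = mask_symbols(symbols, vocab, mask_prob=0.15, seed=0)
```

During pre-training the encoder would be trained to recover `labels` from `masked`; positions whose label is `None` contribute nothing to the loss.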
Figure 1: The schematic representation of the SELFormer architecture and the experiments conducted. (A) the self-supervised pre-training utilizes the transformer encoder module via masked language modeling for learning concise and informative representations of small molecules encoded by their SELFI
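The scaffold split mentioned in the evaluation setup keeps molecules that share a core structure in the same subset, so the test set contains scaffolds the model has not seen. The sketch below is a rough illustration under stated assumptions: scaffold keys are precomputed strings (in practice they would come from a cheminformatics toolkit, for example Bemis-Murcko scaffolds), the 80/10/10 fractions are illustrative, and `scaffold_split` is a hypothetical helper rather than the paper's code.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that no scaffold is shared across subsets.

    scaffolds: one precomputed scaffold key per molecule (assumed input).
    Returns (train, valid, test) lists of indices.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest scaffold groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = frac_train * n
    valid_cap = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Ten molecules with four distinct (made-up) scaffold keys.
keys = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train_idx, valid_idx, test_idx = scaffold_split(keys)
```

A random split, by contrast, would simply shuffle the indices, which usually makes the task easier because near-duplicates of test molecules can appear in training.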

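Experimental Results

Research Questions

  • RQ1: Can a SELFIES-based transformer learn more robust molecular representations than SMILES-based models?
  • RQ2: How does pre-training on a large SELFIES corpus affect downstream molecular property prediction performance?
  • RQ3: On standard MoleculeNet tasks, how does fine-tuning compare with using the pre-trained representations directly?
  • RQ4: How do SELFIES-based models perform relative to graph-based and SMILES-based language models on classification and regression benchmarks?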

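Key Findings

  • SELFormer outperforms competing methods at predicting aqueous solubility (ESOL) and adverse drug reactions (SIDER).
  • On the other downstream tasks, the SELFIES-based model achieves results comparable to other methods.
  • The pre-trained representations, before any extensive fine-tuning, already separate molecules with differing structural properties, as seen in the visualization analysis.
  • Ablation studies show that SELFormer outperforms the lighter SELFormer-Lite on most tasks and that fine-tuning improves performance.
  • Fine-tuning with optimized hyperparameters yields notable gains on several benchmarks, and SELFIES helps improve performance on stereochemistry-sensitive tasks.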
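The comparisons above rest on ROC-AUC for classification tasks such as SIDER and RMSE for regression tasks such as ESOL. As a reminder of what those numbers measure, here is a small dependency-free sketch. The function names are hypothetical and the inputs are made-up toy values, not results from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error; lower is better."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Pairwise O(P*N) version, for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy values only.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly
err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

A ROC-AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, while RMSE is expressed in the units of the predicted property.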
Figure 2: Selected use-case molecules from three molecular property-based datasets of; (A) the blood–brain barrier penetration (BBBP), (B) the Side Effect Resource (SIDER); and (C) aqueous solubility (ESOL).


This review was generated by AI and checked by human editors.