QUICK REVIEW

[Paper Review] SELFIES: a robust representation of semantically constrained graphs with an example application in chemistry.

Mario Krenn, Florian Häse|arXiv (Cornell University)|May 31, 2019

Computational Drug Discovery Methods | References: 43 | Cited by: 54

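One-Sentence Summary

This paper introduces SELFIES, a 100% robust string-based molecular representation in which every string corresponds to a chemically valid molecule. By encoding molecular structure with a hierarchical, self-referencing grammar, SELFIES enables robust generative machine learning in chemistry: in the authors' experiments, the model's internal memory stores two orders of magnitude more diverse molecules than a comparable test with SMILES, and generated molecules are valid and interpretable without post-processing.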

ABSTRACT

The discovery of novel materials and functional molecules can help to solve some of society's most urgent challenges, ranging from efficient energy harvesting and storage to uncovering novel pharmaceutical drug candidates. Traditionally matter engineering -- generally denoted as inverse design -- was based massively on human intuition and high-throughput virtual screening. The last few years have seen the emergence of significant interest in computer-inspired designs based on evolutionary or deep learning methods. The major challenge here is that the standard strings molecular representation SMILES shows substantial weaknesses in that task because large fractions of strings do not correspond to valid molecules. Here, we solve this problem at a fundamental level and introduce SELFIES (SELF-referencIng Embedded Strings), a string-based representation of molecules which is 100% robust. Every SELFIES string corresponds to a valid molecule, and SELFIES can represent every molecule. SELFIES can be directly applied in arbitrary machine learning models without the adaptation of the models; each of the generated molecule candidates is valid. In our experiments, the model's internal memory stores two orders of magnitude more diverse molecules than a similar test with SMILES. Furthermore, as all molecules are valid, it allows for explanation and interpretation of the internal working of the generative models.

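Research Motivation and Objectives

Address the fundamental limitation of SMILES, where a large fraction of generated strings do not correspond to valid molecules, by creating a representation that guarantees validity.
Enable reliable and efficient inverse molecular design by ensuring that every generated candidate is chemically valid from the outset.
Support diverse, memory-efficient exploration of chemical space in generative models without a validity filtering step.
Make generative model behavior easier to interpret by eliminating invalid candidates from the search space.
Provide a general, grammar-based string representation that can be used directly in any machine learning model without architectural changes.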

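Proposed Method

Design a hierarchical, self-referencing set of grammar rules that encodes molecular structure through a recursive, context-free approach, ensuring syntactic correctness.
Represent molecules as strings using a fixed set of production rules, so that valence and connectivity constraints are enforced at the grammar level.
Use self-referencing tokens to encode molecular substructures, representing complex fragments compactly.
Construct a string representation in which, by design, every possible string corresponds to a unique, valid molecule.
Integrate SELFIES directly into existing machine learning models, with no model adaptation or architectural changes.
Exploit the grammar-based structure to efficiently search for and generate highly diverse molecules in latent space.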
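The idea of enforcing valence at the grammar level can be illustrated with a minimal sketch. This is a toy decoder and not the SELFIES grammar from the paper: it handles only unbranched chains, omits the self-referencing branch and ring tokens, and its token alphabet, valence table, and function names (`decode`, `is_valence_correct`) are assumptions made for illustration. The decoder tracks how many free valences the previous atom has and clamps each requested bond order to what is chemically possible, so any token sequence yields a valence-correct SMILES chain.

```python
import random

# Toy valence table (assumption: common valences for four elements).
MAX_BONDS = {"C": 4, "N": 3, "O": 2, "F": 1}
BOND_ORDER = {"": 1, "=": 2, "#": 3}
BOND_SYMBOL = {1: "", 2: "=", 3: "#"}
# Hypothetical alphabet: every element with every requested bond order.
ALPHABET = [f"[{b}{a}]" for a in MAX_BONDS for b in BOND_ORDER]


def decode(tokens):
    """Map ANY token sequence to a linear, valence-correct SMILES chain."""
    smiles = ""
    free = None  # free valences on the previous atom; None = no atom yet
    for tok in tokens:
        body = tok.strip("[]")
        atom = body[-1]
        requested = BOND_ORDER[body[:-1]]
        if free is None:
            # First atom: there is nothing to bond to, so the prefix is ignored.
            smiles += atom
            free = MAX_BONDS[atom]
            continue
        if free == 0:
            break  # previous atom is saturated, so the chain must end here
        # Clamp the requested bond order to what both atoms can support.
        order = min(requested, free, MAX_BONDS[atom])
        smiles += BOND_SYMBOL[order] + atom
        free = MAX_BONDS[atom] - order
    return smiles


def is_valence_correct(smiles):
    """Check that no atom in a linear SMILES chain exceeds its valence."""
    atoms, used, order = [], [], 1
    for ch in smiles:
        if ch in "=#":
            order = BOND_ORDER[ch]
            continue
        if atoms:
            used[-1] += order  # bond counts against the previous atom too
        atoms.append(ch)
        used.append(order if len(atoms) > 1 else 0)
        order = 1
    return all(u <= MAX_BONDS[a] for a, u in zip(atoms, used))


print(decode(["[C]", "[=O]", "[C]"]))   # C=O  (oxygen is saturated, chain stops)
print(decode(["[F]", "[#N]", "[C]"]))   # FNC  (triple bond clamped to single)

# Random token sequences never produce a valence violation.
rng = random.Random(0)
samples = [decode(rng.choices(ALPHABET, k=8)) for _ in range(1000)]
print(all(is_valence_correct(s) for s in samples))
```

The design choice worth noting is that validity comes from the decoder's state, not from filtering: an over-ambitious token is reinterpreted rather than rejected, which is why no post-processing step is needed.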

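Experimental Results

Research Questions

RQ1: Can a string-based molecular representation be constructed in which every possible string corresponds to a valid molecule?
RQ2: How do SELFIES and SMILES compare in generative models in terms of memory efficiency and the diversity of generated molecules?
RQ3: To what extent does a 100% valid representation improve the interpretability and reliability of machine learning models for molecular generation?
RQ4: Can SELFIES be integrated seamlessly into existing deep learning frameworks without modifying the model architecture?
RQ5: Does using SELFIES significantly increase the number of unique, valid molecules explored during model training?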

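Key Findings

Every SELFIES string corresponds to a valid molecule, giving 100% validity without post-processing or filtering.
Compared with a similar SMILES-based test, the model's internal memory stores two orders of magnitude more diverse molecules.
SELFIES can be applied directly to any machine learning model without architectural adaptation, which simplifies integration.
The representation generates complex, valid molecular structures through a self-referencing grammar that enforces chemical valence and connectivity.
Because the search space contains no invalid candidates, SELFIES makes the behavior of generative models easier to explain.
SELFIES can represent every molecule, making it a general and complete representation of molecular space.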

This review was generated by AI and reviewed by a human editor.