QUICK REVIEW

[Paper Review] SELFIES and the future of molecular string representations

Mario Krenn, Qianxiang Ai|arXiv (Cornell University)|Mar 31, 2022

Machine Learning in Materials Science | Citations: 23

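One-Line Summary

This paper reviews molecular string representations (SMILES, InChI, DeepSMILES, SELFIES), argues that SELFIES offers 100% robustness, and proposes 16 future projects toward robust, AI-friendly representations for chemistry and materials science.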

ABSTRACT

Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings; most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this manuscript, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete Future Projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.

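Motivation and Objectives

Trace the historical development of molecular representations and assess current string-based representations.
Highlight the strengths and weaknesses of SMILES, InChI, DeepSMILES, and SELFIES for AI and ML in chemistry.
Propose concrete future research directions for advancing robust, interpretable molecular representations across domains.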

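Proposed Method

Compare the standard properties and robustness of different molecular string representations.
Explain the SELFIES grammar and the overloading mechanism that guarantees semantic validity.
Discuss the prospects and limits of generalizing to polymers, crystals, and inorganic chemistry.
Propose 16 independent future projects and research themes for robust molecular representations.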
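
The idea behind a grammar that cannot produce an invalid molecule can be sketched with a toy decoder. The code below is a simplified illustration only, loosely inspired by the valence-tracking idea; it is not the real SELFIES grammar and does not use the `selfies` library. The token alphabet, the `MAX_BONDS` table, and the function names are assumptions made for this example, and only linear chains (no branches or rings) are handled. Each token requests a bond order, and the decoder reinterprets that request according to how many bonds the previous atom can still form.

```python
# Toy illustration only: NOT the real SELFIES grammar or the `selfies` API.
# Assumed valences for a tiny, hypothetical alphabet.
MAX_BONDS = {"C": 4, "N": 3, "O": 2, "F": 1}
BOND_ORDER = {"": 1, "=": 2, "#": 3}


def parse_token(token):
    """Split a token such as '[=O]' into (requested bond order, element)."""
    body = token.strip("[]")
    prefix = body[0] if body[0] in "=#" else ""
    return BOND_ORDER[prefix], body[len(prefix):]


def decode_chain(tokens):
    """Map ANY sequence of known tokens to a valence-respecting linear chain.

    Returns a list of (element, bond_order_to_previous_atom) pairs;
    the first atom has bond order 0 because it has no predecessor.
    """
    atoms = []
    free = None  # bonds the previous atom can still form; None = no atom yet
    for token in tokens:
        requested, element = parse_token(token)
        if free is None:
            # First atom: any requested bond order is ignored.
            atoms.append((element, 0))
            free = MAX_BONDS[element]
            continue
        if free == 0:
            break  # previous atom is saturated, so the rest is ignored
        # Reinterpret the request: never exceed either atom's capacity.
        order = min(requested, free, MAX_BONDS[element])
        atoms.append((element, order))
        free = MAX_BONDS[element] - order
    return atoms


print(decode_chain(["[F]", "[=C]", "[#N]"]))
```

In this sketch the token `[=C]` after `[F]` is downgraded to a single bond because fluorine can form only one bond, so the string still decodes to a chemically sensible chain (F-C#N) instead of being rejected.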

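Experimental Results

Research Questions

RQ1: How effective are the various molecular string representations in terms of robustness and suitability for ML?
RQ2: What future directions and concrete projects would extend robust string representations beyond small organic molecules?
RQ3: How can SELFIES be generalized to polymers, crystals, and non-organic chemistry while preserving robustness?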

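Key Findings

SELFIES provides 100% robustness by using a formal grammar that maps every string to a valid molecular graph.
SMILES is widely used, but a single molecule can be written as multiple strings, and generative models can produce invalid outputs.
IUPAC InChI provides canonicalization and hierarchical information, but is difficult to use for ML-based generation and can lose some bond information.
DeepSMILES improves robustness over SMILES, but can still permit semantically invalid molecules.
The paper outlines 16 concrete future projects, including domain-independent robustness (metaSELFIES) and an extension to polymers (BigSELFIES).
Robust representations have shown benefits in AI-driven applications such as genetic algorithms and search tasks.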
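
The SMILES weakness noted above, that many symbol combinations have no valid interpretation, can be illustrated with a rough experiment. The sketch below checks only two necessary conditions for a well-formed SMILES string (balanced parentheses and paired single-digit ring-closure labels); it is not a full SMILES parser and performs no valence checking. The symbol set and the function names are assumptions chosen for this example.

```python
import random

# Hypothetical, deliberately small SMILES-like symbol set.
SYMBOLS = ["C", "N", "O", "=", "#", "(", ")", "1", "2"]


def passes_basic_syntax(smiles):
    """Necessary (not sufficient) checks: parentheses and ring labels pair up."""
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing a branch that was never opened
        elif ch.isdigit():
            # Toggle: the first occurrence opens a ring, the second closes it.
            open_rings ^= {ch}
    return depth == 0 and not open_rings


def fraction_passing(n_samples=1000, length=8, seed=0):
    """Fraction of uniformly random symbol strings that pass the basic checks."""
    rng = random.Random(seed)
    passed = sum(
        passes_basic_syntax("".join(rng.choice(SYMBOLS) for _ in range(length)))
        for _ in range(n_samples)
    )
    return passed / n_samples


print(passes_basic_syntax("C1CCCCC1"))  # cyclohexane: ring label 1 is paired
print(passes_basic_syntax("C(C"))       # unclosed branch
print(fraction_passing())
```

Because these checks are only necessary conditions, the printed fraction is an upper bound on the share of random strings that a real SMILES parser would accept.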

This review was generated by AI and checked by a human editor.