QUICK REVIEW

[Paper Review] Large-Scale Chemical Language Representations Capture Molecular Structure and Properties

Jerret Ross, Brian Belgodere|arXiv (Cornell University)|Jun 17, 2021

Computational Drug Discovery Methods | Citations: 31

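One-Line Summary

MoLFormer pretrains a Transformer encoder on more than one billion SMILES strings, without explicitly using 3D geometry, to learn general-purpose molecular representations that predict a wide range of chemical properties, including quantum-chemical properties, with competitive accuracy.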

ABSTRACT

Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets. They perform competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties.

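Motivation and Objectives

Learn general-purpose molecular representations from large-scale unlabelled SMILES data.
Evaluate the learned representations on diverse MoleculeNet classification and regression tasks.
Analyze whether a model trained on SMILES captures structural and spatial molecular information.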

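Proposed Method

Pretrain MoLFormer with masked language modeling on 1.1B SMILES strings from PubChem and ZINC.
Use rotary positional embeddings and linear attention for scalable training.
Encode a SMILES string by averaging the embeddings of the last hidden state into a fixed-size molecular representation.
Train a small task-specific head for each downstream task in both regimes: with the pretrained encoder frozen and with it fine-tuned.
Compare against a broad set of supervised and self-supervised baselines, including GNNs and language models.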
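
The pooling step in the method list above (averaging the last hidden state into one fixed-size vector) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `mean_pool`, the toy embeddings, and the use of an attention mask to skip padding tokens are all assumptions made for the example.

```python
from typing import List, Sequence


def mean_pool(hidden_states: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average token embeddings over non-padding positions.

    hidden_states: one embedding (sequence of floats) per token.
    attention_mask: 1 for real tokens, 0 for padding (assumed convention).
    """
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must have equal length")
    # Keep only the embeddings of real (unmasked) tokens.
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("at least one token must be unmasked")
    dim = len(kept[0])
    # Component-wise mean gives a vector whose size is independent of
    # the SMILES length.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three tokens with 2-dimensional embeddings; the last is padding.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
molecule_embedding = mean_pool(toy_hidden, toy_mask)
```

Because the padded position is excluded, the result depends only on the two real tokens. A downstream head would then take `molecule_embedding` as its input, whatever the length of the original SMILES string.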

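Experimental Results

Research Questions

RQ1: Can a large-scale pretrained molecular language model learn representations that predict a broad range of molecular properties?
RQ2: Do SMILES-based representations capture structural information such as substructures and interatomic distances without explicit 3D geometry?
RQ3: How does MoLFormer compare with graph-based and other baselines on the MoleculeNet classification and regression benchmarks?
RQ4: How do model size, data scale, and the choice of positional embedding affect downstream performance?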

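Key Findings

MoLFormer-XL, trained on 1.1B molecules with linear attention and rotary embeddings, matches or outperforms many baselines across 10 MoleculeNet tasks (classification and regression).
MoLFormer representations outperform several graph-based and language-model baselines on multiple benchmarks and are competitive on the others, including quantum-chemical properties.
Analysis shows that MoLFormer learns aspects of the spatial relationships within a molecule, with attention correlating with interatomic distances and bonding patterns.
Rotary positional embeddings and linear attention make training at scale more efficient, reducing the number of GPUs required by roughly a factor of 60.
Ablation studies show that model depth and fine-tuning strongly affect downstream performance, while the data mixture and the choice of positional embedding also influence the results.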
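
To make the efficiency claim about linear attention more concrete, here is a minimal sketch of non-causal kernelized attention in plain Python. It is an illustration under stated assumptions, not MoLFormer's code: the feature map `elu(x) + 1` is one common choice from the linear-attention literature and is assumed here, the function names are hypothetical, and rotary embeddings are omitted. The point is that the key/value summaries are computed once, so cost grows linearly with sequence length rather than quadratically.

```python
import math
from typing import List

Matrix = List[List[float]]


def feature_map(x: float) -> float:
    """elu(x) + 1: a strictly positive feature map (assumed for illustration)."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Non-causal kernelized attention, linear in the number of tokens."""
    d, d_v = len(keys[0]), len(values[0])
    phi_q = [[feature_map(x) for x in row] for row in queries]
    phi_k = [[feature_map(x) for x in row] for row in keys]

    # Summaries over all keys, computed once:
    #   kv[a][b] = sum_j phi_k[j][a] * values[j][b]
    #   k_sum[a] = sum_j phi_k[j][a]
    kv = [[0.0] * d_v for _ in range(d)]
    k_sum = [0.0] * d
    for pk, v in zip(phi_k, values):
        for a in range(d):
            k_sum[a] += pk[a]
            for b in range(d_v):
                kv[a][b] += pk[a] * v[b]

    out = []
    for pq in phi_q:
        # Normaliser plays the role of the softmax denominator.
        norm = sum(pq[a] * k_sum[a] for a in range(d))
        out.append([sum(pq[a] * kv[a][b] for a in range(d)) / norm
                    for b in range(d_v)])
    return out


# Toy example: identical keys give uniform weights, so each output row
# is the average of the value rows.
toy_q = [[0.5, -1.0], [2.0, 0.3]]
toy_k = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
toy_v = [[1.0], [2.0], [3.0]]
toy_out = linear_attention(toy_q, toy_k, toy_v)
```

Each output row is a weighted average of the value rows, as in softmax attention, but no n-by-n attention matrix is ever materialised. That is what keeps memory and compute manageable for long sequences and large batches.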

This review was generated by AI and checked by a human editor.