QUICK REVIEW

[Paper Review] Motif-based Graph Self-Supervised Learning for Molecular Property Prediction

Zaixi Zhang, Qi Liu|arXiv (Cornell University)|Oct 3, 2021

Computational Drug Discovery Methods|Cited by 95

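One-line summary

MGSSL introduces motif-based self-supervised pre-training for GNNs, in which the model generates and predicts motifs within molecular graphs, and achieves state-of-the-art results on MoleculeNet benchmarks.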

ABSTRACT

Predicting molecular properties with data-driven methods has drawn much attention in recent years. Particularly, Graph Neural Networks (GNNs) have demonstrated remarkable success in various molecular generation and prediction tasks. In cases where labeled data is scarce, GNNs can be pre-trained on unlabeled molecular data to first learn the general semantic and structural information before being fine-tuned for specific tasks. However, most existing self-supervised pre-training frameworks for GNNs only focus on node-level or graph-level tasks. These approaches cannot capture the rich information in subgraphs or graph motifs. For example, functional groups (frequently-occurred subgraphs in molecular graphs) often carry indicative information about the molecular properties. To bridge this gap, we propose Motif-based Graph Self-supervised Learning (MGSSL) by introducing a novel self-supervised motif generation framework for GNNs. First, for motif extraction from molecular graphs, we design a molecule fragmentation method that leverages a retrosynthesis-based algorithm BRICS and additional rules for controlling the size of motif vocabulary. Second, we design a general motif-based generative pre-training framework in which GNNs are asked to make topological and label predictions. This generative framework can be implemented in two different ways, i.e., breadth-first or depth-first. Finally, to take the multi-scale information in molecular graphs into consideration, we introduce a multi-level self-supervised pre-training. Extensive experiments on various downstream benchmark tasks show that our methods outperform all state-of-the-art baselines.
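The abstract notes that motif generation can proceed breadth-first or depth-first. The sketch below only illustrates how the two visiting orders differ on a small tree. The tree, its motif names, and the helper functions are hypothetical and are not taken from the paper. The GNN prediction heads are omitted.

```python
from collections import deque

# Hypothetical motif tree for illustration: node -> list of child nodes.
# The motif names are made up and are not the paper's vocabulary.
MOTIF_TREE = {
    "benzene": ["amide", "ether"],
    "amide": ["methyl"],
    "ether": ["ethyl"],
    "methyl": [],
    "ethyl": [],
}

def bfs_order(tree, root):
    """Visit motifs level by level, starting from the root."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order

def dfs_order(tree, root):
    """Follow each branch down to its leaves before moving to the next branch."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the first child is expanded first.
        stack.extend(reversed(tree[node]))
    return order

print(bfs_order(MOTIF_TREE, "benzene"))
print(dfs_order(MOTIF_TREE, "benzene"))
```

In an autoregressive setup, each position in such an order would be one generation step at which the model predicts what to add next.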

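Motivation and objectives

Address the scarcity of labeled data in molecular property prediction through self-supervised learning.
Exploit meaningful graph motifs (functional groups) to capture semantic information beyond node-level and graph-level signals.
Develop a motif-based generative pre-training framework that combines topology prediction with motif-label prediction.
Introduce multi-level (atom-level and motif-level) self-supervised pre-training to exploit multi-scale molecular information.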

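Proposed method

Fragment molecules into semantically meaningful motifs with BRICS, then apply two post-processing rules that control the size of the motif vocabulary.
Build a motif tree and model its likelihood p(T(G);θ) under an autoregressive generation order (BFS or DFS).
Design topology and motif-label prediction heads for each generation step, and optimize a motif generation loss that combines the topology term and the label term.
Combine atom-level and motif-level pre-training in a multi-level objective with adaptive weighting based on MGDA-UB/Frank-Wolfe, avoiding catastrophic forgetting.
Pre-train on 250k unlabeled molecules from ZINC15 and fine-tune on eight MoleculeNet benchmarks using scaffold-based splits.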
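The adaptive weighting step can be illustrated for the two-objective case. With only two gradients, the min-norm problem that MGDA-style methods solve has a closed-form solution, which is also the line-search step used inside Frank-Wolfe. The sketch below assumes plain Python lists stand in for the atom-level and motif-level gradients. The function names are hypothetical, and the paper's exact procedure (for example, which gradients are used) may differ.

```python
def min_norm_weight(g_atom, g_motif):
    """Return alpha in [0, 1] minimizing ||alpha*g_atom + (1-alpha)*g_motif||^2."""
    diff = [a - m for a, m in zip(g_atom, g_motif)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        # Identical gradients: every weighting yields the same direction.
        return 0.5
    # Unconstrained minimizer of the quadratic, then clipped to [0, 1].
    alpha = -sum(d * m for d, m in zip(diff, g_motif)) / denom
    return min(1.0, max(0.0, alpha))

def combine(g_atom, g_motif):
    """Weight the two gradients with the min-norm alpha (illustrative only)."""
    alpha = min_norm_weight(g_atom, g_motif)
    mixed = [alpha * a + (1.0 - alpha) * m for a, m in zip(g_atom, g_motif)]
    return alpha, mixed

# Toy gradients, not real model gradients.
print(combine([1.0, 0.0], [0.0, 1.0]))  # orthogonal gradients of equal norm
print(combine([1.0, 0.0], [3.0, 0.0]))  # aligned gradients of different norm
```

When the two gradients point the same way, all the weight goes to the smaller one. When they conflict, the weights balance so that neither objective dominates the update.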

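Experimental results

Research questions

RQ1: Do motif-based self-supervised tasks capture chemical semantics better than node-based and graph-based SSL for molecular property prediction?
RQ2: Does multi-level (atom and motif) pre-training improve downstream performance and convergence over single-level or sequential pre-training?
RQ3: How do different motif generation orders (BFS vs. DFS) affect training and results?
RQ4: To what extent do motif vocabulary size and fragmentation strategy affect model effectiveness?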

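Key findings

MGSSL outperforms all state-of-the-art baselines on seven of the eight downstream MoleculeNet benchmarks.
MGSSL with BFS tends to achieve a higher average ROC-AUC across the benchmarks than MGSSL with DFS.
MGSSL improves a range of base GNN architectures, with the largest relative improvement on GIN.
Multi-level pre-training (atom + motif) outperforms both ablations: pre-training without the atom level, and sequential pre-training.
The best-performing motif vocabulary size (derived from the authors' fragmentation strategy) beats BRICS alone and vocabularies that are too coarse or too fine.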
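The findings above are reported in terms of ROC-AUC. As a reminder of what that metric measures, here is a minimal sketch that computes it as the fraction of (positive, negative) pairs that the scores rank correctly. The labels and scores are toy values for illustration only, not results from the paper.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy binary labels and predicted scores.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.1]
print(roc_auc(toy_labels, toy_scores))
```

This pairwise form is quadratic in the number of samples, so it suits small illustrations rather than large benchmarks.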

This review was generated by AI and checked by a human editor.