QUICK REVIEW

[Paper Review] Motif-based Graph Self-Supervised Learning for Molecular Property Prediction

Zaixi Zhang, Qi Liu|arXiv (Cornell University)|Oct 3, 2021

Computational Drug Discovery Methods|References: 53|Citations: 44

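One-sentence summary

MGSSL introduces motif-based self-supervised pre-training for GNNs, using BRICS-derived motifs and multi-level tasks, and achieves state-of-the-art results on MoleculeNet benchmarks.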

ABSTRACT

Predicting molecular properties with data-driven methods has drawn much attention in recent years. Particularly, Graph Neural Networks (GNNs) have demonstrated remarkable success in various molecular generation and prediction tasks. In cases where labeled data is scarce, GNNs can be pre-trained on unlabeled molecular data to first learn the general semantic and structural information before being fine-tuned for specific tasks. However, most existing self-supervised pre-training frameworks for GNNs only focus on node-level or graph-level tasks. These approaches cannot capture the rich information in subgraphs or graph motifs. For example, functional groups (frequently-occurred subgraphs in molecular graphs) often carry indicative information about the molecular properties. To bridge this gap, we propose Motif-based Graph Self-supervised Learning (MGSSL) by introducing a novel self-supervised motif generation framework for GNNs. First, for motif extraction from molecular graphs, we design a molecule fragmentation method that leverages a retrosynthesis-based algorithm BRICS and additional rules for controlling the size of motif vocabulary. Second, we design a general motif-based generative pre-training framework in which GNNs are asked to make topological and label predictions. This generative framework can be implemented in two different ways, i.e., breadth-first or depth-first. Finally, to take the multi-scale information in molecular graphs into consideration, we introduce a multi-level self-supervised pre-training. Extensive experiments on various downstream benchmark tasks show that our methods outperform all state-of-the-art baselines.

Motivation and objectives

Improve molecular property prediction when labeled data is limited by applying self-supervised learning to graph motifs.
Propose a motif-based fragmentation and generation framework to capture semantic substructures in molecules.
Develop a multi-level self-supervised pre-training strategy to unify atom-level and motif-level tasks.
Demonstrate that motif-based pre-training yields superior downstream performance across benchmarks.

Proposed method

Fragment molecules into semantically meaningful motifs using BRICS plus two post-processing rules to reduce vocabulary and redundancy.
Construct motif trees and define a motif-generation objective that models p(T(G); θ) via autoregressive generation (BFS or DFS).
Design topology and motif-label prediction heads to maximize log-likelihood of motif trees (equation-based losses).
Implement multi-level self-supervision by combining atom-level attribute masking and motif-level generative pre-training with adaptive task weighting via MGDA-UB/Frank-Wolfe.
Unify multi-level losses into L_ssl = λ1 L_motif + λ2 L_atom + λ3 L_bond and optimize without manual weight tuning.
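
The breadth-first and depth-first generation orders mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the toy motif tree, its labels, and the helper names (`bfs_order`, `dfs_order`, `autoregressive_targets`) are all hypothetical, and the "context, next label" pairs only show the general shape of an autoregressive objective over a motif tree.

```python
from collections import deque

# Hypothetical motif tree: node id -> (motif label, list of child node ids).
# The labels are illustrative fragment strings, not taken from the paper.
TOY_TREE = {
    0: ("c1ccccc1", [1, 2]),
    1: ("C(=O)O", [3]),
    2: ("N", []),
    3: ("C", []),
}


def bfs_order(tree, root=0):
    """Visit motifs level by level, starting from the root."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node][1])
    return order


def dfs_order(tree, root=0):
    """Visit motifs by following each branch to its end before backtracking."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the first child is expanded first.
        stack.extend(reversed(tree[node][1]))
    return order


def autoregressive_targets(tree, order):
    """Pair each motif label with the labels generated before it."""
    labels = [tree[node][0] for node in order]
    return [(tuple(labels[:i]), labels[i]) for i in range(len(labels))]


if __name__ == "__main__":
    for name, order in (("BFS", bfs_order(TOY_TREE)), ("DFS", dfs_order(TOY_TREE))):
        print(name, order)
        for context, target in autoregressive_targets(TOY_TREE, order):
            print("  given", context, "predict", target)
```

The two orders visit the same motifs but present the model with different contexts at each step, which is one plausible reason the two variants can differ in downstream performance.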

Experimental results

Research questions

RQ1: How can graph motifs be mined and organized to improve GNN pre-training for molecular properties?
RQ2: Can motif-based generative pre-training capture meaningful chemical semantics beneficial for downstream tasks?
RQ3: Does a multi-level (atom and motif) SSL framework outperform single-level SSL methods on molecular benchmarks?

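Key findings

ROC-AUC (%) per benchmark; the last column is the average across the eight benchmarks.

MUV	ClinTox	SIDER	HIV	Tox21	BACE	ToxCast	BBBP	Average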
71.7 ± 2.3	58.2 ± 2.8	57.2 ± 0.7	75.4 ± 1.5	74.3 ± 0.5	70.0 ± 2.5	63.3 ± 1.5	65.5 ± 1.8	67.0
75.1 ± 2.8	73.0 ± 3.2	58.2 ± 0.5	76.5 ± 1.6	75.2 ± 0.3	75.6 ± 1.0	62.8 ± 0.6	68.1 ± 1.3	70.6
74.7 ± 1.9	77.5 ± 3.1	59.6 ± 0.7	77.9 ± 1.2	77.2 ± 0.4	78.3 ± 1.1	63.3 ± 0.8	65.6 ± 0.9	71.8
74.1 ± 1.4	73.2 ± 2.6	58.0 ± 0.9	75.5 ± 0.8	76.6 ± 0.5	75.0 ± 1.5	63.5 ± 0.4	66.9 ± 0.7	70.4
75.0 ± 2.5	74.9 ± 2.7	59.3 ± 0.8	77.0 ± 1.7	76.1 ± 0.4	78.5 ± 0.9	63.1 ± 0.5	67.5 ± 1.3	71.4
75.8 ± 1.7	76.9 ± 1.9	60.7 ± 0.5	77.8 ± 1.4	76.3 ± 0.6	79.5 ± 1.1	63.4 ± 0.6	68.0 ± 1.5	72.3
78.1 ± 1.8	79.7 ± 2.2	60.5 ± 0.7	79.5 ± 1.1	76.4 ± 0.4	79.7 ± 0.8	63.8 ± 0.3	70.5 ± 1.1	73.5
78.7 ± 1.5	80.7 ± 2.1	61.8 ± 0.8	78.8 ± 1.2	76.5 ± 0.3	79.1 ± 0.9	64.1 ± 0.7	69.7 ± 0.9	73.7
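
As a quick sanity check on the table, the sketch below recomputes each row's average from its eight per-benchmark means and compares it with the reported last column. The numbers are copied from the rows above; the names `ROWS`, `averages_consistent`, and `best_row` are hypothetical, and rows are referred to by position only because the table carries no method labels.

```python
# Each entry: (per-benchmark mean ROC-AUC values in table order, reported average).
ROWS = [
    ([71.7, 58.2, 57.2, 75.4, 74.3, 70.0, 63.3, 65.5], 67.0),
    ([75.1, 73.0, 58.2, 76.5, 75.2, 75.6, 62.8, 68.1], 70.6),
    ([74.7, 77.5, 59.6, 77.9, 77.2, 78.3, 63.3, 65.6], 71.8),
    ([74.1, 73.2, 58.0, 75.5, 76.6, 75.0, 63.5, 66.9], 70.4),
    ([75.0, 74.9, 59.3, 77.0, 76.1, 78.5, 63.1, 67.5], 71.4),
    ([75.8, 76.9, 60.7, 77.8, 76.3, 79.5, 63.4, 68.0], 72.3),
    ([78.1, 79.7, 60.5, 79.5, 76.4, 79.7, 63.8, 70.5], 73.5),
    ([78.7, 80.7, 61.8, 78.8, 76.5, 79.1, 64.1, 69.7], 73.7),
]


def mean(values):
    return sum(values) / len(values)


def averages_consistent(rows, tol=0.051):
    """True if every reported average matches the row mean up to rounding."""
    return all(abs(mean(values) - reported) < tol for values, reported in rows)


def best_row(rows):
    """Zero-based index of the row with the highest reported average."""
    return max(range(len(rows)), key=lambda i: rows[i][1])


if __name__ == "__main__":
    print("averages consistent:", averages_consistent(ROWS))
    print("best row (0-based):", best_row(ROWS))
```

The tolerance of 0.051 allows for one-decimal rounding of the reported averages plus a small floating-point margin.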

MGSSL outperforms state-of-the-art baselines on eight downstream molecular-property benchmarks when pre-trained on 250k unlabeled molecules from ZINC15.
Motif-based pre-training yields the best average ROC-AUC across benchmarks, with BFS generally achieving slightly better results than DFS.
MGSSL pre-training is agnostic to the base GNN architecture and provides notable gains across GCN, GIN, RGCN, DAGNN, and GraphSAGE.
Multi-level pre-training (atom and motif levels) improves results beyond sequential or single-level pre-training, and automating loss weights via MGDA-UB helps avoid manual tuning.
The fragmentation strategy that combines BRICS with additional rules yields a vocabulary of around 12k motifs; vocabularies that are too coarse or too fine harm performance.
MGSSL pre-trained models converge faster during fine-tuning compared to baselines.
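
To make the automatic loss weighting mentioned in the findings more concrete, here is a minimal sketch of the min-norm problem that MGDA-style methods solve, using a Frank-Wolfe iteration over the simplex of task weights. It is a sketch under stated assumptions, not the authors' code: the function name `min_norm_weights`, the plain-list gradient representation, and the toy gradients are hypothetical, and in practice the gradients would come from the shared encoder rather than being hand-written.

```python
def min_norm_weights(grads, max_iter=1000, tol=1e-6):
    """Find simplex weights alpha minimizing ||sum_i alpha_i * g_i||^2.

    grads: list of equal-length lists, one (flattened) gradient per task.
    """
    n = len(grads)
    # Gram matrix of pairwise gradient dot products.
    gram = [[sum(a * b for a, b in zip(gi, gj)) for gj in grads] for gi in grads]
    alpha = [1.0 / n] * n
    for _ in range(max_iter):
        # (gram @ alpha)_i is the dot product of g_i with the combined gradient v.
        m_alpha = [sum(gram[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        t = min(range(n), key=lambda i: m_alpha[i])  # Frank-Wolfe vertex
        vv = sum(alpha[i] * m_alpha[i] for i in range(n))  # v . v
        vg = m_alpha[t]                                    # v . g_t
        gg = gram[t][t]                                    # g_t . g_t
        denom = vv - 2.0 * vg + gg                         # ||v - g_t||^2
        if denom <= 1e-12:
            break
        # Exact line search between v and g_t, clipped to [0, 1].
        gamma = min(1.0, max(0.0, (vv - vg) / denom))
        if gamma < tol:
            break
        alpha = [(1.0 - gamma) * a for a in alpha]
        alpha[t] += gamma
    return alpha


if __name__ == "__main__":
    # Hypothetical gradients for the motif-, atom-, and bond-level losses.
    toy_grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    weights = min_norm_weights(toy_grads)
    print("task weights:", [round(w, 3) for w in weights])
```

The returned weights would play the role of λ1, λ2, λ3 in L_ssl, recomputed at each training step, so no manual tuning of the loss weights is needed.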


This review was generated by AI and checked by a human editor.