QUICK REVIEW

[Paper Review] Motif-based Graph Self-Supervised Learning for Molecular Property Prediction

Zaixi Zhang, Qi Liu|arXiv (Cornell University)|Oct 3, 2021

Computational Drug Discovery Methods|Cited by 95

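One-Sentence Summary

MGSSL performs motif-based self-supervised pre-training by generating and predicting motifs in molecular graphs, and reports state-of-the-art results on the MoleculeNet benchmarks.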

ABSTRACT

Predicting molecular properties with data-driven methods has drawn much attention in recent years. Particularly, Graph Neural Networks (GNNs) have demonstrated remarkable success in various molecular generation and prediction tasks. In cases where labeled data is scarce, GNNs can be pre-trained on unlabeled molecular data to first learn the general semantic and structural information before being fine-tuned for specific tasks. However, most existing self-supervised pre-training frameworks for GNNs only focus on node-level or graph-level tasks. These approaches cannot capture the rich information in subgraphs or graph motifs. For example, functional groups (frequently-occurred subgraphs in molecular graphs) often carry indicative information about the molecular properties. To bridge this gap, we propose Motif-based Graph Self-supervised Learning (MGSSL) by introducing a novel self-supervised motif generation framework for GNNs. First, for motif extraction from molecular graphs, we design a molecule fragmentation method that leverages a retrosynthesis-based algorithm BRICS and additional rules for controlling the size of motif vocabulary. Second, we design a general motif-based generative pre-training framework in which GNNs are asked to make topological and label predictions. This generative framework can be implemented in two different ways, i.e., breadth-first or depth-first. Finally, to take the multi-scale information in molecular graphs into consideration, we introduce a multi-level self-supervised pre-training. Extensive experiments on various downstream benchmark tasks show that our methods outperform all state-of-the-art baselines.

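Research Motivation and Objectives

Use self-supervised learning to address the scarcity of labeled data in molecular property prediction.
Exploit meaningful graph motifs (functional groups) to capture semantic information beyond node-level and graph-level signals.
Develop a motif-based generative pre-training framework with topology and motif-label prediction.
Introduce multi-level (atom and motif) self-supervised pre-training to exploit multi-scale molecular information.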

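Proposed Method

Fragment molecules into semantically meaningful motifs with BRICS, and control the size of the motif vocabulary with two post-processing rules.
Build a motif tree and model its likelihood p(T(G);θ) under an autoregressive generation order (BFS or DFS).
Design topology and motif-label prediction heads for each generation step, and optimize a motif generation loss that combines a topology term and a label term.
Combine atom-level and motif-level pre-training in a multi-level objective, using MGDA-UB/Frank-Wolfe-based adaptive weighting to avoid catastrophic forgetting.
Pre-train on 250k unlabeled molecules from ZINC15 and fine-tune on eight MoleculeNet benchmarks (scaffold split).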
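
The difference between the BFS and DFS generation orders can be made concrete with a small sketch. It assumes a toy motif tree given as plain dictionaries and emits, for each generation step, a topology target (1 = the current motif gets another child, 0 = stop expanding it) and a label target (the child's motif). The names `LABELS`, `CHILDREN`, `bfs_steps`, and `dfs_steps` are hypothetical, the motif labels are placeholders rather than BRICS output, and this is an illustration of the two orderings, not the authors' implementation.

```python
from collections import deque

# Hypothetical toy motif tree (assumption): node id -> motif label,
# and node id -> list of child node ids. Labels are placeholders.
LABELS = {0: "benzene", 1: "amide", 2: "methyl", 3: "hydroxyl"}
CHILDREN = {0: [1, 2], 1: [3], 2: [], 3: []}


def bfs_steps(root, children, labels):
    """Emit (node, topology_target, label_target) steps layer by layer."""
    steps = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            # Topology target 1: this node receives another child;
            # label target: which motif that child is.
            steps.append((node, 1, labels[child]))
            queue.append(child)
        # Topology target 0: no further children for this node.
        steps.append((node, 0, None))
    return steps


def dfs_steps(root, children, labels):
    """Emit the same kind of steps, but finish each subtree first."""
    steps = []

    def visit(node):
        for child in children[node]:
            steps.append((node, 1, labels[child]))
            visit(child)  # descend before handling the next sibling
        steps.append((node, 0, None))  # backtrack from this node

    visit(root)
    return steps


if __name__ == "__main__":
    for name, fn in (("BFS", bfs_steps), ("DFS", dfs_steps)):
        print(name, fn(0, CHILDREN, LABELS))
```

Both orders yield the same set of prediction targets; only the sequence differs, which changes what context is available to the model at each step.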

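Experimental Results

Research Questions

RQ1: Do motif-based self-supervised tasks capture chemical semantics better than node-level or graph-level SSL for molecular property prediction?
RQ2: Does multi-level (atom and motif) pre-training improve downstream performance and convergence speed compared with single-level or sequential pre-training?
RQ3: How do different motif generation orders (BFS vs. DFS) affect learning and results?
RQ4: How do motif vocabulary size and the fragmentation strategy affect model effectiveness?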

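Key Findings

MGSSL outperforms all existing baselines on 7 of the 8 MoleculeNet downstream benchmarks.
MGSSL with BFS achieves a higher average ROC-AUC than MGSSL with DFS on most benchmarks.
MGSSL yields gains across a range of base GNN architectures, with the largest relative improvement on GIN.
Multi-level pre-training (atom + motif) outperforms the ablations without the atom level and with sequential pre-training.
The motif vocabulary size produced by their fragmentation strategy performs better than BRICS alone or vocabularies that are too coarse or too fine.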
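
The adaptive weighting behind the multi-level (atom + motif) objective can be illustrated for the two-task case. MGDA-style methods look for the convex combination of task gradients with the smallest norm; with only two gradients this subproblem has a closed-form solution, which is the step a Frank-Wolfe solver would apply. The sketch below assumes gradients are plain Python lists of floats, and `min_norm_weights` is a hypothetical name. It shows the general two-task min-norm step, not the paper's code.

```python
def min_norm_weights(g_atom, g_motif):
    """Return (alpha, 1 - alpha) minimizing
    ||alpha * g_atom + (1 - alpha) * g_motif||^2 over alpha in [0, 1]."""
    diff = [m - a for a, m in zip(g_atom, g_motif)]  # g_motif - g_atom
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        # Identical gradients: any weighting gives the same direction.
        return 0.5, 0.5
    # Unconstrained minimizer of the 1-D quadratic, then clip to [0, 1].
    alpha = sum(d * m for d, m in zip(diff, g_motif)) / denom
    alpha = min(1.0, max(0.0, alpha))
    return alpha, 1.0 - alpha


if __name__ == "__main__":
    # Orthogonal gradients of equal norm are weighted equally.
    print(min_norm_weights([1.0, 0.0], [0.0, 1.0]))
    # A much larger atom-level gradient receives a smaller weight.
    print(min_norm_weights([3.0, 0.0], [0.0, 1.0]))
```

The returned weights would scale the atom-level and motif-level losses at each update, so neither task dominates the shared encoder.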

This review was generated by AI and reviewed by human editors.