QUICK REVIEW

[Paper Review] Motif-based Graph Self-Supervised Learning for Molecular Property Prediction

Zaixi Zhang, Qi Liu|arXiv (Cornell University)|2021. 10. 03.

Computational Drug Discovery Methods | Citations: 95

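One-Line Summary

MGSSL brings motif-based self-supervised learning to GNNs by generating and predicting motifs within molecular graphs, and achieves state-of-the-art results on MoleculeNet benchmarks.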

ABSTRACT

Predicting molecular properties with data-driven methods has drawn much attention in recent years. Particularly, Graph Neural Networks (GNNs) have demonstrated remarkable success in various molecular generation and prediction tasks. In cases where labeled data is scarce, GNNs can be pre-trained on unlabeled molecular data to first learn the general semantic and structural information before being fine-tuned for specific tasks. However, most existing self-supervised pre-training frameworks for GNNs only focus on node-level or graph-level tasks. These approaches cannot capture the rich information in subgraphs or graph motifs. For example, functional groups (frequently-occurred subgraphs in molecular graphs) often carry indicative information about the molecular properties. To bridge this gap, we propose Motif-based Graph Self-supervised Learning (MGSSL) by introducing a novel self-supervised motif generation framework for GNNs. First, for motif extraction from molecular graphs, we design a molecule fragmentation method that leverages a retrosynthesis-based algorithm BRICS and additional rules for controlling the size of motif vocabulary. Second, we design a general motif-based generative pre-training framework in which GNNs are asked to make topological and label predictions. This generative framework can be implemented in two different ways, i.e., breadth-first or depth-first. Finally, to take the multi-scale information in molecular graphs into consideration, we introduce a multi-level self-supervised pre-training. Extensive experiments on various downstream benchmark tasks show that our methods outperform all state-of-the-art baselines.

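Research Motivation and Objectives

Address the scarcity of labeled data in molecular property prediction through self-supervised learning.
Exploit meaningful graph motifs (functional groups) to capture semantic information beyond node-level and graph-level signals.
Develop a motif-based generative pre-training framework that includes topology prediction and motif-label prediction.
Introduce multi-level self-supervised pre-training at the atom level and the motif level to exploit multi-scale molecular information.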

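Proposed Method

Fragment molecules into semantically meaningful motifs with BRICS, then apply two post-processing rules to control the size of the motif vocabulary.
Build a motif tree and model its likelihood p(T(G);θ) autoregressively, generating motifs in either BFS or DFS order.
Design topology and motif-label prediction heads for each generation step, and optimize a motif generation loss that combines the topology term and the label term.
Combine atom-level and motif-level pre-training into one multi-level objective, using adaptive weights based on MGDA-UB/Frank-Wolfe, to avoid catastrophic forgetting.
Pre-train on 250k unlabeled molecules from ZINC15 and fine-tune on 8 MoleculeNet benchmarks using scaffold-based splits.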
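
The motif generation loss described above can be illustrated with a small numeric sketch. This is not the authors' implementation: the function names, the toy probabilities, the SMILES-like motif labels, and the choice of binary cross-entropy for the topology term and categorical cross-entropy for the label term are all assumptions made for illustration.

```python
import math

def topology_loss(p_expand, has_child):
    """Binary cross-entropy for one 'does the current motif get a new child?' decision."""
    return -math.log(p_expand) if has_child else -math.log(1.0 - p_expand)

def label_loss(label_probs, true_label):
    """Categorical cross-entropy for the motif label of a newly generated child."""
    return -math.log(label_probs[true_label])

def motif_generation_loss(steps):
    """Sum the topology term over all steps and the label term over expanding steps."""
    total = 0.0
    for step in steps:
        total += topology_loss(step["p_expand"], step["has_child"])
        if step["has_child"]:
            total += label_loss(step["label_probs"], step["true_label"])
    return total

# Hypothetical two-step generation trace (values are made up for this example).
steps = [
    {
        "p_expand": 0.9,            # predicted probability that a child exists
        "has_child": True,          # ground truth: a child motif is generated
        "label_probs": {"c1ccccc1": 0.7, "C(=O)O": 0.3},
        "true_label": "c1ccccc1",
    },
    {"p_expand": 0.2, "has_child": False},  # ground truth: stop expanding
]

loss = motif_generation_loss(steps)
print(f"motif generation loss: {loss:.4f}")
```

In a real model the probabilities would come from the GNN's two prediction heads rather than from fixed numbers, and the sum would run over every step of the BFS or DFS traversal.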

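Experimental Results

Research Questions

RQ1: Do motif-based self-supervised tasks capture chemical semantics better than node-level or graph-level SSL for molecular property prediction?
RQ2: Does multi-level (atom and motif) pre-training improve downstream performance and convergence speed compared with single-level or sequential pre-training?
RQ3: How do different motif generation orders (BFS vs. DFS) affect training and results?
RQ4: How do motif vocabulary size and fragmentation strategy affect model performance?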
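
To make the two generation orders in RQ3 concrete, the sketch below lists the order in which the nodes of a small motif tree would be visited under each scheme. The tree, its node names, and the helper functions are hypothetical; they only illustrate standard breadth-first and depth-first traversal, not the paper's code.

```python
from collections import deque

# Hypothetical motif tree: each key is a motif node, each value lists its children.
MOTIF_TREE = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}

def bfs_order(tree, root):
    """Visit motifs level by level: all children of a node before any grandchildren."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order

def dfs_order(tree, root):
    """Follow one branch to its leaf before backtracking to the next sibling."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(tree[node]))
    return order

print("BFS:", bfs_order(MOTIF_TREE, "A"))
print("DFS:", dfs_order(MOTIF_TREE, "A"))
```

Both orders cover the same tree; they differ only in which already-generated motifs are available as context when each new motif is predicted.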

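Key Results

MGSSL outperforms state-of-the-art baselines on 7 of the 8 MoleculeNet downstream benchmarks.
With BFS ordering, MGSSL's average ROC-AUC across benchmarks is generally higher than with DFS ordering.
MGSSL yields gains across a range of base GNN architectures, with the largest relative improvement observed on GIN.
Multi-level pre-training (atom + motif) outperforms the ablation variants without atom-level pre-training and with sequential pre-training.
The motif vocabulary size obtained with the authors' fragmentation strategy outperforms BRICS alone as well as overly coarse or overly fine vocabularies.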
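
The multi-level result relies on adaptive weighting of the atom-level and motif-level objectives. As a rough illustration of the idea behind MGDA-style weighting, the sketch below solves the two-task min-norm problem in closed form: it finds the weight alpha in [0, 1] that minimizes the norm of alpha * g_atom + (1 - alpha) * g_motif. The function names and toy gradients are assumptions, and whether the authors use this closed form or an iterative Frank-Wolfe solver is not stated in this summary.

```python
def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def two_task_min_norm_weight(g_atom, g_motif):
    """Return alpha in [0, 1] minimizing ||alpha * g_atom + (1 - alpha) * g_motif||."""
    diff = [a - b for a, b in zip(g_atom, g_motif)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: every alpha gives the same norm, so pick equal weights.
        return 0.5
    # Unconstrained minimizer of the quadratic, then clipped to the [0, 1] interval.
    alpha = -dot(diff, g_motif) / denom
    return min(1.0, max(0.0, alpha))

# Toy gradients (made up): orthogonal gradients of equal norm get equal weights.
alpha = two_task_min_norm_weight([1.0, 0.0], [0.0, 1.0])
print("atom weight:", alpha, "motif weight:", 1.0 - alpha)
```

When one gradient is much larger than the other and points in the same direction, the clipped solution puts all weight on the smaller one, which keeps a single task from dominating the shared update.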


This review was generated by AI and reviewed by a human editor.