QUICK REVIEW

[Paper Review] De Novo Molecular Generation via Connection-aware Motif Mining

Zijie Geng, Shufang Xie | arXiv (Cornell University) | 2023-02-02

Machine Learning in Materials Science | Citations: 8

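One-line summary

MiCaM mines connection-aware motifs from large molecule libraries and uses them in a generator for de novo molecular generation, achieving state-of-the-art results on distribution-learning and goal-directed benchmarks.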

ABSTRACT

De novo molecular generation is an essential task for science discovery. Recently, fragment-based deep generative models have attracted much research attention due to their flexibility in generating novel molecules based on existing molecule fragments. However, the motif vocabulary, i.e., the collection of frequent fragments, is usually built upon heuristic rules, which brings difficulties to capturing common substructures from large amounts of molecules. In this work, we propose a new method, MiCaM, to generate molecules based on mined connection-aware motifs. Specifically, it leverages a data-driven algorithm to automatically discover motifs from a molecule library by iteratively merging subgraphs based on their frequency. The obtained motif vocabulary consists of not only molecular motifs (i.e., the frequent fragments), but also their connection information, indicating how the motifs are connected with each other. Based on the mined connection-aware motifs, MiCaM builds a connection-aware generator, which simultaneously picks up motifs and determines how they are connected. We test our method on distribution-learning benchmarks (i.e., generating novel molecules to resemble the distribution of a given training set) and goal-directed benchmarks (i.e., generating molecules with target properties), and achieve significant improvements over previous fragment-based baselines. Furthermore, we demonstrate that our method can effectively mine domain-specific motifs for different tasks.

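Motivation and goals

Advance fragment-based de novo molecular generation beyond heuristic motif vocabularies.
Develop a data-driven method for mining frequent, connection-aware motifs from large molecule libraries.
Build a generator that simultaneously selects motifs and decides how they connect, so that it assembles valid molecules.
Demonstrate strong distribution-learning and goal-directed generation performance on standard benchmarks.
Show that domain-specific motifs can be mined effectively for task-specific generation.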

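Proposed method

Mine a motif vocabulary by iteratively merging frequent adjacent subgraphs to form connection-aware motifs.
Preserve connection information by marking broken bonds with * and encoding the resulting motif graphs with GNNs.
Use a VAE framework to map molecules to latent vectors, and condition generation on z and the motif representations.
During generation, query either motif connection sites or sites on the current molecule to decide the next connection or cyclization.
Generate by either attaching a new motif or merging sites to form rings, guided by a start network and a query network.
Train with a reconstruction loss, KL-divergence regularization, and a property-prediction loss that aligns the latent space with molecular properties.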
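
The first two steps above (frequency-based merging, then marking cut bonds with *) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: molecules are hand-written heavy-atom graphs, `fragment_key` is a crude order-independent signature standing in for a canonical form such as canonical SMILES, and the names `fragment_key`, `adjacent_pairs`, `mine_motifs`, and `connection_sites` are hypothetical.

```python
from collections import Counter


def fragment_key(atoms, bonds, members):
    """Crude order-independent signature of a fragment.

    Assumption: a real pipeline would use a canonical form (e.g. canonical
    SMILES); this invariant is only adequate for the toy data below.
    """
    elems = tuple(sorted(atoms[a] for a in members))
    inner = tuple(sorted(
        tuple(sorted((atoms[u], atoms[v])))
        for u, v in bonds
        if u in members and v in members
    ))
    return elems, inner


def adjacent_pairs(bonds, frag_of):
    """Sorted pairs of distinct fragment ids joined by at least one bond."""
    pairs = set()
    for u, v in bonds:
        fu, fv = frag_of[u], frag_of[v]
        if fu != fv:
            pairs.add((min(fu, fv), max(fu, fv)))
    return sorted(pairs)


def mine_motifs(molecules, num_merges):
    """Repeatedly merge the most frequent adjacent fragment pattern."""
    # Every atom starts as its own fragment: {fragment id: set of atom ids}.
    states = [{a: frozenset([a]) for a in atoms} for atoms, _ in molecules]
    learned = []
    for _ in range(num_merges):
        # 1) Count the pattern of every adjacent fragment pair in the library.
        counts = Counter()
        for (atoms, bonds), frags in zip(molecules, states):
            frag_of = {a: f for f, members in frags.items() for a in members}
            for fi, fj in adjacent_pairs(bonds, frag_of):
                counts[fragment_key(atoms, bonds, frags[fi] | frags[fj])] += 1
        if not counts:
            break  # nothing left to merge
        # 2) Pick the most frequent pattern (ties broken by key, for determinism).
        best = max(counts, key=lambda k: (counts[k], k))
        learned.append(best)
        # 3) Merge non-overlapping occurrences of that pattern in each molecule.
        for (atoms, bonds), frags in zip(molecules, states):
            frag_of = {a: f for f, members in frags.items() for a in members}
            used = set()
            for fi, fj in adjacent_pairs(bonds, frag_of):
                if fi in used or fj in used:
                    continue
                if fragment_key(atoms, bonds, frags[fi] | frags[fj]) == best:
                    frags[fi] = frags[fi] | frags.pop(fj)
                    used.update((fi, fj))
    return learned, states


def connection_sites(atoms, bonds, members):
    """Elements of fragment atoms whose bond to the outside was cut.

    Each entry corresponds to one '*' attachment point on the motif.
    """
    return tuple(sorted(
        atoms[u] if u in members else atoms[v]
        for u, v in bonds
        if (u in members) != (v in members)
    ))


# Toy heavy-atom graphs (hypothetical data): ethanol, methanol, ethylene glycol.
molecules = [
    ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),
    ({0: "C", 1: "O"}, [(0, 1)]),
    ({0: "O", 1: "C", 2: "C", 3: "O"}, [(0, 1), (1, 2), (2, 3)]),
]
learned, states = mine_motifs(molecules, num_merges=1)
ethanol_atoms, ethanol_bonds = molecules[0]
for members in states[0].values():
    print(sorted(members), connection_sites(ethanol_atoms, ethanol_bonds, members))
```

With a single merge, the most frequent adjacent pair in this toy library is the C-O pair, so ethanol splits into a lone carbon and a C-O fragment, each carrying one connection site on a carbon atom. Raising `num_merges` grows the fragments, which is the knob the review refers to as the number of merging operations.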

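Experimental results

Research questions

RQ1: Can a data-driven merging strategy discover meaningful connection-aware motifs that yield better generation quality than heuristic vocabularies?
RQ2: Do the connection-aware motif vocabulary and the motif-aware generator improve distribution fit (KL Div, FCD) and uniqueness/novelty on standard benchmarks?
RQ3: Can jointly fitting the motif vocabulary and the network parameters to domain-specific tasks achieve state-of-the-art goal-directed generation?
RQ4: How does controlling the number of merging operations affect the balance between similarity to the training data and novelty?
RQ5: Do the greedy and distributional generation modes show a trade-off between KL Div/FCD and novelty?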

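Key results

MiCaM achieved the best KL Divergence and Fréchet ChemNet Distance (FCD) among the compared baselines on the QM9, ZINC, and GuacaMol datasets.
MiCaM improves distributional similarity to the training set while maintaining high validity, uniqueness, and novelty.
A moderate number of merging operations (about 500) yields high similarity; more operations increase motif size and similarity but can reduce novelty.
Distributional-mode generation shows higher novelty than greedy mode, while greedy mode scores slightly higher on the similarity metrics.
On goal-directed benchmarks, MiCaM obtains strong scores and, when combined with iterative target augmentation, achieves state-of-the-art results on several tasks.
In case studies, domain-specific motifs drive improvements in the target properties of complex molecules.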
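
The greedy versus distributional trade-off above can be illustrated with a small decoding sketch. Assumption (not spelled out in this review): greedy mode takes the highest-probability candidate at each step, while distributional mode samples from the predicted distribution. The motif strings, the scores, and the function names `softmax` and `pick` are hypothetical and do not come from the paper's code.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick(scores, mode, rng):
    """Choose the index of the next candidate under the given decoding mode."""
    probs = softmax(scores)
    if mode == "greedy":
        # Always take the single most likely candidate.
        return max(range(len(probs)), key=probs.__getitem__)
    if mode == "distributional":
        # Sample in proportion to the predicted probabilities.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical candidate motifs and model scores for one generation step.
candidates = ["*C(=O)O", "*c1ccccc1", "*N"]
scores = [2.0, 1.0, 0.5]

rng = random.Random(0)
greedy_choice = pick(scores, "greedy", rng)
sampled = [pick(scores, "distributional", rng) for _ in range(1000)]

print("greedy:", candidates[greedy_choice])
print("distinct sampled motifs:", len(set(sampled)))
```

Greedy decoding keeps returning the same top-scoring motif, which tends to stay close to the training distribution, whereas sampling spreads choices over lower-probability motifs, which is one plausible reading of why the distributional mode reports higher novelty.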


This review was generated by AI and checked by a human editor.