QUICK REVIEW

[Paper Review] Constrained Graph Variational Autoencoders for Molecule Design

Qi Liu, Miltiadis Allamanis | arXiv (Cornell University) | May 23, 2018
Computational Drug Discovery Methods | 31 references | 234 citations
TL;DR

CGVAE proposes a graph-structured VAE with a sequential graph generation process and hard domain-specific masks to generate valid molecules, enabling latent-space optimization of molecular properties.

ABSTRACT

Graphs are ubiquitous data structures for representing interactions between entities. With an emphasis on the use of graphs to represent chemical molecules, we explore the task of learning to generate graphs that conform to a distribution observed in training data. We propose a variational autoencoder model in which both encoder and decoder are graph-structured. Our decoder assumes a sequential ordering of graph extension steps and we discuss and analyze design choices that mitigate the potential downsides of this linearization. Experiments compare our approach with a wide range of baselines on the molecule generation task and show that our method is more successful at matching the statistics of the original dataset on semantically important metrics. Furthermore, we show that by using appropriate shaping of the latent space, our model allows us to design molecules that are (locally) optimal in desired properties.

Motivation & Objective

  • Learn to generate graphs that follow the training-data distribution while satisfying chemical validity constraints.
  • Develop a variational autoencoder where both encoder and decoder operate on graph-structured data.
  • Incorporate hard, domain-specific constraints to ensure syntactically valid molecule graphs.
  • Shape and utilize the latent space to enable optimization of numerical molecular properties.

Proposed method

  • Use gated graph neural networks (GGNNs) in both encoder and decoder of a VAE.
  • Build graphs through a sequential extension process of focus and expand decisions, with each decision conditioned only on the current partial graph rather than on the full generation history.
  • Apply hard valency-based masking to enforce chemical validity and prevent illegal graphs.
  • Train with a reconstruction objective that approximates the log-likelihood over generation traces via Monte Carlo estimates.
  • Provide a mechanism to optimize properties in latent space via a differentiable regression model and gradient ascent in z-space.
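
The valency-based masking step in the list above can be illustrated with a small sketch. This is not the paper's implementation: the valence table, the bond representation, and the function names (`remaining_valence`, `edge_mask`, `masked_softmax`) are assumptions made for illustration, and the logits are made-up numbers standing in for decoder scores. It shows only the core idea: candidate edges that would exceed an atom's valence receive probability zero before sampling.

```python
import math
import random

# Simplified maximum valences (an assumed table, not taken from the paper).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}


def remaining_valence(atoms, bonds, v):
    """Free valence of atom v, given bonds stored as {(u, w): bond_order}."""
    used = sum(order for (a, b), order in bonds.items() if v in (a, b))
    return MAX_VALENCE[atoms[v]] - used


def edge_mask(atoms, bonds, focus, candidate, order):
    """Return True if a bond of `order` from focus to candidate is allowed."""
    if focus == candidate:
        return False  # no self-loops
    if (focus, candidate) in bonds or (candidate, focus) in bonds:
        return False  # the two atoms are already bonded
    return (remaining_valence(atoms, bonds, focus) >= order
            and remaining_valence(atoms, bonds, candidate) >= order)


def masked_softmax(logits, mask):
    """Softmax in which masked-out entries get exactly zero probability."""
    allowed = [l for l, m in zip(logits, mask) if m]
    if not allowed:
        return [0.0] * len(logits)
    top = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Partial graph: C(0)=O(1) plus unattached O(2) and F(3); focus node is C(0).
atoms = ["C", "O", "O", "F"]
bonds = {(0, 1): 2}
# Candidate (target node, bond order) pairs for the focus node.
candidates = [(1, 1), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2)]
mask = [edge_mask(atoms, bonds, 0, v, k) for v, k in candidates]
logits = [2.0, 0.5, 1.0, 3.0, 0.0, 1.5]  # placeholder decoder scores
probs = masked_softmax(logits, mask)
choice = random.choices(range(len(candidates)), weights=probs, k=1)[0]
```

Because the mask is applied before normalization, the highest-scoring candidate here (a triple bond to the oxygen) can never be sampled, however confident the decoder is.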

Experimental results

Research questions

  • RQ1: Can a graph-structured VAE with sequential graph generation produce molecules that match the training distribution on chemically relevant statistics?
  • RQ2: Does masking and GGNN-based decoding improve validity, novelty, and uniqueness of generated molecules across datasets?
  • RQ3: Can the learned latent space be exploited to optimize numerical molecular properties such as QED?
  • RQ4: How does constraining graph generation influence scalability and training stability compared to non-constrained graph generators?
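
The three metrics named in RQ2 can be computed roughly as follows. This is a sketch under stated assumptions: molecules are represented as canonical identifier strings, `is_valid` is a hypothetical validity checker supplied by the caller, and novelty and uniqueness are computed over the valid samples only. Definitions and denominators vary between papers, so these choices may not match the paper's exact protocol.

```python
def generation_metrics(generated, training, is_valid):
    """Validity, novelty, and uniqueness of a list of generated molecules.

    generated, training: lists of canonical identifier strings (assumed).
    is_valid: hypothetical callable returning True for a valid molecule.
    """
    if not generated:
        return {"validity": 0.0, "novelty": 0.0, "uniqueness": 0.0}
    valid = [m for m in generated if is_valid(m)]
    if not valid:
        return {"validity": 0.0, "novelty": 0.0, "uniqueness": 0.0}
    training_set = set(training)
    return {
        # Fraction of all samples that pass the validity check.
        "validity": len(valid) / len(generated),
        # Fraction of valid samples not seen in the training data.
        "novelty": sum(m not in training_set for m in valid) / len(valid),
        # Fraction of valid samples remaining after deduplication.
        "uniqueness": len(set(valid)) / len(valid),
    }


# Toy data: one invalid string, one duplicate, one training molecule.
generated = ["CCO", "CCO", "C1CC", "CN", "CC"]
training = ["CC"]
metrics = generation_metrics(generated, training, lambda s: s != "C1CC")
```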

Key findings

  • CGVAE achieves high validity, novelty, and uniqueness across QM9, ZINC, and CEPDB datasets.
  • The model matches training-graph statistics such as atom and bond counts and ring counts, indicating faithful distribution capture.
  • Masking and sequential decoding with GGNNs are critical to performance: removing the distance features, the independence assumptions, or the GGNN each degrades results.
  • The latent space enables gradient-based optimization of properties like QED, producing molecules with higher predicted and RDKit-measured QED along a trajectory.
  • Compared to baselines, CGVAE reduces invalid molecule generations and offers a shallow, stable training process while enabling continuous optimization.
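
The latent-space optimization described in the findings above follows a simple pattern: start from a latent point, follow the gradient of a differentiable property predictor, and record the trajectory. The sketch below is a toy stand-in rather than the paper's model. The predictor is an assumed logistic function of a linear score with hand-picked weights `w` and bias `b`, and the decoding of each latent point back into a molecule is omitted.

```python
import math


def predict(z, w, b):
    """Toy property predictor: sigmoid of a linear score (assumed form)."""
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1.0 / (1.0 + math.exp(-score))


def gradient(z, w, b):
    """Analytic gradient of predict with respect to z."""
    p = predict(z, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def ascend(z0, w, b, lr=0.5, steps=10):
    """Gradient ascent in latent space; returns every visited point."""
    z = list(z0)
    trajectory = [list(z)]
    for _ in range(steps):
        g = gradient(z, w, b)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
        trajectory.append(list(z))
    return trajectory


# Hypothetical 3-dimensional latent space and predictor parameters.
w, b = [0.8, -0.3, 0.5], -0.2
trajectory = ascend([0.0, 0.0, 0.0], w, b)
predicted = [predict(z, w, b) for z in trajectory]
```

In a full pipeline each point on the trajectory would be decoded into a molecule and scored with an external tool, which is how predicted and measured property values can be compared along the path.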

This review was created by AI and reviewed by human editors.