QUICK REVIEW

[Paper Review] GenMol: A Drug Discovery Generalist with Discrete Diffusion

Seul Lee, Karsten Kreis|arXiv (Cornell University)|Jan 10, 2025
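Computational Drug Discovery Methods | Cited by 3
One-Sentence Summary

GenMol is a versatile molecular generation framework that uses discrete diffusion over SAFE representations together with fragment remasking to handle de novo generation, fragment-constrained generation, goal-directed hit generation, and lead optimization, outperforming the previous GPT-based approach.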

ABSTRACT

Drug discovery is a complex process that involves multiple stages and tasks. However, existing molecular generative models can only tackle some of these tasks. We present Generalist Molecular generative model (GenMol), a versatile framework that uses only a single discrete diffusion model to handle diverse drug discovery scenarios. GenMol generates Sequential Attachment-based Fragment Embedding (SAFE) sequences through non-autoregressive bidirectional parallel decoding, thereby allowing the utilization of a molecular context that does not rely on the specific token ordering while having better sampling efficiency. GenMol uses fragments as basic building blocks for molecules and introduces fragment remasking, a strategy that optimizes molecules by regenerating masked fragments, enabling effective exploration of chemical space. We further propose molecular context guidance (MCG), a guidance method tailored for masked discrete diffusion of GenMol. GenMol significantly outperforms the previous GPT-based model in de novo generation and fragment-constrained generation, and achieves state-of-the-art performance in goal-directed hit generation and lead optimization. These results demonstrate that GenMol can tackle a wide range of drug discovery tasks, providing a unified and versatile approach for molecular design. Our code is available at https://github.com/NVIDIA-Digital-Bio/genmol.

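Motivation and Objectives

  • Build a single, general-purpose molecular generator that handles multiple drug discovery tasks within one unified framework.
  • Use discrete diffusion to generate SAFE sequences non-autoregressively and bidirectionally.
  • Introduce fragment remasking to explore chemical space effectively at the fragment level.
  • Show that a single GenMol model can outperform task-specific baselines across diverse drug discovery scenarios.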

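Proposed Method

  • Apply discrete diffusion to the SAFE molecular representation, using a BERT-style denoising network to generate SAFE sequences.
  • Use the forward masking process and reverse unmasking process of masked diffusion, trained with a NELBO objective (an MLM loss weighted across timesteps).
  • Adopt non-autoregressive, bidirectional parallel decoding for efficiency, exploiting the fact that the order of fragments in a SAFE sequence does not matter.
  • Introduce fragment remasking, which replaces a fragment with a block of mask tokens and regenerates it through discrete diffusion, so that exploration happens at the fragment level.
  • At inference, use confidence sampling with a softmax temperature and Gumbel-based randomness to unmask the top N tokens at each step, balancing quality against diversity.
  • Build a dynamic fragment vocabulary that is updated during generation, enabling exploration beyond the initial fragments.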
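
The forward masking process and the time-weighted MLM loss can be sketched in a few lines. This is a minimal illustration, not GenMol's training code: it assumes a linear schedule in which each token is masked independently with probability t (so the loss weight becomes 1/t), which this review does not state. The names `forward_mask`, `weighted_mlm_loss`, and `uniform_predictor` are hypothetical, and the predictor is a toy stand-in for the BERT-style network.

```python
import math
import random

MASK = "[MASK]"
VOCAB = ["C", "O", "N", "c", "1", "."]  # toy vocabulary, for illustration only


def forward_mask(tokens, t, rng):
    """Forward process: mask each token independently with probability t.

    Assumes a linear schedule (keep probability 1 - t); the real schedule
    may differ.
    """
    return [MASK if rng.random() < t else tok for tok in tokens]


def uniform_predictor(noised):
    """Toy stand-in for the denoising network: uniform over VOCAB."""
    return [{v: 1.0 / len(VOCAB) for v in VOCAB} for _ in noised]


def weighted_mlm_loss(tokens, noised, predict_probs, t):
    """Masked cross-entropy at noise level t, weighted by 1/t.

    Only masked positions contribute, as in a standard MLM loss.
    The 1/t weight follows from the assumed linear schedule.
    """
    probs = predict_probs(noised)
    nll = 0.0
    for clean, seen, p in zip(tokens, noised, probs):
        if seen == MASK:
            nll -= math.log(p[clean])
    return nll / t


# Usage: noise one toy sequence at t = 0.5 and score the toy predictor.
rng = random.Random(0)
clean = ["c", "1", "c", "c", "c", "c", "c", "1", ".", "C", "O"]
noised = forward_mask(clean, 0.5, rng)
loss = weighted_mlm_loss(clean, noised, uniform_predictor, 0.5)
```

In training, t would be sampled per sequence and the loss averaged over a batch; a smaller t masks fewer tokens but weights each masked position more heavily.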
Figure 1 : Results on various drug discovery tasks. The values are quality, average quality, sum AUC top-10, and success rate for de novo generation, fragment-constrained generation, hit generation, and lead optimization, respectively. The “best baseline” refers to multiple best-performing task-spec

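Experimental Results

Research Questions

  • RQ1: Can a single generative model be applied effectively to multiple drug discovery tasks, including de novo generation, fragment-constrained generation, hit generation, and lead optimization?
  • RQ2: Does discrete diffusion over SAFE representations with bidirectional, non-autoregressive decoding improve generation quality and efficiency compared with autoregressive, GPT-style baselines?
  • RQ3: Is fragment remasking better than token-level remasking at exploring chemical space for lead optimization and hit generation?
  • RQ4: How does GenMol balance quality and diversity across tasks and sampling settings?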

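Key Findings

  • GenMol significantly outperforms the earlier GPT-based SAFE-GPT on several tasks, including de novo generation and fragment-constrained generation.
  • GenMol achieves state-of-the-art performance in goal-directed hit generation and lead optimization.
  • Non-autoregressive, bidirectional decoding combined with discrete diffusion makes sampling faster and makes better use of molecular context.
  • Fragment remasking enables effective fragment-level exploration of chemical space and outperforms token-level remasking on optimization tasks.
  • GenMol maintains near-perfect uniqueness in de novo generation and shows a strong balance between quality and diversity across settings.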
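
The difference between fragment remasking and token-level remasking can be made concrete with a toy example. The sketch below is an illustration under stated assumptions, not GenMol's implementation: it treats a SAFE-like string as dot-separated fragments, uses character-level tokens (a simplification that would mis-split multi-character atoms such as Cl or Br), and takes the mask block length as a free parameter. The function names and the example string are hypothetical.

```python
import random

MASK = "[MASK]"


def fragment_remask(safe_like, frag_index, mask_len):
    """Replace one dot-separated fragment with a block of mask tokens.

    Tokens are single characters here, which is a simplification.
    """
    fragments = [list(f) for f in safe_like.split(".")]
    fragments[frag_index] = [MASK] * mask_len
    tokens = []
    for i, frag in enumerate(fragments):
        if i > 0:
            tokens.append(".")  # keep the fragment separator
        tokens.extend(frag)
    return tokens


def token_remask(safe_like, k, rng):
    """Baseline: mask k individual tokens, ignoring fragment boundaries."""
    tokens = list(safe_like)
    for i in rng.sample(range(len(tokens)), k):
        tokens[i] = MASK
    return tokens


# Usage on a toy three-fragment string (illustrative, not validated as SAFE).
toy = "c13ccccc1.C34.O4"
by_fragment = fragment_remask(toy, 1, 2)
by_token = token_remask(toy, 3, random.Random(0))
```

Fragment remasking leaves every other fragment intact and asks the model to regenerate one whole building block, whereas token-level remasking can scatter masks across fragments and leave partially corrupted pieces behind.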
Figure 2 : (a) GenMol architecture. GenMol adopts the BERT architecture and is trained with the NELBO loss of masked discrete diffusion. (b) Generation process of GenMol. Under masked discrete diffusion, GenMol completes a molecule by simulating backward in time and predicting masked tokens at each
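
The generation process in Figure 2(b), combined with the confidence sampling described under the proposed method, can be sketched as a loop that unmasks the N most confident positions per step. This is a MaskGIT-style illustration under assumptions: the confidence score (log-probability of the sampled token plus scaled Gumbel noise) and the names `unmask_step`, `generate`, and `logits_fn` are hypothetical, and GenMol's exact scoring rule may differ.

```python
import math
import random

MASK = "[MASK]"


def softmax(logits, temperature):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def gumbel(rng):
    """Draw one standard Gumbel sample, clamping to avoid log(0)."""
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def unmask_step(tokens, logits_fn, vocab, n, temperature, randomness, rng):
    """Sample a token for every masked position, then commit the top n.

    Confidence = log-probability of the sampled token + randomness * Gumbel
    noise (an assumed scoring rule).
    """
    logits = logits_fn(tokens)
    candidates = []
    for pos, tok in enumerate(tokens):
        if tok != MASK:
            continue
        probs = softmax(logits[pos], temperature)
        idx = rng.choices(range(len(vocab)), weights=probs)[0]
        score = math.log(probs[idx]) + randomness * gumbel(rng)
        candidates.append((score, pos, vocab[idx]))
    candidates.sort(reverse=True)
    out = list(tokens)
    for _, pos, tok in candidates[:n]:
        out[pos] = tok
    return out


def generate(tokens, logits_fn, vocab, n, temperature, randomness, rng):
    """Repeat unmasking steps until no mask tokens remain (requires n >= 1)."""
    while MASK in tokens:
        tokens = unmask_step(tokens, logits_fn, vocab, n, temperature, randomness, rng)
    return tokens
```

Raising `temperature` or `randomness` injects more variation into which tokens are chosen and in what order they are committed, which trades quality for diversity; a larger `n` finishes in fewer steps.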


This review was generated by AI and reviewed by a human editor.