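[Paper Explained] From Tokens to Blocks: A Block-Diffusion Perspective on Molecular Generation
SoftMol proposes a block-diffusion molecular language model (SoftBD) built on a soft-fragment SMILES representation, together with a gated Monte Carlo tree search for target-aware generation. It reports state-of-the-art results in de novo molecular design and target-specific generation, with 100% validity and faster sampling.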
Drug discovery can be viewed as a combinatorial search over an immense chemical space, motivating the development of deep generative models for de novo molecular design. Among these, GPT-based molecular language models (MLMs) have shown strong molecular design performance by learning chemical syntax and semantics from large-scale data. However, existing MLMs face two fundamental limitations: they inadequately capture the graph-structured nature of molecules when formulated as next-token prediction problems, and they typically lack explicit mechanisms for target-aware generation. Here, we propose SoftMol, a unified framework that co-designs molecular representation, model architecture, and search strategy for target-aware molecular generation. SoftMol introduces soft fragments, a rule-free block representation of SMILES that enables diffusion-native modeling, and develops SoftBD, the first block-diffusion molecular language model that combines local bidirectional diffusion with autoregressive generation under molecular structural constraints. To favor generated molecules with high drug-likeness and synthetic accessibility, SoftBD is trained on a carefully curated dataset named ZINC-Curated. SoftMol further integrates a gated Monte Carlo tree search to assemble fragments in a target-aware manner. Experimental results show that, compared with current state-of-the-art models, SoftMol achieves 100% chemical validity, improves binding affinity by 9.7%, yields a 2-3x increase in molecular diversity, and delivers a 6.6x speedup in inference efficiency. Code is available at https://github.com/szu-aicourse/softmol
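Research Motivation and Goals
- Move molecular generation beyond autoregressive next-token prediction by capturing graph structure more faithfully.
- Co-design the molecular representation, model architecture, and search strategy for target-aware design.
- Introduce diffusion-style modeling over fixed-length blocks to achieve chemical validity within a bounded molecular space.
- Demonstrate state-of-the-art performance in de novo molecular design and protein-targeted design under pharmacological constraints.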
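Proposed Method
- Define soft fragments by partitioning a SMILES string into contiguous fixed-length blocks, without heuristic fragmentation rules.
- Implement SoftBD, a block-diffusion transformer with bidirectional attention within each block and causal attention across blocks, to model local chemical substructures.
- Train SoftBD on ZINC-Curated to improve drug-likeness and synthetic accessibility.
- Generate blocks semi-autoregressively with adaptive confidence decoding, combining first-hitting sampling with greedy, confidence-ordered unmasking.
- Use a gated Monte Carlo tree search (MCTS) with a tunable feasibility gate to assemble fragments toward a target protein.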
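The first, second, and fourth steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `[MASK]` and `[PAD]` tokens, the block size of 8, the character-level tokenization, and the `toy_predict` stand-in for the denoiser are all assumptions made for illustration. The attention mask shows only the conceptual visibility pattern at generation time (full attention inside a block, causal attention across blocks), and first-hitting sampling is not modeled.

```python
from typing import Callable, Dict, List, Tuple

MASK, PAD = "[MASK]", "[PAD]"  # hypothetical special tokens, not from the paper


def soft_fragments(tokens: List[str], block_size: int) -> List[List[str]]:
    """Cut a token sequence into contiguous fixed-length blocks (no chemistry rules)."""
    blocks = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        blocks.append(block + [PAD] * (block_size - len(block)))  # pad the last block
    return blocks


def block_attention_mask(n_tokens: int, block_size: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Same block: visible in both directions. Earlier block: visible. Later block: hidden.
    """
    return [[j // block_size <= i // block_size for j in range(n_tokens)]
            for i in range(n_tokens)]


Proposals = Dict[int, Tuple[str, float]]  # masked position -> (token, confidence)


def decode_block(block_size: int, predict: Callable[[List[str]], Proposals],
                 tokens_per_step: int = 1) -> List[str]:
    """Greedy, confidence-ordered unmasking of one block."""
    block = [MASK] * block_size
    while MASK in block:
        proposals = predict(block)
        if not proposals:  # defensive stop if the predictor returns nothing
            break
        # Commit the most confident proposals first; the rest stay masked.
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (token, _conf) in ranked[:tokens_per_step]:
            block[pos] = token
    return block


# Character-level splitting is a simplification; real SMILES tokenizers keep
# multi-character atoms such as "Cl" or "[nH]" together.
aspirin = list("CC(=O)Oc1ccccc1C(=O)O")
blocks = soft_fragments(aspirin, block_size=8)
mask = block_attention_mask(len(aspirin), block_size=8)


def toy_predict(block: List[str]) -> Proposals:
    """Stand-in for the denoiser: knows the answer and is most confident on the left."""
    target = blocks[0]
    return {i: (target[i], 1.0 / (1 + i)) for i, t in enumerate(block) if t == MASK}


decoded = decode_block(8, toy_predict, tokens_per_step=2)
print(["".join(b) for b in blocks])
print("".join(decoded))
```

Because the cut points depend only on position, a block may start or end in the middle of a ring or branch. That is the sense in which the fragments are "soft": chemical consistency has to be learned by the model rather than guaranteed by the splitting rule.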
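Experimental Results
Research Questions
- RQ1: Compared with token-based MLMs, how does the block-diffusion representation affect chemical validity and model robustness?
- RQ2: Can a target-aware generation pipeline combine diffusion-native modeling with constrained search to improve binding affinity and drug-likeness?
- RQ3: What is the trade-off between representation granularity (soft-fragment length) and generation and inference efficiency?
- RQ4: Does combining a feasibility gate with MCTS improve hit rate and diversity in targeted molecular design?
Key Findings
| Method | Validity (%) | Uniqueness (%) | Quality (%) | Docking filter (%) | Diversity |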
|---|---|---|---|---|---|
| SAFE-GPT (Noutahi et al., 2024) | 93.2±0.1 | 100.0±0.0 | 54.4±0.6 | 78.3±0.5 | 0.879±0.000 |
| GenMol (Lee et al., 2025) | 99.9±0.1 | 96.0±0.3 | 85.2±0.4 | 97.8±0.1 | 0.817±0.000 |
| SoftBD (p=1.0, τ=0.9) | 99.8±0.0 | 100.0±0.0 | 87.1±0.2 | 98.5±0.1 | 0.871±0.000 |
| SoftBD (p=1.0, τ=1.0) | 99.6±0.0 | 100.0±0.0 | 84.7±0.2 | 97.8±0.1 | 0.878±0.000 |
| SoftBD (p=1.0, τ=1.1) | 99.1±0.0 | 100.0±0.0 | 81.7±0.3 | 96.5±0.1 | 0.883±0.000 |
| SoftBD (p=1.0, τ=1.2) | 98.3±0.0 | 100.0±0.0 | 77.7±0.3 | 94.2±0.2 | 0.888±0.000 |
| SoftBD (p=1.0, τ=1.3) | 96.7±0.1 | 100.0±0.0 | 72.9±0.3 | 91.1±0.2 | 0.893±0.000 |
| SoftBD (p=0.95, τ=0.9) | 100.0±0.0 | 98.4±0.1 | 93.5±0.2 | 99.8±0.0 | 0.844±0.000 |
| SoftBD (p=0.95, τ=1.0) | 100.0±0.0 | 99.4±0.1 | 92.8±0.0 | 99.7±0.0 | 0.851±0.000 |
| SoftBD (p=0.95, τ=1.1) | 100.0±0.0 | 99.6±0.1 | 91.9±0.1 | 99.6±0.0 | 0.858±0.000 |
| SoftBD (p=0.95, τ=1.2) | 99.9±0.0 | 99.8±0.0 | 90.8±0.1 | 99.3±0.1 | 0.867±0.000 |
| SoftBD (p=0.95, τ=1.3) | 99.9±0.0 | 99.8±0.1 | 88.9±0.2 | 98.9±0.1 | 0.871±0.000 |
| SoftBD (p=0.9, τ=0.9) | 100.0±0.0 | 90.0±0.2 | 94.9±0.2 | 99.9±0.0 | 0.829±0.000 |
| SoftBD (p=0.9, τ=1.0) | 100.0±0.0 | 96.0±0.1 | 94.0±0.2 | 99.8±0.0 | 0.839±0.000 |
| SoftBD (p=0.9, τ=1.1) | 100.0±0.0 | 98.0±0.1 | 93.3±0.3 | 99.8±0.0 | 0.846±0.000 |
| SoftBD (p=0.9, τ=1.2) | 100.0±0.0 | 99.1±0.1 | 92.4±0.2 | 99.7±0.1 | 0.852±0.000 |
| SoftBD (p=0.9, τ=1.3) | 100.0±0.0 | 99.3±0.1 | 91.7±0.2 | 99.6±0.0 | 0.858±0.000 |
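The table shows quality and diversity pulling in opposite directions as p and τ change. One way to read it is to keep only the rows that no other row beats on both metrics at once, a Pareto front. The sketch below does this with the quality and diversity means copied from the table; it ignores the ± spreads and the other columns, and the row labels are shorthand introduced here.

```python
# (label, quality %, diversity) copied from the table above; "t" stands for τ.
rows = [
    ("SAFE-GPT", 54.4, 0.879),
    ("GenMol", 85.2, 0.817),
    ("SoftBD p=1.0 t=0.9", 87.1, 0.871),
    ("SoftBD p=1.0 t=1.0", 84.7, 0.878),
    ("SoftBD p=1.0 t=1.1", 81.7, 0.883),
    ("SoftBD p=1.0 t=1.2", 77.7, 0.888),
    ("SoftBD p=1.0 t=1.3", 72.9, 0.893),
    ("SoftBD p=0.95 t=0.9", 93.5, 0.844),
    ("SoftBD p=0.95 t=1.0", 92.8, 0.851),
    ("SoftBD p=0.95 t=1.1", 91.9, 0.858),
    ("SoftBD p=0.95 t=1.2", 90.8, 0.867),
    ("SoftBD p=0.95 t=1.3", 88.9, 0.871),
    ("SoftBD p=0.9 t=0.9", 94.9, 0.829),
    ("SoftBD p=0.9 t=1.0", 94.0, 0.839),
    ("SoftBD p=0.9 t=1.1", 93.3, 0.846),
    ("SoftBD p=0.9 t=1.2", 92.4, 0.852),
    ("SoftBD p=0.9 t=1.3", 91.7, 0.858),
]


def dominates(a, b):
    """a dominates b if it is no worse on both metrics and strictly better on one."""
    return a[1] >= b[1] and a[2] >= b[2] and (a[1] > b[1] or a[2] > b[2])


# Keep every row that no other row dominates.
front = [r for r in rows if not any(dominates(other, r) for other in rows)]

for label, quality, diversity in sorted(front, key=lambda r: -r[1]):
    print(f"{label:22s} quality={quality:5.1f} diversity={diversity:.3f}")
```

On these means, neither baseline row lies on the front: each is matched or beaten on both quality and diversity by at least one SoftBD setting. Which point on the front is preferable depends on how much diversity a given application is willing to trade for quality.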
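- SoftBD reaches 100% chemical validity in 8 of the 15 sampling configurations reported above (every p=0.9 setting, and p=0.95 with τ ≤ 1.1); its lowest value is 96.7% at p=1.0, τ=1.3.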
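- SoftMol improves docking affinity by 9.7% over baselines in the de novo and target-aware settings.
- Diversity increases 2-3x compared with leading baselines.
- When sampling 10k molecules, inference is about 6.6x faster than GenMol (discrete diffusion).
- SoftMol maintains high uniqueness on target-specific tasks: close to 3,000 attempts yield close to 3,000 unique candidate molecules.
- Training on the high-quality ZINC-Curated set with block-diffusion modeling achieves state-of-the-art performance in both de novo and targeted molecular design.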
This summary was generated by AI and reviewed by a human editor.