QUICK REVIEW

[Paper Review] GP-MoLFormer: A Foundation Model For Molecular Generation

Jerret Ross, Brian Belgodere|arXiv (Cornell University)|Apr 4, 2024

Nanomaterials for catalytic reactions | Cited by 9

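One-Sentence Summary

GP-MoLFormer is a 46.8M-parameter autoregressive SMILES generator trained on 0.65–1.1B canonicalized SMILES; the paper analyzes its memorization and training-data bias and evaluates it on three molecular generation tasks, covering de novo generation, scaffold-constrained decoration, and property-guided optimization via pair-tuning.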

ABSTRACT

Transformer-based models trained on large and general purpose datasets consisting of molecular strings have recently emerged as a powerful tool for successfully modeling various structure-property relations. Inspired by this success, we extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks in this work. Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator that is trained on more than 1.1B (billion) chemical SMILES. GP-MoLFormer uses a 46.8M parameter transformer decoder model with linear attention and rotary positional encodings as the base architecture. GP-MoLFormer's utility is evaluated and compared with that of existing baselines on three different tasks: de novo generation, scaffold-constrained molecular decoration, and unconstrained property-guided optimization. While the first two are handled with no additional training, we propose a parameter-efficient fine-tuning method for the last task, which uses property-ordered molecular pairs as input. We call this new approach pair-tuning. Our results show GP-MoLFormer performs better or comparable with baselines across all three tasks, demonstrating its general utility for a variety of molecular generation tasks. We further report strong memorization of training data in GP-MoLFormer generations, which has so far remained unexplored for chemical language models. Our analyses reveal that training data memorization and novelty in generations are impacted by the quality and scale of the training data; duplication bias in training data can enhance memorization at the cost of lowering novelty. We further establish a scaling law relating inference compute and novelty in generations.

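Motivation and Objectives

Understand how scale and training-data bias affect memorization and generation in large chemical language models.
Demonstrate the quality and diversity of de novo molecular generation at scale.
Evaluate scaffold-constrained decoration and unconstrained property-guided optimization, the latter with a parameter-efficient tuning method.
Provide practical insight into how data deduplication affects novelty and memorization in chemical language models.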

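Proposed Method

Decoder-only Transformer with 12 layers, 12 attention heads, and hidden dimension 768, using linear attention based on generalized random Fourier features.
Rotary positional embeddings to model dependencies between SMILES tokens.
Autoregressive causal language-modeling objective: predict the next token given the preceding tokens.
Pre-training on 0.65–1.1B canonicalized SMILES from public databases.
Evaluation of how memorization and novelty vary with training-data quality and generation-pool size.
Pair-tuning: a prompt-tuning method that learns enhancement tokens to steer generation toward property-optimized molecules without fine-tuning the whole model.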
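
To make the linear-attention item above concrete, here is a minimal sketch of causal linear attention in plain Python. It is an illustration under stated assumptions, not the model's implementation: the feature map `elu(x) + 1` is a simple stand-in for the generalized random Fourier features named above, and the function names `feature_map` and `causal_linear_attention` are hypothetical. The point it shows is that running sums over past keys and values let each position be processed in constant time, so the whole sequence costs linear rather than quadratic time.

```python
import math


def feature_map(x):
    """elu(x) + 1, a positive feature map (stand-in, NOT the paper's random features)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention(queries, keys, values):
    """Causal attention computed with running sums instead of an n x n score matrix.

    queries, keys: lists of n vectors of size d.
    values: list of n vectors of size m.
    Returns a list of n output vectors of size m.
    """
    d = len(keys[0])
    m = len(values[0])
    # S[a][b] accumulates sum_j phi(k_j)[a] * v_j[b] over positions j seen so far.
    S = [[0.0] * m for _ in range(d)]
    # z[a] accumulates sum_j phi(k_j)[a]; it provides the normalizer.
    z = [0.0] * d
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = feature_map(k)
        for a in range(d):
            z[a] += phi_k[a]
            for b in range(m):
                S[a][b] += phi_k[a] * v[b]
        phi_q = feature_map(q)
        # The normalizer is strictly positive because the feature map is positive.
        norm = sum(phi_q[a] * z[a] for a in range(d))
        outputs.append(
            [sum(phi_q[a] * S[a][b] for a in range(d)) / norm for b in range(m)]
        )
    return outputs
```

Because position i only sees the sums accumulated up to i, the output at the first position is exactly the first value vector, and every later output is a weighted average of the values seen so far.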

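Experimental Results

Research Questions

RQ1: How do training-data scale and deduplication affect memorization and novelty in large generative chemical language models?
RQ2: Can GP-MoLFormer generate novel, valid, and diverse molecules in billion-scale generation pools?
RQ3: Is GP-MoLFormer competitive with baselines on de novo generation, scaffold-constrained decoration, and unconstrained property-guided optimization?
RQ4: Can pair-tuning achieve efficient property optimization without full-model fine-tuning?

Key Findings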

Training Size	Generation Size	Novel	Unique	Valid
650M	30k	0.323	0.997	0.997
650M	100k	0.326	0.998	0.998
650M	1M	0.323	0.996	0.997
650M	10M	0.322	0.989	0.997
1.1B	30k	0.323	0.997	0.997
1.1B	100k	0.326	0.998	0.998
1.1B	1M	0.323	0.996	0.997
1.1B	10M	0.322	0.956	0.997
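
As a rough guide to how the Novel, Unique, and Valid columns above can be computed, the following sketch uses common MOSES-style definitions. These definitions are an assumption and may differ in detail from the paper's. The function name `generation_metrics` and the `is_valid` callable are hypothetical; in practice a validity check could be built on RDKit (for example, testing whether `Chem.MolFromSmiles(s)` returns `None`), and all SMILES are assumed to be canonicalized beforehand so that string equality means molecular identity.

```python
def generation_metrics(generated, training_set, is_valid):
    """Compute validity, uniqueness, and novelty for a pool of generated SMILES.

    generated: list of SMILES strings (assumed already canonicalized).
    training_set: set of canonical training SMILES.
    is_valid: callable returning True for a chemically valid SMILES (hypothetical helper).
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - training_set  # unique valid molecules never seen in training
    return {
        # Fraction of all generated strings that parse as valid molecules.
        "valid": len(valid) / len(generated) if generated else 0.0,
        # Fraction of valid strings that are distinct.
        "unique": len(unique) / len(valid) if valid else 0.0,
        # Fraction of distinct valid molecules absent from the training set.
        "novel": len(novel) / len(unique) if unique else 0.0,
    }


# Toy usage: the validity check here is a placeholder, not real chemistry.
toy_metrics = generation_metrics(
    generated=["CCO", "CCO", "c1ccccc1", "C(("],
    training_set={"CCO"},
    is_valid=lambda s: s != "C((",
)
```

Under these definitions, one minus the novelty is the share of distinct generated molecules that exactly match a training molecule, which is one way to read the memorization discussed in this section.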

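GP-MoLFormer generates novel, valid, and unique SMILES even when generating as many as 10 billion molecules; validity stays above 99% across all pool sizes.
Novelty is about 32% with the raw training data and improves slightly, by roughly 7–8%, after the training data is deduplicated (Clean).
At 10M generations, novelty is 0.322 for Raw and close to 0.322, slightly higher, for Clean; memorization shows up as a high rate of exact matches with the training data, reaching up to 60%.
Deduplicating the training data reduces memorization bias: it improves novelty by reducing the over-representation of certain molecules on the data manifold.
GP-MoLFormer matches or exceeds baselines on de novo generation, scaffold-constrained decoration, and unconstrained property optimization.
Pair-tuning achieves competitive or superior results on penalized logP, QED, and DRD2 activity optimization without full fine-tuning; the paper supports this with tabulated comparisons against multiple baselines.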
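
The abstract describes pair-tuning as taking property-ordered molecular pairs as input. The sketch below illustrates only that data-preparation step, under assumptions: the pairing rule (all pairs whose property gap exceeds `min_gap`), the separator token `<sep>`, the `<enh0>`-style placeholders for learned enhancement tokens, and both function names are hypothetical and not taken from the paper. In an actual prompt-tuning setup the enhancement tokens would be trainable embeddings while the base model stays frozen.

```python
from itertools import combinations


def build_property_ordered_pairs(scored, min_gap=0.0):
    """Build (worse, better) pairs from (smiles, property_value) tuples.

    Pairs whose property difference is at most min_gap are skipped, so each
    kept pair has a clear ordering. The pairing rule is an assumption.
    """
    pairs = []
    for (s_a, p_a), (s_b, p_b) in combinations(scored, 2):
        if abs(p_a - p_b) <= min_gap:
            continue
        pairs.append((s_a, s_b) if p_a < p_b else (s_b, s_a))
    return pairs


def format_training_example(pair, num_enhancement_tokens=4, sep="<sep>"):
    """Lay out one example as: enhancement tokens, worse molecule, separator, better molecule.

    The layout and token names are hypothetical placeholders for illustration.
    """
    prompt = " ".join(f"<enh{i}>" for i in range(num_enhancement_tokens))
    worse, better = pair
    return f"{prompt} {worse} {sep} {better}"


# Toy usage with made-up property values (not real QED or logP scores).
toy_scored = [("CCO", 0.4), ("CCN", 0.6), ("CCC", 0.4)]
toy_pairs = build_property_ordered_pairs(toy_scored)
toy_example = format_training_example(toy_pairs[0], num_enhancement_tokens=2)
```

At inference time, the same layout with only the seed molecule and separator filled in would ask the model to continue with a molecule that scores higher on the property.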

This review was generated by AI and checked by human editors.