QUICK REVIEW

[论文解读] Regression Transformer enables concurrent sequence regression and generation for molecular language modelling

Jannis Born, Matteo Manica|arXiv (Cornell University)|Feb 1, 2022

Machine Learning in Materials Science参考文献 68被引用 8

一句话总结

回归变换器（RT）提出了一种新颖的多任务框架，通过将回归建模为条件序列建模任务，同时实现序列回归与条件序列生成。该方法在性质预测任务中达到最先进性能，并在基于性质驱动的分子生成任务中超越专用模型，使用单一统一架构在小分子、蛋白质和化学反应中展现出强大的零样本泛化能力。

ABSTRACT

Despite significant progress of generative models in the natural sciences, their controllability remains challenging. One fundamentally missing aspect of molecular or protein generative models is an inductive bias that can reflect continuous properties of interest. To that end, we propose the Regression Transformer (RT), a novel method that abstracts regression as a conditional sequence modeling problem. This introduces a new paradigm of multitask language models which seamlessly bridge sequence regression and conditional sequence generation. We thoroughly demonstrate that, despite using a nominal-scale training objective, the RT matches or surpasses the performance of conventional regression models in property prediction tasks of small molecules, proteins and chemical reactions. Critically, priming the same model with continuous properties yields a highly competitive conditional generative model that outperforms specialized approaches in a substructure-constrained, property-driven molecule generation benchmark. Our dichotomous approach is facilitated by a novel, alternating training scheme that enables the model to decorate seed sequences by desired properties, e.g., to optimize reaction yield. In sum, the RT is the first report of a multitask model that concurrently excels at predictive and generative tasks in biochemistry. This finds particular application in property-driven, local exploration of the chemical or protein space and could pave the road toward foundation models in material design. The code to reproduce all experiments of the paper is available at: https://github.com/IBM/regression-transformer

研究动机与目标

解决分子与蛋白质生成模型中对连续性质缺乏归纳偏置的问题。
通过在单一架构中统一回归与条件生成，弥合生物化学中预测与生成建模之间的语义鸿沟。
通过单一模型实现对化学与蛋白质空间的性质驱动、局部探索，该模型在预测与生成任务中均表现出色。
开发一种训练方案，使同一模型能够通过共享的序列建模目标无缝切换回归与生成任务。

提出的方法

通过同时对输入序列和目标数值进行条件化，将回归建模为条件序列建模问题。
采用一种新颖的交替训练策略，在预训练过程中交替执行掩码数值标记的预测（回归）与掩码序列标记的生成（生成）。
模型采用共享参数的Transformer编码器-解码器架构，同时支持两类任务，实现参数效率与联合优化。
在多个数据集上进行微调，包括MoleculeNet、Boman、TAPE以及化学反应产率预测基准。
利用SMILES与蛋白质序列的分词方法，结合学习的嵌入表示，并为回归与生成两个分支分别采用掩码语言建模目标。
通过输入连续性质值（如溶解度、pLogP）对模型进行引导，实现具有期望性质的分子条件生成。

实验结果

研究问题

RQ1单一神经网络架构能否在分子与蛋白质建模中有效同时执行序列回归与条件序列生成？
RQ2将回归建模为条件序列建模是否相比独立模型能提升泛化能力与性能表现？
RQ3在子结构约束条件下，统一模型能否在基于性质驱动的分子生成任务中超越专用模型？
RQ4交替训练策略在使模型同时学习回归与生成任务方面的有效性如何？
RQ5RT在包括小分子、蛋白质与化学反应在内的多样化生化领域中，其泛化能力达到何种程度？

主要发现

在小分子、蛋白质与化学反应的性质预测任务中，RT在MoleculeNet数据集上达到或超越传统回归模型的性能，实现最先进水平。
在性质优化基准测试中，RT在生成具有最大pLogP值的分子方面优于专用的条件生成模型，同时保持与初始分子的结构相似性。
在子结构约束条件下，RT成功生成pLogP值高于3.0的分子比例达到92.3%，较基线方法高出超过15个百分点。
在条件生成任务中，RT能够生成具有期望溶解度（QED）与稳定性（Boman指数）的分子，且化学结构有效且具有多样性。
交替训练策略使模型能够有效学习回归与生成两类任务，且两类任务均无性能下降。
RT可泛化至自然语言任务，如生成具有特定趣味性评分的文本，展现出超越化学领域的广泛适用性。

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。