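[Paper Explained] Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation
FoldFlow-2 is a sequence-conditioned SE(3)-equivariant flow matching model that generates protein backbones conditioned on an amino acid sequence. It reaches state-of-the-art unconditional generation and performs well on conditional design tasks, including motif scaffolding and zero-shot equilibrium conformation sampling.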
Proteins are essential for almost all biological processes and derive their diverse functions from complex 3D structures, which are in turn determined by their amino acid sequences. In this paper, we exploit the rich biological inductive bias of amino acid sequences and introduce FoldFlow-2, a novel sequence-conditioned SE(3)-equivariant flow matching model for protein structure generation. FoldFlow-2 presents substantial new architectural features over the previous FoldFlow family of models including a protein large language model to encode sequence, a new multi-modal fusion trunk that combines structure and sequence representations, and a geometric transformer based decoder. To increase diversity and novelty of generated samples -- crucial for de-novo drug design -- we train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works, containing both known proteins in PDB and high-quality synthetic structures achieved through filtering. We further demonstrate the ability to align FoldFlow-2 to arbitrary rewards, e.g. increasing secondary structures diversity, by introducing a Reinforced Finetuning (ReFT) objective. We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models, improving over RFDiffusion in terms of unconditional generation across all metrics including designability, diversity, and novelty across all protein lengths, as well as exhibiting generalization on the task of equilibrium conformation sampling. Finally, we demonstrate that a fine-tuned FoldFlow-2 makes progress on challenging conditional design tasks such as designing scaffolds for the VHH nanobody.
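Research motivation and goals
- Use amino acid sequence information to guide the generation of 3D protein backbones.
- Develop an SE(3)^N-invariant generative model that handles multimodal data (structure + sequence).
- Scale up training with a large synthetic + PDB dataset to improve diversity and designability.
- Introduce Reinforced Finetuning (ReFT) to align generation with auxiliary rewards.
- Enable sequence-conditioned design tasks such as motif scaffolding and folding.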
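Proposed method
- Use SE(3)^N-invariant flow matching, with separate flows on SO(3) and R^3.
- Encode structure with an Invariant Point Attention (IPA) Transformer and sequence with a large pretrained protein language model (ESM2-650M).
- Fuse the structure and sequence representations in a multimodal trunk before the geometric decoder.
- Train with a masking strategy: the full sequence is provided 50% of the time and masked the other 50%, so the model also learns unconditional generation.
- Build a large filtered AlphaFold2/SwissProt dataset (about 160K structures) using staged quality filtering.
- Apply Reinforced Finetuning (ReFT) with auxiliary rewards to steer generation toward desired properties.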
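To make the SO(3)/R^3 split and the sequence masking more concrete, here is a minimal NumPy sketch of how one training example might be prepared. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `<mask>` placeholder, the convention that t=0 is the noise sample and t=1 the data sample, and the choice to hide the entire sequence when masking are all hypothetical.

```python
import numpy as np

MASK_TOKEN = "<mask>"  # hypothetical placeholder, not the model's real vocabulary


def interpolate_translation(x0, x1, t):
    """Straight-line path in R^3 and its constant target velocity."""
    x_t = (1.0 - t) * x0 + t * x1
    u_t = x1 - x0
    return x_t, u_t


def hat(w):
    """Map a 3-vector to its skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])


def so3_exp(w):
    """Rodrigues formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-8:
        return np.eye(3) + K  # first-order approximation near the identity
    return (np.eye(3)
            + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta**2 * (K @ K))


def so3_log(R):
    """Rotation matrix -> rotation vector (valid for angles below pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    # Antisymmetric part of R, which equals sin(theta) * axis.
    axis_scaled = np.array([R[2, 1] - R[1, 2],
                            R[0, 2] - R[2, 0],
                            R[1, 0] - R[0, 1]]) / 2.0
    if theta < 1e-8:
        return axis_scaled
    return theta / np.sin(theta) * axis_scaled


def interpolate_rotation(r0, r1, t):
    """Geodesic on SO(3) from r0 (t=0) to r1 (t=1)."""
    return r0 @ so3_exp(t * so3_log(r0.T @ r1))


def maybe_mask_sequence(tokens, rng, p_mask=0.5):
    """With probability p_mask, hide the whole sequence (assumed masking scheme)."""
    if rng.random() < p_mask:
        return [MASK_TOKEN] * len(tokens)
    return list(tokens)
```

In this sketch each residue frame would be interpolated independently: translations along a straight line in R^3 and rotations along a geodesic on SO(3). A network would then be trained to regress the corresponding velocity, receiving either the true sequence or the masked one.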
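Experimental results
Research questions
- RQ1: Can a sequence-conditioned SE(3) flow model generate diverse and designable protein backbones?
- RQ2: How much does sequence conditioning affect the quality and diversity of unconditional generation?
- RQ3: Can the model perform conditional tasks such as motif scaffolding, folding, and inpainting?
- RQ4: How does Reinforced Finetuning (ReFT) affect secondary structure diversity and motif scaffolding performance?
- RQ5: How does FoldFlow-2 compare with state-of-the-art unconditional and conditional protein backbone generators?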
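Key findings
- FoldFlow-2 achieves state-of-the-art unconditional generation, outperforming RFDiffusion and FoldFlow on designability, novelty, and diversity.
- FoldFlow-2 narrows the gap to folding models such as ESMFold and outperforms MultiFlow on folding-related metrics.
- ReFT-based finetuning increases secondary structure diversity and improves conditional design (motif scaffolding, VHH scaffolding).
- On the motif scaffolding benchmark, FoldFlow-2 (+FT) solves 24/24 motifs, and its VHH scaffolding results are competitive.
- On zero-shot equilibrium conformation sampling, FoldFlow-2 is competitive while using fewer parameters and less compute than models tuned on MD data.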
This explainer was generated by AI and reviewed by a human editor.