QUICK REVIEW

[Paper Review] PoET: A generative model of protein families as sequences-of-sequences

Tristan Bepler, Timothy F. Truong|arXiv (Cornell University)|Jun 9, 2023

Genomics and Phylogenetic Studies | Cited by 26

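One-Sentence Summary

PoET is an autoregressive Transformer that models entire protein families as sequences-of-sequences, enabling retrieval-augmented conditioning, indel-aware generation, and improved variant fitness prediction across diverse protein families without relying on MSAs.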

ABSTRACT

Generative protein language models are a natural way to design new proteins with desired functions. However, current models are either difficult to direct to produce a protein from a specific family of interest, or must be trained on a large multiple sequence alignment (MSA) from the specific family of interest, making them unable to benefit from transfer learning across families. To address this, we propose $\textbf{P}$r$\textbf{o}$tein $\textbf{E}$volutionary $\textbf{T}$ransformer (PoET), an autoregressive generative model of whole protein families that learns to generate sets of related proteins as sequences-of-sequences across tens of millions of natural protein sequence clusters. PoET can be used as a retrieval-augmented language model to generate and score arbitrary modifications conditioned on any protein family of interest, and can extrapolate from short context lengths to generalize well even for small families. This is enabled by a unique Transformer layer; we model tokens sequentially within sequences while attending between sequences order invariantly, allowing PoET to scale to context lengths beyond those used during training. In extensive experiments on deep mutational scanning datasets, we show that PoET outperforms existing protein language models and evolutionary sequence models for variant function prediction across proteins of all MSA depths. We also demonstrate PoET's ability to controllably generate new protein sequences.

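Motivation and Objectives

Advance protein design by modeling evolutionary constraints across a large number of families without relying on MSAs.
Develop a scalable, order-invariant Transformer architecture that generates sets of related proteins as sequences-of-sequences.
Enable retrieval-based conditioning and efficient scoring and generation of variant sequences, including indels.
Demonstrate improved variant fitness prediction on deep mutational scanning datasets, and show the ability to generate novel, structurally plausible sequences.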

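Proposed Method

Introduces PoET, an autoregressive model of the sequence-of-sequences distribution P(X=x), factorized as a product over sequences and over the tokens within each sequence.
Proposes the TieredTransformerDecoderLayer, which contains two attention modules: PerSequenceSelfAttn (within-sequence self-attention) and SequenceOfSequencesSelfAttn (between-sequence self-attention), giving order invariance across sequences and order dependence within each sequence.
Uses Rotary Positional Encodings for within-sequence attention, together with a novel between-sequence relative positional encoding that ensures order invariance across sequences while preserving within-sequence structure.
Trains on 29 million sets of homologous sequences from UniRef50, using inverse count weighting to balance set sizes and randomizing sequence order to encourage invariance.
Conditions on a set of retrieved homologous sequences S to compute conditional fitness scores, enabling retrieval-augmented generation and scoring (for example, PoET's fitness prediction uses log P(v|S)).
Evaluates on the ProteinGym deep mutational scanning datasets against alignment-based methods and unconditional, conditional, and hybrid protein language models; shows that ensembling improves performance.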
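
The two-tier attention described above can be illustrated with attention masks. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and it assumes that within-sequence attention is causal inside each sequence, that between-sequence attention is causal over the concatenated sequence-of-sequences, and that position indices restart at zero in every sequence so that no index encodes where a sequence sits in the set.

```python
def tiered_layout(lengths):
    """Return (seq_ids, pos_ids) for sequences concatenated end to end.

    pos_ids restart at 0 in each sequence (an assumption of this sketch),
    so they carry no information about the order of the sequences.
    """
    seq_ids = [s for s, n in enumerate(lengths) for _ in range(n)]
    pos_ids = [p for n in lengths for p in range(n)]
    return seq_ids, pos_ids


def within_sequence_mask(seq_ids):
    """mask[i][j] is True when token i may attend to token j:
    same sequence, and j is not after i (causal)."""
    n = len(seq_ids)
    return [[seq_ids[i] == seq_ids[j] and j <= i for j in range(n)]
            for i in range(n)]


def between_sequence_mask(num_tokens):
    """mask[i][j] is True when token i may attend to token j anywhere
    earlier in the concatenated sequence-of-sequences (or to itself)."""
    return [[j <= i for j in range(num_tokens)] for i in range(num_tokens)]


def context_keys(sequences):
    """Sorted (within-sequence position, residue) pairs: what a later token
    can see of the context if keys carry only within-sequence positions."""
    _, pos_ids = tiered_layout([len(s) for s in sequences])
    return sorted(zip(pos_ids, "".join(sequences)))


if __name__ == "__main__":
    seq_ids, pos_ids = tiered_layout([3, 2])
    print(seq_ids, pos_ids)
    # Swapping the context sequences leaves the set of keys unchanged.
    print(context_keys(["MKV", "MR"]) == context_keys(["MR", "MKV"]))
```

Because attention sums over keys without regard to their storage order, keys that carry only within-sequence positions give a later token the same view of the context however the earlier sequences are arranged. That is the sense in which this sketch is order invariant between sequences while remaining order dependent within each one.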

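Experimental Results

Research Questions

RQ1: Can PoET generalize evolutionary constraints learned from millions of protein sequence clusters to improve variant fitness prediction for both small and large families?
RQ2: Does a sequence-of-sequences Transformer with order-invariant attention between sequences outperform MSA-based or unconditional models at variant effect prediction and indel handling?
RQ3: Can PoET serve as a retrieval-augmented language model that conditions generation and scoring on a target family without requiring MSAs?
RQ4: How well does PoET generate novel, structurally plausible sequences that preserve family characteristics?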

Key Findings

Model type	Model name	# Params	Low MSA depth	Medium MSA depth	High MSA depth	All depths	Indels
Alignment-based	Site independent	N/A	0.417	0.404	0.411	0.408	N/A
Alignment-based	GEMME	N/A	0.445	0.449	0.522	0.463	N/A
Alignment-based	EVE (ensemble)	N/A	0.414	0.441	0.498	0.448	N/A
Unconditional PLM	ESM-1v (ensemble)	3.25B	0.356	0.372	0.510	0.398	N/A
Unconditional PLM	ProGen2 (ensemble)	10.8B	0.357	0.416	0.448	0.411	0.407
Unconditional PLM	Tranception L (no retrieval)	700M	0.377	0.399	0.429	0.401	0.430
Conditional	MSA Transformer (ensemble)	100M	0.372	0.421	0.477	0.423	N/A
Conditional	PoET (ensemble)	201M	0.476	0.466	0.542	0.484	0.510
Hybrid	Tranception L	700M	0.441	0.437	0.472	0.445	0.464
Hybrid	TranceptEVE M	300M	-	-	-	-	0.516
Hybrid	TranceptEVE L	700M	0.454	0.463	0.508	0.471	0.466
Hybrid	PoET (ensemble) + TranceptEVE L	901M	0.479	0.480	0.537	0.492	0.521
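
The review does not name the metric behind these numbers. ProteinGym results are commonly reported as Spearman rank correlation between model scores and measured fitness, so the sketch below assumes that reading. The function names and the example values are hypothetical and are not data from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(model_scores, measured_fitness):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(model_scores), average_ranks(measured_fitness))


if __name__ == "__main__":
    # Made-up scores for four variants, for illustration only.
    scores = [-12.3, -9.8, -15.1, -10.4]
    fitness = [0.40, 0.90, 0.10, 0.35]
    print(round(spearman(scores, fitness), 3))
```

Under this reading, a higher value means the model orders variants more like the experiment does; the absolute scale of the model's scores does not matter.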

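PoET achieves state-of-the-art or competitive variant fitness prediction on the ProteinGym datasets, improving substitution prediction at all MSA depths.
Ensembling PoET with TranceptEVE L improves substitution prediction over either method alone (0.492 across all MSA depths, versus 0.484 for PoET and 0.471 for TranceptEVE L).
On indel variant prediction, PoET (0.510) is ahead of every baseline in the table except TranceptEVE M (0.516), and it can score and generate insertions and deletions that do not appear in a training MSA.
Longer context lengths (up to the thousands-of-tokens range) let PoET observe more homologous sequences and improve performance; PoET generalizes well beyond the context lengths used in training.
PoET generates diverse, novel sequences that remain structurally plausible (high pLDDT, with TM-scores clustering near the native-like fold) and preserve family-level structural integrity.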
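
Scoring substitutions and indels with a single conditional likelihood can be sketched as follows. The score is log P(v|S) = sum over i of log p(v_i | v_<i, S), which is defined for a variant of any length, so insertions and deletions need no alignment. The model here is a toy stand-in (smoothed residue frequencies from the retrieved homologs), not PoET; the function names, the stop token, and the example sequences are all hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
STOP = "$"  # hypothetical end-of-sequence token


def toy_next_token_logprobs(homologs, prefix):
    """Toy stand-in for a trained model: add-one-smoothed token frequencies
    from the retrieved homologs. It ignores the prefix entirely, which a
    real autoregressive model would not."""
    counts = Counter(ch for seq in homologs for ch in seq + STOP)
    vocab = ALPHABET + STOP
    total = sum(counts[t] + 1 for t in vocab)
    return {t: math.log((counts[t] + 1) / total) for t in vocab}


def log_likelihood(variant, homologs, next_token_logprobs=toy_next_token_logprobs):
    """log P(v | S): sum of per-token conditional log-probabilities,
    including the stop token that ends the variant."""
    tokens = variant + STOP
    return sum(next_token_logprobs(homologs, tokens[:i])[tok]
               for i, tok in enumerate(tokens))


if __name__ == "__main__":
    homologs = ["MKV", "MKI", "MRV"]  # made-up retrieved set S
    for name, variant in [("wild type", "MKV"),
                          ("substitution", "MKW"),
                          ("insertion", "MKKV"),
                          ("deletion", "MV")]:
        print(name, round(log_likelihood(variant, homologs), 3))
```

Swapping `toy_next_token_logprobs` for a real conditional model keeps the scoring loop unchanged. In practice the difference between a variant's score and the wild type's score is often used rather than the raw value, since raw log-likelihoods shrink as sequences get longer.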

This review was generated by AI and checked by human editors.