QUICK REVIEW

[Paper Review] Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval

Pascal Notin, Mafalda Dias|arXiv (Cornell University)|May 27, 2022

Genomics and Phylogenetic Studies | Cited by 123

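ONE-SENTENCE SUMMARY

Tranception is an autoregressive transformer that retrieves homologous sequences at inference time to improve protein fitness prediction, with the largest gains on shallow alignments and on insertions/deletions (indels).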

ABSTRACT

The ability to accurately model the fitness landscape of protein sequences is critical to a wide range of applications, from quantifying the effects of human variants on disease likelihood, to predicting immune-escape mutations in viruses and designing novel biotherapeutic proteins. Deep generative models of protein sequences trained on multiple sequence alignments have been the most successful approaches so far to address these tasks. The performance of these methods is however contingent on the availability of sufficiently deep and diverse alignments for reliable training. Their potential scope is thus limited by the fact many protein families are hard, if not impossible, to align. Large language models trained on massive quantities of non-aligned protein sequences from diverse families address these problems and show potential to eventually bridge the performance gap. We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieval of homologous sequences at inference to achieve state-of-the-art fitness prediction performance. Given its markedly higher performance on multiple mutants, robustness to shallow alignments and ability to score indels, our approach offers significant gain of scope over existing approaches. To enable more rigorous model testing across a broader range of protein families, we develop ProteinGym -- an extensive set of multiplexed assays of variant effects, substantially increasing both the number and diversity of assays compared to existing benchmarks.

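MOTIVATION AND OBJECTIVES

Accurately model protein fitness landscapes across diverse protein families, including proteins that are hard to align or that contain disordered regions.
Develop a protein language model trained without MSAs that can still incorporate homology information through retrieval at inference time.
Improve prediction for substitutions, insertions, and deletions, with robust performance across taxa.
Provide a large, diverse benchmark (ProteinGym) for rigorously evaluating mutation-effect prediction across many assays.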

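PROPOSED METHOD

Introduces Tranception, an autoregressive transformer with a specialized attention mechanism (Tranception attention) that uses grouped convolutions of several kernel sizes to capture k-mer patterns of different lengths.
Replaces standard positional encodings with Grouped ALiBi, giving head-level distance-aware attention and longer-context modeling.
Trains on non-aligned UniRef sequences (700M-parameter model; context size 1024) and applies sequence mirroring to improve bidirectional scoring.
Scores fitness as the log-likelihood ratio between the variant sequence and the wild-type sequence (Equation 2).
At inference, combines the autoregressive prediction (P_A) with a retrieval prediction (P_R) derived from an MSA retrieved at inference time (Equations 3 and 4).
Derives per-position amino-acid distributions from the retrieved MSA, using pseudocounts (Laplace smoothing) and sequence reweighting to correct for sampling bias (Hopf et al. 2017).
Handles indels by adapting retrieval to the columns of the retrieved MSA and relying on the autoregressive model alone at newly inserted positions, then averages the left-to-right and right-to-left scores for stability.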
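
The scoring steps above (a smoothed, reweighted retrieval distribution P_R, its combination with P_A, and the log-likelihood ratio against the wild type) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a uniform stand-in for the autoregressive model, a weighted geometric mean as the combination rule, and an illustrative weight `alpha`; every function name here is hypothetical. It handles same-length substitutions only and omits mirroring and indel handling.

```python
import math
from typing import Dict, List, Sequence

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def retrieval_distribution(column: Sequence[str], weights: Sequence[float],
                           pseudocount: float = 1.0) -> Dict[str, float]:
    """P_R for one MSA column: weighted counts with Laplace smoothing."""
    counts = {aa: pseudocount for aa in ALPHABET}
    for aa, w in zip(column, weights):
        if aa in counts:  # gaps and non-standard symbols are skipped
            counts[aa] += w
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}


def uniform_log_probs() -> Dict[str, float]:
    """Hypothetical stand-in for P_A; a real model conditions on the prefix."""
    return {aa: math.log(1.0 / len(ALPHABET)) for aa in ALPHABET}


def combine_log_probs(log_p_a: Dict[str, float], p_r: Dict[str, float],
                      alpha: float) -> Dict[str, float]:
    """Assumed rule: weighted geometric mean of P_A and P_R, renormalized."""
    mixed = {aa: (1.0 - alpha) * log_p_a[aa] + alpha * math.log(p_r[aa])
             for aa in ALPHABET}
    log_z = math.log(sum(math.exp(v) for v in mixed.values()))
    return {aa: v - log_z for aa, v in mixed.items()}


def sequence_log_likelihood(seq: str,
                            per_position: List[Dict[str, float]]) -> float:
    """Sum of per-position log-probabilities for the residues in seq."""
    return sum(log_probs[aa] for aa, log_probs in zip(seq, per_position))


def score_variant(mutant: str, wild_type: str, msa: Sequence[str],
                  weights: Sequence[float], alpha: float = 0.6) -> float:
    """Log-likelihood ratio of mutant versus wild type (substitutions only)."""
    columns = list(zip(*msa))  # transpose the alignment into columns
    per_position = [
        combine_log_probs(uniform_log_probs(),
                          retrieval_distribution(col, weights), alpha)
        for col in columns
    ]
    return (sequence_log_likelihood(mutant, per_position)
            - sequence_log_likelihood(wild_type, per_position))


# Toy alignment: four homologs, three columns; weights down-weight two of them.
toy_msa = ["MKV", "MKV", "MRV", "MKI"]
toy_weights = [1.0, 1.0, 0.5, 0.5]
print(score_variant("MKL", "MKV", toy_msa, toy_weights))
```

In this toy setting the score is driven entirely by the retrieved alignment, because the stand-in for P_A is uniform. A substitution to a residue never observed in its column therefore scores below the wild type, and the wild type scores exactly zero against itself.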

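EXPERIMENTAL RESULTS

Research questions

RQ1: Can an autoregressive transformer trained on non-aligned sequences reach state-of-the-art protein fitness prediction without relying on MSAs during training?
RQ2: Does retrieving homologous sequences at inference time improve predictions, especially for proteins with shallow or no MSAs and for indels?
RQ3: How does Tranception compare with alignment-based models and other protein language models on substitutions, multiple mutants, indels, and across taxa?
RQ4: Is the model robust to MSA depth, and can it score hard-to-align or disordered regions?
RQ5: What benchmark can comprehensively evaluate mutation effects across a broad range of assays and taxa (ProteinGym)?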

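Key findings

Tranception with retrieval outperforms all baselines on the ProteinGym substitution and indel benchmarks.
Retrieval substantially improves performance, with the largest gains for proteins with shallow MSAs and for multiple mutants.
Without retrieval, Tranception already outperforms the non-retrieval baselines, including competitive alignment-free models; with retrieval, it also outperforms alignment-based methods.
The model is robust to MSA depth and can score poorly aligned or disordered regions, covering a wide range of proteins (e.g., BRCA1 and viral proteins).
Tranception extrapolates well in sequence space, with larger gains on multiple mutants than on single mutants.
ProteinGym provides a broad, diverse benchmark (including indels) that shows a clear advantage for Tranception over earlier methods.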

This review was generated by AI and reviewed by a human editor.