QUICK REVIEW

[Paper Review] xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

Bo Chen, Xingyi Cheng|arXiv (Cornell University)|Jan 11, 2024

Machine Learning in Bioinformatics|Cited by 9

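One-Sentence Summary

Proposes a unified protein language model that learns understanding and generation jointly, scales to 100 billion parameters and 1 trillion training tokens, achieves strong results on 18 protein understanding benchmarks, and supports PLM-based 3D structure prediction and controllable sequence generation.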

ABSTRACT

Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility and the potential for joint optimization of the two types of objectives, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal that 1) xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories. The model also facilitates an atomic-resolution view of protein structures, leading to an advanced 3D structural prediction model that surpasses existing language model-based tools. 2) xTrimoPGLM not only can generate de novo protein sequences following the principles of natural ones, but also can perform programmable generation after supervised fine-tuning (SFT) on curated sequences. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science.

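Motivation and Objectives

Advocate a unified framework for proteins that combines autoencoding and autoregressive objectives.
Scale the unified protein language model to 100B parameters and 1T training tokens.
Demonstrate improvements on protein understanding benchmarks and enable advanced structure prediction and generation.
Demonstrate a faster, PLM-based path to single-sequence structure prediction and controllable sequence generation.
Discuss the limitations and practical considerations of deploying large protein foundation models.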

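Proposed Method

Use the General Language Model (GLM) as the backbone, combining bidirectional attention with an autoregressive objective.
Introduce an MLM objective over the bidirectional prefix region to improve understanding.
Pre-train in two stages: first MLM alone on roughly 400B tokens, then unified MLM + GLM training on roughly 600B tokens at a 20%/80% ratio.
Train xTrimoPGLM-100B (100B parameters) on roughly 940M unique sequences (about 200B residues) using 96 NVIDIA DGX systems with A100 GPUs.
Develop xTrimoPGLM-Fold (xT-Fold) for single-sequence structure prediction by integrating a folding module with PLM representations, using 4-bit quantization and FlashAttention.
Generate protein sequences via supervised fine-tuning (SFT) and Reinforced Self-Training (ReST), aligning outputs with target properties.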
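
The two-stage schedule above can be illustrated with a small sketch. This is not the authors' training code: the helper names `pick_objective` and `token_budget` are hypothetical, the token counts are the approximate figures from this summary, and treating the 20%/80% split as a per-batch sampling probability is an assumption made only for illustration.

```python
import random

# Approximate figures taken from the summary above.
STAGE1_TOKENS = 400e9      # stage 1: MLM only
STAGE2_TOKENS = 600e9      # stage 2: unified MLM + GLM
MLM_SHARE_STAGE2 = 0.20    # 20% MLM / 80% GLM during stage 2


def pick_objective(tokens_seen: float, rng: random.Random) -> str:
    """Hypothetical helper: choose the objective for the next batch.

    Assumes the 20/80 ratio is realized by sampling per batch,
    which the summary does not specify.
    """
    if tokens_seen < STAGE1_TOKENS:
        return "mlm"
    return "mlm" if rng.random() < MLM_SHARE_STAGE2 else "glm"


def token_budget() -> dict:
    """Split the overall token budget by objective."""
    total = STAGE1_TOKENS + STAGE2_TOKENS
    mlm = STAGE1_TOKENS + MLM_SHARE_STAGE2 * STAGE2_TOKENS
    glm = (1.0 - MLM_SHARE_STAGE2) * STAGE2_TOKENS
    return {"total": total, "mlm": mlm, "glm": glm}


if __name__ == "__main__":
    rng = random.Random(0)
    print(pick_objective(100e9, rng))   # still inside stage 1
    print(token_budget())
```

Under these assumptions the budget adds up to the 1T tokens quoted in the summary, with roughly 520B tokens seen under MLM and 480B under the GLM objective.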

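Experimental Results

Research Questions

RQ1: Can a unified pre-training objective support both protein understanding and generation tasks?
RQ2: How does scaling to 100B parameters and 1T tokens affect performance on protein understanding benchmarks?
RQ3: Can a PLM-based approach compete with MSA-based methods on single-sequence structure prediction (xT-Fold)?
RQ4: What is the potential of SFT and ReST for programmable generation and controllable protein synthesis?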

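Key Findings

xTrimoPGLM-100B outperforms SOTA baselines on 15 of 18 protein understanding tasks across four categories.
On two out-of-distribution (OOD) protein sets, the model achieves lower perplexity than comparison models such as ESM2-15B and ProGen2-xlarge.
xT-Fold reaches a TM-score of 0.86 on CAMEO and 0.70 on CASP15, outperforming some PLM-based competitors and approaching MSA-augmented methods.
Generated proteins show diverse structures with high prediction confidence (median pLDDT of about 85.4) and low sequence similarity to PDB entries, suggesting exploration of novel folds.
SFT and ReST allow generated sequences to be controllably aligned with desired properties, often outperforming ProGen2 and ProtGPT2 under the same protocol.
The xTrimoPGLM framework scales well: larger models tend to perform better, with especially notable gains on complex tasks.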
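
Two of the metrics quoted above, perplexity and TM-score, have standard definitions that a short sketch can make concrete. The code below is not the paper's evaluation pipeline. It assumes per-token natural-log probabilities are already available, and it scores a single fixed superposition, whereas the full TM-score takes the maximum over superpositions. The function names are hypothetical, and the 0.5 Å floor on `d0` for short chains is an assumed convention.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Exponential of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """TM-score of one fixed superposition.

    distances: distance in angstroms for each aligned residue pair.
    target_length: number of residues in the reference structure.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    if target_length > 21:
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5  # assumed floor for short chains
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


if __name__ == "__main__":
    # Uniform guess over the 20 standard amino acids.
    uniform = [math.log(1.0 / 20.0)] * 50
    print(perplexity(uniform))
    # Perfectly superposed 100-residue structure.
    print(tm_score([0.0] * 100, 100))
```

Lower perplexity means the model assigns higher probability to held-out sequences, and a TM-score closer to 1 means the predicted and reference structures agree more closely. A model guessing uniformly over the 20 standard amino acids has a perplexity of 20, which gives a rough sense of scale for the OOD comparison.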


This review was generated by AI and reviewed by a human editor.