QUICK REVIEW

[Paper Review] xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

Bo Chen, Xingyi Cheng|arXiv (Cornell University)|Jan 11, 2024

Machine Learning in Bioinformatics|Cited by 9

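One-Sentence Summary

The paper proposes a unified protein language model that learns understanding and generation jointly, scales it to 100B parameters and 1T training tokens, reports strong results on 18 protein understanding benchmarks, and enables PLM-based 3D structure prediction and controllable sequence generation.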

ABSTRACT

Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility and the potential for joint optimization of the two types of objectives, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal that 1) xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories. The model also facilitates an atomic-resolution view of protein structures, leading to an advanced 3D structural prediction model that surpasses existing language model-based tools. 2) xTrimoPGLM not only can generate de novo protein sequences following the principles of natural ones, but also can perform programmable generation after supervised fine-tuning (SFT) on curated sequences. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science.

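Motivation and Objectives

Motivate a unified framework that combines autoencoding and autoregressive objectives for proteins.
Scale a unified protein language model to 100B parameters and 1T training tokens.
Show that the model improves results on protein understanding benchmarks and enables advanced structure prediction and generation.
Demonstrate a faster, PLM-based route to single-sequence structure prediction and controllable sequence generation.
Discuss limitations and practical considerations in deploying large-scale protein foundation models.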

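Proposed Method

Uses the General Language Model (GLM) as the backbone, combining bidirectional attention with an autoregressive objective.
Adds an MLM objective over the bidirectional prefix region to strengthen understanding.
Pre-training runs in two stages: first MLM alone on roughly 400B tokens, then a unified MLM+GLM objective mixed at 20%/80% on roughly 600B tokens.
xTrimoPGLM-100B (100B parameters) is trained on roughly 940M unique sequences (about 200B residues) using 96 NVIDIA DGX systems equipped with A100 GPUs.
For single-sequence structure prediction, a folding module is combined with the PLM representations to build xTrimoPGLM-Fold (xT-Fold), which uses 4-bit quantization and FlashAttention.
SFT (supervised fine-tuning) and ReST (Reinforced Self-Training) are used to generate protein sequences whose outputs are aligned with target properties.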
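
The two-stage schedule and the GLM-style span objective described above can be sketched roughly as follows. This is a minimal illustration, not the authors' training code: it reads the 20%/80% split as MLM/GLM in that order, and the function names, the special tokens (`[MASK]`, `<sop>`, `<eop>`), and the single-span blanking are all assumptions made for the example.

```python
import random

STAGE1_TOKENS = 400e9  # approximate MLM-only budget stated in the review
MLM_SHARE = 0.20       # assumed share of MLM samples in stage 2 (rest is GLM)


def choose_objective(tokens_seen, rng):
    """Pick the objective for the next sample under the two-stage schedule."""
    if tokens_seen < STAGE1_TOKENS:
        return "mlm"  # stage 1: MLM only
    # stage 2: mix MLM and GLM samples at the assumed 20%/80% ratio
    return "mlm" if rng.random() < MLM_SHARE else "glm"


def glm_example(seq, start, length, mask="[MASK]", sop="<sop>", eop="<eop>"):
    """Blank one span of `seq` (hypothetical helper).

    Returns the bidirectional prefix with the span replaced by a mask token,
    plus the decoder input and target for autoregressive span infilling.
    """
    span = list(seq[start:start + length])
    prefix = list(seq[:start]) + [mask] + list(seq[start + length:])
    decoder_input = [sop] + span
    target = span + [eop]
    return prefix, decoder_input, target


rng = random.Random(0)
print(choose_objective(1e9, rng))      # stage 1 always yields "mlm"
print(glm_example("MKTAYIAK", 2, 3))   # blanks the span "TAY"
```

The prefix is attended bidirectionally while the blanked span is predicted left to right, which is how a single backbone can serve both understanding and generation.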

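Experimental Results

Research Questions

RQ1: Can a unified pre-training objective support protein understanding and generation tasks at the same time?
RQ2: How does scaling to 100B parameters and 1T tokens affect performance on protein understanding benchmarks?
RQ3: Can a PLM-based approach deliver single-sequence structure prediction (xT-Fold) that is competitive with MSA-based methods?
RQ4: What is the potential of SFT and ReST for programmable generation and controllable protein synthesis?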

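Key Findings

xTrimoPGLM-100B outperforms SOTA baselines on 15 of 18 protein understanding tasks across four categories.
On two OOD datasets, it achieves lower perplexity than comparison models such as ESM2-15B and ProGen2-xlarge.
xT-Fold reaches TM-scores of 0.86 (CAMEO) and 0.70 (CASP15), outperforming some PLM-based competitors and approaching MSA-augmented methods.
Generated proteins show diverse structures with high prediction confidence (median pLDDT of about 85.4) and low sequence similarity to PDB entries, suggesting that the model explores novel folds.
SFT and ReST make it possible to steer generated sequences toward desired properties in a controllable way, in many cases outperforming ProGen2 and ProtGPT2 under the same protocol.
The xTrimoPGLM framework shows clear scaling behavior: larger models perform better, with the most pronounced gains on complex tasks.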
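
For readers unfamiliar with the perplexity comparison in the findings above, the metric is the exponential of the mean negative log-likelihood per token (here, per residue). The sketch below shows only the standard definition; the function name and the toy inputs are assumptions, and it says nothing about how the paper's evaluation pipeline is implemented.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities the model assigns to each token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy input: a model that spreads probability uniformly over the
# 20 standard amino acids assigns 1/20 to every residue.
uniform = [math.log(1 / 20)] * 10
print(perplexity(uniform))
```

A uniform guess over 20 amino acids gives a perplexity of about 20, so lower values mean the model finds held-out sequences less surprising.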

This review was generated by AI and checked by a human editor.