QUICK REVIEW

[Paper Review] DPLM-2: A Multimodal Diffusion Protein Language Model

Xinyou Wang, Zaixiang Zheng|arXiv (Cornell University)|Oct 17, 2024

Machine Learning in Bioinformatics | Cited by 6

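One-sentence summary

DPLM-2 extends a discrete diffusion protein language model with a lookup-free structure tokenizer and a multimodal training objective to jointly model and generate protein sequences and structures. It supports unconditional co-generation as well as a range of conditional tasks.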

ABSTRACT

Proteins are essential macromolecules defined by their amino acid sequences, which determine their three-dimensional structures and, consequently, their functions in all living organisms. Therefore, generative protein modeling necessitates a multimodal approach to simultaneously model, understand, and generate both sequences and structures. However, existing methods typically use separate models for each modality, limiting their ability to capture the intricate relationships between sequence and structure. This results in suboptimal performance in tasks that require joint understanding and generation of both modalities. In this paper, we introduce DPLM-2, a multimodal protein foundation model that extends the discrete diffusion protein language model (DPLM) to accommodate both sequences and structures. To enable structural learning with the language model, 3D coordinates are converted to discrete tokens using a lookup-free quantization-based tokenizer. By training on both experimental and high-quality synthetic structures, DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals. We also implement an efficient warm-up strategy to exploit the connection between large-scale evolutionary data and structural inductive biases from pre-trained sequence-based protein language models. Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures, eliminating the need for a two-stage generation approach. Moreover, DPLM-2 demonstrates competitive performance in various conditional generation tasks, including folding, inverse folding, and scaffolding with multimodal motif inputs, as well as providing structure-aware representations for predictive tasks.

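Motivation and objectives

Motivate and address the need for unified modeling of protein sequence and structure.
Develop a multimodal protein foundation model that learns the joint distribution of sequence and structure.
Use a structure tokenizer that converts 3D coordinates into discrete tokens for language-model training.
Warm up from pre-trained sequence-based knowledge to strengthen structure learning.
Demonstrate unconditional co-generation and multiple conditional generation tasks using structure-aware representations.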

Proposed method

Extend the discrete diffusion protein language model (DPLM) to handle sequences and structures in a unified framework.
Introduce a lookup-free quantizer (LFQ) to tokenize 3D backbone coordinates into discrete structure tokens.
Concatenate structure tokens with amino acid sequences, aligning residue-level positions with shared encodings.
Apply modality-specific noise schedulers and a self-mixup training strategy to mitigate exposure bias in sequence diffusion.
Implement an efficient warm-up from a pre-trained sequence-based DPLM using LoRA to transfer evolutionary knowledge while preserving pre-trained parameters.
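
The lookup-free quantization step in the list above can be pictured with a small NumPy sketch. It assumes the common sign-based formulation of LFQ, in which each latent dimension is binarized and the resulting bits are read as an integer token id, so no nearest-neighbor codebook search is needed. The function names `lfq_tokenize` and `lfq_codes`, the 3-bit latent size, and the example values are hypothetical; the structure encoder that would produce the per-residue latents is not shown, and this is not the authors' implementation.

```python
import numpy as np


def lfq_tokenize(latents: np.ndarray) -> np.ndarray:
    """Map (num_residues, num_bits) real-valued latents to integer token ids.

    Each dimension is quantized by its sign, and the bits are combined as a
    binary number, so the token id is computed directly without a lookup.
    """
    bits = (latents > 0).astype(np.int64)        # 1 where positive, else 0
    weights = 2 ** np.arange(latents.shape[1])   # bit i contributes 2**i
    return bits @ weights


def lfq_codes(token_ids: np.ndarray, num_bits: int) -> np.ndarray:
    """Recover the {-1, +1} code vectors that correspond to token ids."""
    bits = (token_ids[:, None] >> np.arange(num_bits)) & 1
    return 2.0 * bits - 1.0


# Hypothetical per-residue latents for a 3-residue backbone, 3 bits each.
latents = np.array([
    [0.3, -1.2, 0.7],
    [-0.1, -0.5, -2.0],
    [1.0, 1.0, 1.0],
])
structure_tokens = lfq_tokenize(latents)
codes = lfq_codes(structure_tokens, num_bits=3)
```

With `num_bits` latent dimensions the implicit vocabulary has `2 ** num_bits` structure tokens, one per residue, which is what lets them be aligned position by position with the amino acid sequence.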

Experimental results

Research questions

RQ1: Can a single multimodal diffusion model jointly model and generate protein sequences and structures with high fidelity?
RQ2: How can structure information be learned effectively within a language-model framework?
RQ3: What are the benefits of multimodal conditioning for folding, inverse folding, and motif scaffolding tasks?
RQ4: Does pre-training on sequence data and data augmentation improve multimodal generation and diversity?

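Key findings

DPLM-2 generates compatible protein sequences and 3D structures simultaneously, without a two-stage cascade.
Trained on experimental and AlphaFold-predicted structures, the model learns the joint, marginal, and conditional distributions of sequence and structure.
DPLM-2 performs competitively on folding, inverse folding, and motif scaffolding with multimodal inputs.
Structure-aware representations from DPLM-2 improve predictive tasks as well as generation.
Warming up from the pre-trained sequence-based DPLM, together with data augmentation, substantially improves designability and diversity, especially for long proteins.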
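
One way to picture how a single model can cover the joint and conditional distributions listed above is to treat each task as a different masking pattern over the concatenated structure and sequence tokens. The sketch below assumes absorbing-state (mask-based) discrete diffusion with an independent noise level per modality. The `MASK` id, the helpers `corrupt` and `make_input`, and the toy token values are all hypothetical, and the denoising network itself is omitted, so this illustrates the idea rather than the paper's code.

```python
import numpy as np

MASK = -1  # hypothetical id standing in for the absorbing [MASK] state


def corrupt(tokens: np.ndarray, mask_ratio: float, rng: np.random.Generator) -> np.ndarray:
    """Replace a random fraction of the tokens with MASK."""
    noisy = np.asarray(tokens).copy()
    num_masked = int(round(mask_ratio * noisy.size))
    positions = rng.permutation(noisy.size)[:num_masked]
    noisy[positions] = MASK
    return noisy


def make_input(struct_tokens, seq_tokens, struct_ratio, seq_ratio, rng):
    """Noise each modality at its own level, then concatenate the two."""
    return np.concatenate([
        corrupt(struct_tokens, struct_ratio, rng),
        corrupt(seq_tokens, seq_ratio, rng),
    ])


rng = np.random.default_rng(0)
struct = np.array([5, 0, 7, 3])     # toy structure tokens, one per residue
seq = np.array([10, 11, 12, 13])    # toy amino acid tokens, one per residue

folding = make_input(struct, seq, 1.0, 0.0, rng)          # sequence given, predict structure
inverse_folding = make_input(struct, seq, 0.0, 1.0, rng)  # structure given, predict sequence
co_generation = make_input(struct, seq, 1.0, 1.0, rng)    # nothing given, generate both
partially_noised = make_input(struct, seq, 0.5, 0.5, rng) # an intermediate training-style input
```

Under this view, motif scaffolding would correspond to leaving the motif positions unmasked in one or both modalities and masking everything else.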


This review was generated by AI and checked by a human editor.