[Paper Explained] Diffusion Language Models Are Versatile Protein Learners
DPLM is a discrete diffusion-based protein language model pre-trained on evolutionary-scale sequences that can generate novel sequences and serve as a strong representation learner for downstream predictive tasks, with versatile conditioning options including partial sequence, cross-modal, and classifier-guided generation.
This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences. We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework, which generalizes language modeling for proteins in a principled way. After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation. We further demonstrate that the proposed diffusion generative pre-training gives DPLM a better understanding of proteins, making it a superior representation learner, which can be fine-tuned for various predictive tasks, comparing favorably to ESM2 (Lin et al., 2022). Moreover, DPLM can be tailored for various needs, which showcases its prowess in conditional generation in several ways: (1) conditioning on partial peptide sequences, e.g., generating scaffolds for functional motifs with high success rate; (2) incorporating other modalities as conditioner, e.g., structure-conditioned generation for inverse folding; and (3) steering sequence generation towards desired properties, e.g., satisfying specified secondary structures, through a plug-and-play classifier guidance. Code is released at https://github.com/bytedance/dplm.
Research Motivation and Goals
- Motivate the need for a versatile protein LM that combines strong generative and predictive capabilities.
- Propose discrete diffusion pre-training to unify generation and understanding for protein sequences.
- Demonstrate DPLM's ability to generate structurally plausible, novel proteins and serve as superior representations for downstream tasks.
- Showcase conditioning modalities including partial-sequence conditioning, cross-modal conditioning, and plug-and-play guidance for controllable generation.
Proposed Method
- Adopt a discrete diffusion probabilistic framework operating over protein sequences as a principled generalization of language modeling.
- Define the forward diffusion as a categorical transition q(x^(t) | x^(t-1)) under a noise schedule, with an absorbing [X] state as the noise target, so that corruption mimics masking.
- Use a reparameterized backward denoising objective that generalizes language modeling, with masked-LM and autoregressive LM arising as special cases (Equation 4).
- Pre-train on UniRef50 (~45M sequences, ~14B tokens), with model scales up to 3B parameters, following a two-stage strategy (masked LM pretraining then diffusion objective).
- Enable generation via iterative denoising from fully noised starts, akin to mask-predict sampling.
- Introduce flexible conditioning: sequence conditioning, cross-modal conditioning with adapters, and discrete classifier-guided conditioning for controllable generation.
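The absorbing-state corruption and the iterative, mask-predict-style denoising described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the linear masking schedule, the helper names (`forward_mask`, `toy_denoiser`, `mask_predict_sample`), and the use of `"X"` as the mask symbol are all assumptions made for illustration, and `toy_denoiser` is a random stand-in for the trained network. The optional `fixed` argument sketches partial-sequence conditioning by keeping given residues untouched.

```python
import random

MASK = "X"  # stand-in for the absorbing [X] state (assumed symbol)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def forward_mask(seq, t, T, rng):
    """Absorbing-state corruption: each residue is independently replaced by
    MASK with probability t / T (a linear schedule, assumed for illustration)."""
    rate = t / T
    return [MASK if rng.random() < rate else aa for aa in seq]

def toy_denoiser(tokens, rng):
    """Hypothetical stand-in for the network: returns a (residue, confidence)
    guess per position. A real model would condition on `tokens`."""
    return [(rng.choice(AMINO_ACIDS), rng.random()) for _ in tokens]

def mask_predict_sample(length, steps, rng, fixed=None):
    """Iterative denoising from a fully masked start.
    `fixed` maps position -> residue for partial-sequence conditioning."""
    fixed = fixed or {}
    tokens = [fixed.get(i, MASK) for i in range(length)]
    free = [i for i in range(length) if i not in fixed]
    for step in range(1, steps + 1):
        preds = toy_denoiser(tokens, rng)
        # Number of free positions left unmasked after this step.
        n_keep = round(len(free) * step / steps)
        ranked = sorted(free, key=lambda i: preds[i][1], reverse=True)
        keep = set(ranked[:n_keep])
        # Commit the most confident predictions and re-mask the rest.
        for i in free:
            tokens[i] = preds[i][0] if i in keep else MASK
    return "".join(tokens)
```

For example, `mask_predict_sample(50, steps=10, rng=random.Random(0), fixed={10: "H", 11: "H"})` returns a 50-residue string whose positions 10 and 11 stay fixed. With a trained denoiser in place of `toy_denoiser`, the same loop would progressively commit the model's most confident residues at each step.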
Experimental Results
Research Questions
- RQ1: Can discrete diffusion pre-training yield a unified model that excels at both generation and understanding of protein sequences?
- RQ2: How does DPLM's performance on downstream tasks compare to established protein LMs like ESM2 after diffusion-based pre-training?
- RQ3: What conditioning mechanisms (partial sequence, cross-modal, classifier-guided) enable practical and controllable protein sequence design?
- RQ4: Does diffusion-based pre-training yield structurally plausible, novel, and diverse protein sequences across lengths?
- RQ5: Can DPLM serve as a robust representation learner while enabling high-quality conditional generation?
Key Findings
- DPLM generates structurally plausible, novel, and diverse protein sequences with high foldability (as measured by pLDDT) across lengths, and quality improves with model scale.
- DPLM provides superior representations for downstream predictive tasks compared with ESM2, approaching structure-aware models in some settings.
- Larger DPLM models yield better performance on unconditional generation and downstream tasks, indicating a scaling law for protein LMs.
- DPLM supports conditional generation via motif scaffolding, cross-modal conditioning (e.g., structure-conditioned generation), and plug-and-play classifier guidance to steer properties like secondary structure.
- Discrete diffusion proves more effective than Masked-LM and AR-LM for protein sequence generation and representation learning, and the two-stage training strategy further improves generation quality.
- Motif-scaffolding experiments show DPLM achieving higher success rates and better motif preservation than baselines, with structure-aware conditioning providing further gains.
This explainer was generated by AI and reviewed by human editors.