QUICK REVIEW

[Paper Review] Diffusion on language model encodings for protein sequence generation

Viacheslav Meshchaninov, П. В. Страшнов | arXiv (Cornell University) | 2024. 03. 06.

Topic Modeling | Citations: 7

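One-line summary

DiMA applies continuous diffusion to the embeddings of the ESM-2 protein language model to generate amino acid sequences. It outperforms autoregressive and discrete diffusion baselines in quality and diversity, and the paper backs this with extensive ablations and a biological relevance analysis.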

ABSTRACT

Protein sequence design has seen significant advances through discrete diffusion and autoregressive approaches, yet the potential of continuous diffusion remains underexplored. Here, we present DiMA, a latent diffusion framework that operates on protein language model representations. Through systematic exploration of architectural choices and diffusion components, we develop a robust methodology that generalizes across multiple protein encoders ranging from 8M to 3B parameters. We demonstrate that our framework achieves consistently high performance across sequence-only (ESM-2, ESMc), dual-decodable (CHEAP), and multimodal (SaProt) representations using the same architecture and training approach. We extensively evaluate existing methods alongside DiMA using multiple metrics across two protein modalities, covering quality, diversity, novelty, and distribution matching of generated proteins. DiMA consistently produces novel, high-quality and diverse protein sequences and achieves strong results compared to baselines such as autoregressive, discrete diffusion and flow matching language models. The model demonstrates versatile functionality, supporting conditional generation tasks including protein family generation, motif scaffolding and infilling, and fold-specific sequence design. This work provides a universal continuous diffusion framework for protein sequence generation, offering both architectural insights and practical applicability across various protein design scenarios. Code is released at https://github.com/MeshchaninovViacheslav/DiMA.

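Motivation and objectives

Motivates unconditional protein sequence generation as the foundation for conditional design across the protein universe.
Proposes DiMA, a diffusion model that operates on pLM embeddings to generate protein sequences.
Evaluates generation quality, diversity, distribution similarity, and biological relevance at both the sequence and structure level.
Compares against autoregressive and discrete diffusion baselines and runs ablations to identify the key design choices.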

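Proposed method

Encode protein sequences with the pretrained ESM-2 protein language model to obtain latent embeddings.
Train a continuous diffusion denoising model in the latent space to reconstruct corrupted embeddings.
Use a decoder to map latent embeddings back to amino acid sequences.
Introduce self-conditioning, which reuses the previous z0 prediction during sampling, and apply a stop-gradient to that prediction during training.
Sample sequence lengths from the empirical distribution observed in the training data, and denormalize the latent vectors before decoding.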
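The sampling side of these steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the deterministic DDIM-style update, the zero-initialized first self-conditioning input, the variance-preserving convention z_t = sqrt(alpha_t) * z0 + sqrt(1 - alpha_t) * eps, and the names `sample_length`, `denormalize`, `sample_latents`, and `toy_denoiser` are all hypothetical. The training-time stop-gradient needs an autograd framework and is not shown, and neither is the decoder.

```python
import math
import random


def sample_length(train_lengths, rng):
    """Draw a sequence length from the empirical training-length distribution."""
    return rng.choice(train_lengths)


def denormalize(z, mean, std):
    """Undo per-dimension normalization (z * std + mean) before decoding."""
    return [[v * s + m for v, m, s in zip(row, mean, std)] for row in z]


def sample_latents(denoiser, length, dim, alphas, rng):
    """Deterministic DDIM-style sampler with self-conditioning (assumed form).

    alphas: signal-variance fractions ordered from noisy to clean; every value
            except the last must be < 1, and the last should be 1.0.
    denoiser(z_t, alpha_t, z0_prev) -> predicted clean latents z0_hat.
    """
    z = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]
    z0_prev = [[0.0] * dim for _ in range(length)]  # no previous estimate yet
    for a_t, a_s in zip(alphas, alphas[1:]):
        z0_hat = denoiser(z, a_t, z0_prev)
        new_z = []
        for row_z, row_0 in zip(z, z0_hat):
            new_row = []
            for v, v0 in zip(row_z, row_0):
                # Noise implied by the current latent and the z0 prediction.
                eps = (v - math.sqrt(a_t) * v0) / math.sqrt(1.0 - a_t)
                # Re-noise the prediction to the next (cleaner) level.
                new_row.append(math.sqrt(a_s) * v0 + math.sqrt(1.0 - a_s) * eps)
            new_z.append(new_row)
        z, z0_prev = new_z, z0_hat  # reuse this prediction at the next step
    return z


def toy_denoiser(z_t, alpha_t, z0_prev):
    """Hypothetical stand-in for the trained network: blends input and previous guess."""
    return [[0.5 * (math.sqrt(alpha_t) * v + p) for v, p in zip(rz, rp)]
            for rz, rp in zip(z_t, z0_prev)]


rng = random.Random(0)
length = sample_length([120, 254, 254, 310], rng)
z0 = sample_latents(toy_denoiser, length, 4, [0.05, 0.5, 0.9, 1.0], rng)
x = denormalize(z0, mean=[0.1] * 4, std=[2.0] * 4)
print(length, len(x), len(x[0]))
```

In a real pipeline `toy_denoiser` would be the trained latent denoiser and `x` would be passed to the decoder. The point of the sketch is only the data flow: each step feeds the previous z0 estimate back into the denoiser.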

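Experimental results

Research questions

RQ1: Can diffusion in the pLM latent space unconditionally generate high-quality, diverse protein sequences?
RQ2: How does DiMA compare with autoregressive and discrete diffusion baselines in sequence quality, diversity, and distribution similarity?
RQ3: How do architectural and training choices (self-conditioning, skip connections, time conditioning, the ESM encoder, the noise schedule) affect generation performance?
RQ4: Do the generated sequences show biological relevance in terms of structure, function, and domain annotations?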

Main results

Model	pLDDT (↑)	ESM-2 pppl (↓)	scPerplexity (↓)	TM-score (↑)	BLAST (↑)	FPD (↓)	MMD (↓)	OT (↓)
SwissProt Dataset	80.7	5.35	1.88	0.80	100	0.13	0.00	1.08
Random sequences	25.0	21.54	2.77	0.33	0	3.97	0.20	3.88
nanoGPT	61.0	8.18	2.04	0.63	43	1.24	0.03	2.53
EvoDiff-OADM	37.1	15.77	2.44	0.42	12	1.49	0.11	2.63
SeqDesign	43.1	11.89	2.35	0.41	17	3.53	0.19	5.12
proteinGAN	30.4	16.48	2.57	0.33	0	2.94	0.17	3.98
DiMA	80.8	5.20	1.80	0.85	68	0.41	0.01	1.41
w/o skip connections	77.0	5.84	1.87	0.82	61	0.48	0.02	1.51
w/o time layers	79.4	5.49	1.83	0.85	66	0.44	0.02	1.44
w/o ESM encoder	62.7	9.22	2.09	0.71	48	1.05	0.04	2.14
w/o self-conditioning	68.2	9.18	2.08	0.74	46	0.54	0.04	1.61
w linear schedule	77.0	6.29	1.89	0.82	58	0.50	0.02	1.51
w cosine schedule	54.1	10.86	2.16	0.60	34	0.97	0.06	2.02
AFDB Dataset	83.9	5.79	1.75	0.92	100	0.18	0.00	1.57
Random sequences	26.2	21.67	2.75	0.35	0	3.02	0.18	4.15
nanoGPT	68.5	8.21	1.94	0.77	40	0.62	0.02	1.99
DiMA	73.9	8.50	1.90	0.85	48	0.69	0.03	1.86
w/o self-conditioning	56.3	12.08	2.18	0.69	31	0.96	0.05	2.29
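
To make the ablation rows easier to compare, the short script below ranks the SwissProt variants by how much pLDDT they lose relative to the full DiMA model. The numbers are copied from the table above; the function name `plddt_drops` is only illustrative, and pLDDT is just one of the eight reported metrics.

```python
# pLDDT values copied from the SwissProt rows of the results table.
PLDDT = {
    "DiMA": 80.8,
    "w/o skip connections": 77.0,
    "w/o time layers": 79.4,
    "w/o ESM encoder": 62.7,
    "w/o self-conditioning": 68.2,
    "w linear schedule": 77.0,
    "w cosine schedule": 54.1,
}


def plddt_drops(scores, baseline="DiMA"):
    """Return (variant, drop) pairs sorted from largest to smallest pLDDT drop."""
    base = scores[baseline]
    drops = {name: round(base - value, 1)
             for name, value in scores.items() if name != baseline}
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)


for name, drop in plddt_drops(PLDDT):
    print(f"{name:24s} -{drop:.1f} pLDDT")
```

On this metric the largest drops come from switching to the cosine schedule, removing the ESM encoder, and removing self-conditioning, in that order. Removing the time layers costs the least.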

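DiMA outperforms the autoregressive and discrete diffusion baselines on many quality and diversity metrics on SwissProt and AFDBv4-90.
Self-conditioning and the use of the ESM-2 encoder are among the design choices with the largest effect on performance.
The SD-10 noise schedule from Simple Diffusion gives better quality and diversity than linear or cosine schedules for protein latent diffusion.
Generated sequences show biological relevance: a high share receive InterPro SUPERFAMILY annotations, and their IDR and secondary-structure profiles appear plausible.
DiMA maintains favorable distribution similarity on SwissProt in Fréchet ProtT5 distance (FPD) and related metrics, and remains competitive on AFDBv4-90.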
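
For readers unfamiliar with the schedule named above, here is a hedged sketch of one common reading of it. It assumes that "SD-10" refers to the shifted-cosine schedule from Simple Diffusion with shift parameter d = 10, written as alpha_t = 1 / (1 + d^2 * tan^2(pi * t / 2)), where alpha_t is the fraction of signal variance remaining at time t. This summary does not state the exact parameterization, so check the form and the value of d against the paper and the released code.

```python
import math


def alpha_sd(t, d=10.0):
    """Assumed shifted-cosine schedule: 1 / (1 + d^2 * tan^2(pi * t / 2))."""
    return 1.0 / (1.0 + (d * math.tan(0.5 * math.pi * t)) ** 2)


def alpha_cosine(t):
    """Standard cosine schedule, cos^2(pi * t / 2); equals alpha_sd with d = 1."""
    return math.cos(0.5 * math.pi * t) ** 2


# Compare how much signal each schedule keeps at a few time points.
for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"t={t:.1f}  cosine={alpha_cosine(t):.4f}  sd10={alpha_sd(t):.4f}")
```

Under this assumption, d > 1 lowers the signal-to-noise ratio by a factor of d^2 at every time step, so the denoiser sees much noisier latents for most of the trajectory than it would with the plain cosine schedule.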

This review was generated by AI and checked by a human editor.