QUICK REVIEW

[Paper Review] Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation

Guillaume Huguet, James Vuckovic|arXiv (Cornell University)|2024. 05. 30.

Monoclonal and Polyclonal Antibodies Research | Citations: 6

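One-line summary

FoldFlow-2 is a sequence-conditioned SE(3)-equivariant flow-matching model for protein backbone generation. It reaches state-of-the-art unconditional generation and is effective on conditional design tasks, including motif scaffolding and zero-shot equilibrium conformation sampling.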

ABSTRACT

Proteins are essential for almost all biological processes and derive their diverse functions from complex 3D structures, which are in turn determined by their amino acid sequences. In this paper, we exploit the rich biological inductive bias of amino acid sequences and introduce FoldFlow-2, a novel sequence-conditioned SE(3)-equivariant flow matching model for protein structure generation. FoldFlow-2 presents substantial new architectural features over the previous FoldFlow family of models including a protein large language model to encode sequence, a new multi-modal fusion trunk that combines structure and sequence representations, and a geometric transformer based decoder. To increase diversity and novelty of generated samples -- crucial for de-novo drug design -- we train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works, containing both known proteins in PDB and high-quality synthetic structures achieved through filtering. We further demonstrate the ability to align FoldFlow-2 to arbitrary rewards, e.g. increasing secondary structures diversity, by introducing a Reinforced Finetuning (ReFT) objective. We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models, improving over RFDiffusion in terms of unconditional generation across all metrics including designability, diversity, and novelty across all protein lengths, as well as exhibiting generalization on the task of equilibrium conformation sampling. Finally, we demonstrate that a fine-tuned FoldFlow-2 makes progress on challenging conditional design tasks such as designing scaffolds for the VHH nanobody.

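Research motivation and goals

Use amino acid sequence information to guide 3D protein backbone generation.
Develop an SE(3)^N-invariant generative model that handles multi-modal data (structure + sequence).
Scale training to a large combined synthetic + PDB dataset to improve diversity and designability.
Introduce Reinforced Finetuning (ReFT) to align generated samples with auxiliary rewards.
Enable conditional design tasks such as motif scaffolding and folding through sequence conditioning.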

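Proposed method

Use SE(3)^N-invariant flow matching, implemented as separate flows on SO(3) (rotations) and R^3 (translations).
Encode structure with an IPA (Invariant Point Attention) transformer and sequence with a large pretrained protein language model (ESM2-650M).
Fuse the structure and sequence representations in a multi-modal trunk that precedes the geometric decoder.
Train with sequence masking: the full sequence is provided for 50% of training examples and masked for the other 50%, so the model also learns unconditional generation.
Build a large filtered AlphaFold2/SwissProt dataset (about 160k structures) using staged quality filtering.
Fine-tune with Reinforced Finetuning (ReFT), using auxiliary rewards to bias generation toward desired properties.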
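
The first and fourth points above (separate SO(3) and R^3 flows, and 50% sequence masking) can be illustrated with a minimal sketch of how one training example might be assembled. This is not the authors' code. The function names, the unit-quaternion rotation format, the Gaussian and uniform priors, the `<mask>` token, and the convention that t = 0 is noise and t = 1 is data are all assumptions made for illustration. Rotations are interpolated along the SO(3) geodesic (slerp on unit quaternions) and translations along a straight line in R^3.

```python
import math
import random

def slerp(q0, q1, t):
    """Geodesic interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # q and -q encode the same rotation; take the shorter arc
        q1 = tuple(-c for c in q1)
        dot = -dot
    dot = min(1.0, dot)
    theta = math.acos(dot)
    if theta < 1e-8:  # rotations are (almost) identical
        return tuple(q0)
    s = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))

def random_unit_quaternion(rng):
    """Uniform random rotation: a normalized 4D Gaussian sample."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def make_training_example(frames, sequence, rng, mask_prob=0.5, mask_token="<mask>"):
    """Build one noisy input (hypothetical helper).

    frames: list of (translation xyz, unit quaternion), one per residue,
            taken from the data backbone (assumed to sit at t = 1).
    """
    t = rng.random()  # flow time in [0, 1)
    noisy_frames, translation_targets = [], []
    for x1, q1 in frames:
        x0 = tuple(rng.gauss(0.0, 1.0) for _ in range(3))  # R^3 prior sample
        q0 = random_unit_quaternion(rng)                   # SO(3) prior sample
        xt = tuple((1.0 - t) * a + t * b for a, b in zip(x0, x1))
        qt = slerp(q0, q1, t)
        noisy_frames.append((xt, qt))
        # Target velocity of the straight-line R^3 path is x1 - x0.
        translation_targets.append(tuple(b - a for a, b in zip(x0, x1)))
    # With probability mask_prob, hide the whole sequence (unconditional case).
    if rng.random() < mask_prob:
        seq_input = [mask_token] * len(sequence)
    else:
        seq_input = list(sequence)
    return t, noisy_frames, translation_targets, seq_input
```

A model trained on such examples would receive the noisy frames, the time t, and the (possibly masked) sequence, and regress the translation and rotation velocities. The rotation target is omitted here to keep the sketch short.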

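Experimental results

Research questions

RQ1: Can a sequence-conditioned SE(3) flow model generate diverse, designable protein backbones?
RQ2: How does sequence conditioning affect the quality and diversity of unconditional generation?
RQ3: Can the model perform conditional tasks such as motif scaffolding, folding, and inpainting?
RQ4: How does Reinforced Finetuning (ReFT) with auxiliary rewards affect secondary-structure diversity and motif-scaffolding performance?
RQ5: How does FoldFlow-2 compare with state-of-the-art unconditional and conditional protein backbone generators?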

Key results

Model	Designability: Frac. <2Å (↑)	Novelty: Frac. TM <0.3 (↑)	Novelty: avg. max TM (↓)	Diversity: pairwise TM (↓)	Diversity: MaxCluster (↑)
RFDiffusion	0.969 ± 0.023	0.116 ± 0.020	0.449 ± 0.012	0.256	0.172
Chroma	0.636 ± 0.030	0.214 ± 0.033	0.412 ± 0.011	0.272	0.132
Genie	0.581 ± 0.064	0.120 ± 0.021	0.434 ± 0.016	0.228	0.274
FrameDiff	0.402 ± 0.062	0.020 ± 0.009	0.542 ± 0.046	0.237	0.310
FoldFlow	0.820 ± 0.037	0.188 ± 0.025	0.460 ± 0.020	0.247	0.228
FoldFlow-2	0.976 ± 0.010	0.368 ± 0.031	0.363 ± 0.009	0.205	0.348
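
As a rough guide to reading the metric columns in the table above, the sketch below computes threshold-fraction and averaging metrics of the same shape from precomputed scores. It assumes, as is common in this line of work but not stated in this review, that "Frac. <2Å" is the fraction of samples whose self-consistency RMSD is below 2 Å, that the novelty columns summarize each sample's maximum TM-score against a reference set, and that "pairwise TM" is the mean TM-score over distinct sample pairs. The function names and all input numbers are made up for illustration. Computing RMSD or TM-score itself requires external tools and is not shown.

```python
def fraction_below(values, threshold):
    """Fraction of values strictly below the threshold."""
    if not values:
        raise ValueError("need at least one value")
    return sum(v < threshold for v in values) / len(values)

def mean(values):
    """Arithmetic mean of a non-empty list."""
    if not values:
        raise ValueError("need at least one value")
    return sum(values) / len(values)

def mean_pairwise(matrix):
    """Mean of the strictly upper-triangular entries of a square score matrix."""
    n = len(matrix)
    pairs = [matrix[i][j] for i in range(n) for j in range(i + 1, n)]
    return mean(pairs)

# Toy inputs (hypothetical values, not taken from the paper).
sc_rmsd = [0.8, 1.5, 2.4, 1.9]        # self-consistency RMSD per sample, in Å
max_tm = [0.25, 0.41, 0.29, 0.55]     # max TM-score to a reference set per sample
pair_tm = [[1.0, 0.2, 0.4],
           [0.2, 1.0, 0.3],
           [0.4, 0.3, 1.0]]           # TM-score between generated samples

designability = fraction_below(sc_rmsd, 2.0)   # higher is better
novelty_frac = fraction_below(max_tm, 0.3)     # higher is better
novelty_avg_max_tm = mean(max_tm)              # lower is better
diversity_pairwise = mean_pairwise(pair_tm)    # lower is better
```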

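FoldFlow-2 achieves state-of-the-art unconditional generation, outperforming RFDiffusion and FoldFlow on designability, novelty, and diversity.
FoldFlow-2 narrows the gap to folding models such as ESMFold and outperforms MultiFlow on folding metrics.
ReFT-based fine-tuning increases secondary-structure diversity and improves conditional design (motif scaffolding, VHH scaffolding).
On the motif-scaffolding benchmark, FoldFlow-2 (+FT) solves 24/24 motifs, and its VHH scaffolding results are competitive.
Zero-shot equilibrium conformation sampling with FoldFlow-2 is competitive with MD-tuned models while using fewer parameters and less compute.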
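
To make the idea of aligning generation to an auxiliary reward more concrete, here is a minimal sketch of one generic way to do it: weighting each sample's training loss by a softmax of its reward, so that high-reward samples dominate the fine-tuning signal. This review does not spell out the exact ReFT objective, so this reward-weighted scheme, the `temperature` parameter, and both function names are assumptions and should not be read as the paper's formulation.

```python
import math

def reward_weights(rewards, temperature=1.0):
    """Softmax weights over a batch: higher reward gives larger weight."""
    if not rewards:
        raise ValueError("need at least one reward")
    top = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]

def reward_weighted_loss(per_sample_losses, rewards, temperature=1.0):
    """Weighted average of per-sample losses using reward_weights."""
    if len(per_sample_losses) != len(rewards):
        raise ValueError("losses and rewards must have the same length")
    weights = reward_weights(rewards, temperature)
    return sum(w * l for w, l in zip(weights, per_sample_losses))
```

With equal rewards this reduces to the plain mean loss. Lowering `temperature` concentrates the weight on the highest-reward samples.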


This review was generated by AI and reviewed by a human editor.