QUICK REVIEW

[Paper Review] DiscDiff: Latent Diffusion Model for DNA Sequence Generation

Zehui Li, Yuhao Ni|arXiv (Cornell University)|2024. 02. 08.

Algorithms and Data Compression | Citations: 5

One-Line Summary

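DiscDiff introduces a latent diffusion framework for generating discrete DNA sequences, adds Absorb-Escape to correct the rounding errors that arise when converting between latent and input spaces, and evaluates both on a new multi-species DNA dataset (EPD-GenDNA).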

ABSTRACT

This paper introduces a novel framework for DNA sequence generation, comprising two key components: DiscDiff, a Latent Diffusion Model (LDM) tailored for generating discrete DNA sequences, and Absorb-Escape, a post-training algorithm designed to refine these sequences. Absorb-Escape enhances the realism of the generated sequences by correcting 'round errors' inherent in the conversion process between latent and input spaces. Our approach not only sets new standards in DNA sequence generation but also demonstrates superior performance over existing diffusion models, in generating both short and long DNA sequences. Additionally, we introduce EPD-GenDNA, the first comprehensive, multi-species dataset for DNA generation, encompassing 160,000 unique sequences from 15 species. We hope this study will advance the generative modelling of DNA, with potential implications for gene therapy and protein production.

Motivation and Objectives

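Argues for dedicated generative modelling of DNA sequences, where progress has been limited by scarce data and the difficulty of evaluation.
Proposes DiscDiff, an LDM suited to discrete DNA data, and Absorb-Escape, which corrects latent-to-input rounding errors.
Builds a large-scale, cross-species DNA generation dataset (EPD-GenDNA) for multi-species evaluation and provides benchmarks on it.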

Proposed Method

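DiscDiff uses a two-stage VAE to map DNA sequences into a continuous latent space.
A latent diffusion denoising model predicts the noise in latent space, and a frozen decoder reconstructs the sequence.
Absorb-Escape, a post-training refinement step, uses a pretrained autoregressive model to correct low-probability regions.
The framework covers both unconditional generation and conditional generation (conditioned on species).
Evaluation uses motif distribution correlation, a diversity metric, and S-FID computed in latent space.
Ablation studies compare VAE architectures and diffusion components.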
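
The refinement idea behind Absorb-Escape can be illustrated with a small sketch. This is not the paper's algorithm: it is a simplified stand-in that assumes the diffusion model exposes a per-position confidence (`dm_conf`), that a fixed `threshold` marks a region as low-probability, and that the autoregressive model is a callable returning next-base probabilities. The names `toy_ar_probs`, `absorb_escape`, `dm_conf`, and `threshold` are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

BASES = "ACGT"


def toy_ar_probs(prefix: str) -> Dict[str, float]:
    """Hypothetical stand-in for a pretrained autoregressive model.

    It simply favours alternating T/A, which is enough to show the control flow.
    """
    if not prefix:
        return {b: 0.25 for b in BASES}
    favoured = "A" if prefix[-1] == "T" else "T"
    return {b: (0.85 if b == favoured else 0.05) for b in BASES}


def absorb_escape(
    seq: str,
    dm_conf: Sequence[float],
    ar_probs: Callable[[str], Dict[str, float]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> str:
    """Replace low-confidence bases with the autoregressive model's choice."""
    if len(seq) != len(dm_conf):
        raise ValueError("seq and dm_conf must have the same length")
    out: List[str] = []
    for base, conf in zip(seq, dm_conf):
        if conf < threshold:
            probs = ar_probs("".join(out))
            best = max(probs, key=probs.get)
            if probs[best] > conf:
                # "Absorb": the autoregressive model is more confident here.
                out.append(best)
                continue
        # "Escape": keep the base produced by the diffusion model.
        out.append(base)
    return "".join(out)


if __name__ == "__main__":
    conf = [0.9, 0.9, 0.9, 0.3, 0.9, 0.9, 0.9, 0.9]
    print(absorb_escape("TATGTATA", conf, toy_ar_probs))  # prints TATATATA
```

Only the fourth base falls below the assumed threshold, so it alone is handed to the autoregressive stand-in; every other position keeps the diffusion output.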

Figure 1: A comparison of Motif frequency distributions. The graphs contrast the occurrences of TATA-Box and Initiator motifs at each position in a set of samples from natural DNA against those generated by various models. A close match in frequency distributions suggests a higher realism and better

Experimental Results

Research Questions

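RQ1: Can a latent diffusion model generate more realistic DNA sequences than existing diffusion baselines, for both short and long sequences?
RQ2: Does Absorb-Escape post-processing improve local nucleotide accuracy and the realism of motif distributions?
RQ3: How does DiscDiff compare with autoregressive baselines in multi-species conditional generation?
RQ4: Which dataset and metrics best capture the quality and diversity of DNA sequences generated across species?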

Key Results

Model	S-FID (small)	Cor_TATA (small)	Delta_Div (small)	S-FID (large)	Cor_TATA (large)	Delta_Div (large)
Random	119.0	-0.241	29.3%	106.0	0.030	13.0%
Sample from Training Set	0.509	1.0	0%	0.100	0.999	0%
VAE	295.0	-0.167	0.40%	250.0	0.007	10.6%
BitDiffusion	405	0.058	44.9%	100.0	0.066	2.00%
D3PM (small)	97.4	0.0964	28.0%	94.5	0.307	0.10%
DDSM (Time Dilation)	504.0	0.897	40.6%	1113.0	0.839	13.0%
DiscDiff (Ours)	57.4	0.973	4.40%	45.2	0.858	4.20%
Absorb-Escape (Ours)	3.21	0.975	5.70%	4.38	0.892	1.90%
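
To make the Cor_TATA column easier to read, here is a rough sketch of a motif-position correlation. It assumes the metric is a Pearson correlation between per-position motif counts in natural and generated sequences, and it finds the motif by exact substring match. Both are simplifying assumptions: the review does not spell out the exact procedure, and real motif scanning may work differently. The function names and the toy sequences are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def motif_position_counts(seqs: Sequence[str], motif: str, length: int) -> List[int]:
    """Count how often `motif` starts at each position (exact match, overlaps allowed)."""
    counts = [0] * length
    for s in seqs:
        start = s.find(motif)
        while start != -1:
            if start < length:
                counts[start] += 1
            start = s.find(motif, start + 1)
    return counts


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


if __name__ == "__main__":
    natural = ["GGTATAGG", "GCTATAGC"]    # toy data, motif at position 2
    aligned = ["AATATACC", "CCTATAAA"]    # motif at the same position
    shifted = ["TATAGGGG", "TATACCCC"]    # motif at position 0
    ref = motif_position_counts(natural, "TATA", 8)
    print(pearson(ref, motif_position_counts(aligned, "TATA", 8)))
    print(pearson(ref, motif_position_counts(shifted, "TATA", 8)))
```

Under these assumptions, a generator that places the motif where natural sequences do scores close to 1, while one that places it elsewhere scores near or below zero, which matches the direction of the Cor_TATA values in the table.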

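DiscDiff achieves state-of-the-art performance among diffusion models on both short and long sequences, with improved S-FID and motif correlation.
Absorb-Escape further improves generation quality, especially on long sequences, by refining low-probability regions with autoregressive correction.
In unconditional generation, DiscDiff outperforms several baselines (including D3PM, BitDiffusion, and DDSM) regardless of dataset scale.
In conditional generation, Absorb-Escape improves reproduction of motif trends and helps balance the motif distribution (TATA-box vs. Initiator).
EPD-GenDNA, a large-scale multi-species DNA generation dataset with 160,000 sequences from 15 species, is used for benchmarking.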
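
The S-FID values cited in these results are lower-is-better distances. The review says only that S-FID is computed in latent space, so the sketch below assumes it follows the standard Fréchet distance between Gaussians fitted to two sets of feature vectors. The feature extractor is left out, and `frechet_distance` and the random arrays are illustrative placeholders rather than the paper's code or data.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 4))   # placeholder "latent features"
    print(frechet_distance(real, real))        # approximately 0
    print(frechet_distance(real, real + 3.0))  # approximately 36 (mean shift only)
```

Shifting every feature by 3 in 4 dimensions leaves the covariance unchanged, so the distance reduces to the squared mean difference, 4 × 3² = 36.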

Figure 2: Generation Task with EPD-GenDNA. (a) Dataset: The EPD-GenDNA dataset includes 160K unique sequences from 15 species and 30 million samples with associated metadata. (b) Generative Modelling: A probabilistic model $p_{\theta}(s)$ is trained to generate new DNA sequences. (c) Model Evaluatio

This review was generated by AI and reviewed by a human editor.