QUICK REVIEW

[Paper Review] DiscDiff: Latent Diffusion Model for DNA Sequence Generation

Zehui Li, Yuhao Ni | arXiv (Cornell University) | Feb 8, 2024

Algorithms and Data Compression | Cited by 5

One-Sentence Summary

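DiscDiff introduces a latent diffusion framework for generating discrete DNA sequences, uses Absorb-Escape to correct the rounding errors that arise when converting between latent and input spaces, and is evaluated on EPD-GenDNA, a new multi-species DNA dataset.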

ABSTRACT

This paper introduces a novel framework for DNA sequence generation, comprising two key components: DiscDiff, a Latent Diffusion Model (LDM) tailored for generating discrete DNA sequences, and Absorb-Escape, a post-training algorithm designed to refine these sequences. Absorb-Escape enhances the realism of the generated sequences by correcting `round errors' inherent in the conversion process between latent and input spaces. Our approach not only sets new standards in DNA sequence generation but also demonstrates superior performance over existing diffusion models, in generating both short and long DNA sequences. Additionally, we introduce EPD-GenDNA, the first comprehensive, multi-species dataset for DNA generation, encompassing 160,000 unique sequences from 15 species. We hope this study will advance the generative modelling of DNA, with potential implications for gene therapy and protein production.

Motivation and Objectives

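Advance generative modelling of DNA sequences, a field held back by data scarcity and evaluation challenges.
Propose DiscDiff, an LDM suited to discrete DNA data, and Absorb-Escape, which corrects rounding errors in the conversion between latent and input spaces.
Build and benchmark EPD-GenDNA, a large multi-species DNA generation dataset, for evaluation across species.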
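
To make the notion of a rounding error concrete, the following minimal sketch shows how a continuous representation of a DNA sequence can decode to the wrong nucleotide once it is slightly off. It is an illustration only: the one-hot encoding, the Gaussian perturbation, and the names `one_hot`, `perturb`, and `round_to_sequence` are assumptions made for this example, not the paper's VAE.

```python
import random

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-dimensional one-hot vectors."""
    return [[1.0 if base == b else 0.0 for b in ALPHABET] for base in seq]

def perturb(vectors, scale, rng):
    """Add Gaussian noise to mimic an imperfect continuous reconstruction (assumption)."""
    return [[x + rng.gauss(0.0, scale) for x in v] for v in vectors]

def round_to_sequence(vectors):
    """Round each continuous 4-vector to the nucleotide with the largest score."""
    return "".join(ALPHABET[max(range(4), key=lambda k: v[k])] for v in vectors)

rng = random.Random(0)
original = "TATAAAAGGC"
noisy = perturb(one_hot(original), scale=0.6, rng=rng)
decoded = round_to_sequence(noisy)

# Positions where rounding picked a different nucleotide than the original.
mismatches = sum(a != b for a, b in zip(original, decoded))
print(original, decoded, mismatches)
```

Without noise the round trip is exact. Any mismatch comes from the continuous-to-discrete rounding step, which is the kind of error Absorb-Escape is described as correcting.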

Proposed Method

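DiscDiff uses a two-stage variational autoencoder (VAE) to map DNA sequences into a continuous latent space.
A latent diffusion denoising model predicts the noise in latent space, and a frozen decoder reconstructs the sequence.
Absorb-Escape, a post-training refinement step, uses a pretrained autoregressive model to correct low-probability regions.
The framework covers both unconditional and conditional generation (conditioned on species).
Evaluation uses motif distribution correlation, diversity metrics, and S-FID computed in a latent space.
Ablation studies compare VAE architectures and diffusion components.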
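
The refinement idea in the list above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's exact algorithm: the threshold value, the greedy token choice, the escape rule (stop once the autoregressive model is no more confident than the diffusion model), and the toy model `toy_ar` are all hypothetical.

```python
REFERENCE = "TATAAAAG"  # hypothetical pattern that the toy autoregressive model has "learned"

def toy_ar(prefix):
    """Hypothetical autoregressive model: returns next-base probabilities given a prefix."""
    pos = len(prefix)
    if pos >= len(REFERENCE):
        return {b: 0.25 for b in "ACGT"}
    rest = (1.0 - 0.9) / 3
    return {b: (0.9 if b == REFERENCE[pos] else rest) for b in "ACGT"}

def absorb_escape(seq, dm_probs, ar_next, threshold=0.8):
    """Replace low-confidence regions of a diffusion sample with autoregressive tokens.

    seq:      sequence produced by the diffusion model
    dm_probs: probability the diffusion model assigns to each of its own tokens
    ar_next:  callable mapping a prefix to a dict of next-base probabilities
    """
    out = list(seq)
    i = 0
    while i < len(out):
        if dm_probs[i] >= threshold:
            i += 1
            continue
        # Absorb: the diffusion token is low-probability, so hand over to the AR model.
        j = i
        while j < len(out):
            dist = ar_next("".join(out[:j]))
            base = max(dist, key=dist.get)  # greedy choice (assumption)
            if dist[base] <= dm_probs[j]:
                break  # Escape: the diffusion token is at least as confident here.
            out[j] = base
            j += 1
        i = max(j, i + 1)
    return "".join(out)

sample = "TATCGAAG"
confidence = [0.95, 0.9, 0.92, 0.4, 0.3, 0.85, 0.95, 0.9]
refined = absorb_escape(sample, confidence, toy_ar)
print(sample, "->", refined)
```

In this toy run, only the stretch starting at the two low-confidence positions is rewritten, and control returns to the diffusion output as soon as the diffusion model is the more confident of the two.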

Figure 1: A comparison of Motif frequency distributions. The graphs contrast the occurrences of TATA-Box and Initiator motifs at each position in a set of samples from natural DNA against those generated by various models. A close match in frequency distributions suggests a higher realism and better

Experimental Results

Research Questions

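RQ1: Can a latent diffusion model generate more realistic DNA sequences than existing diffusion baselines in both short- and long-sequence settings?
RQ2: Does Absorb-Escape post-processing improve local nucleotide accuracy and the realism of motif distributions?
RQ3: How does DiscDiff compare with autoregressive baselines in conditional generation across multiple species?
RQ4: Which datasets and evaluation metrics best capture the quality and diversity of DNA sequences generated across species?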

Key Findings

Model	S-FID (Small)	Cor_TATA (Small)	Delta_Div (Small)	S-FID (Large)	Cor_TATA (Large)	Delta_Div (Large)
Random	119.0	-0.241	29.3%	106.0	0.030	13.0%
Sample from Training Set	0.509	1.0	0%	0.100	0.999	0%
VAE	295.0	-0.167	0.40%	250.0	0.007	10.6%
BitDiffusion	405	0.058	0.449%	100.0	0.066	2.00%
D3PM (small)	97.4	0.0964	28.0%	94.5	0.307	0.10%
DDSM (Time Dilation)	504.0	0.897	40.6%	1113.0	0.839	13.0%
DiscDiff (Ours)	57.4	0.973	4.40%	45.2	0.858	4.20%
Absorb-Escape (Ours)	3.21	0.975	5.70%	4.38	0.892	1.90%
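
A motif correlation of the kind reported in the Cor_TATA columns can be approximated by counting where a motif occurs in natural and in generated sequences, then correlating the two positional profiles. The sketch below is a hedged illustration: the consensus string `TATAAA`, exact string matching, and the helper names are assumptions, and the paper's motif scanning procedure may differ.

```python
from math import sqrt

def motif_position_counts(seqs, motif):
    """Count, for each position, how many times the motif starts there across all sequences."""
    counts = [0] * max(len(s) for s in seqs)
    for s in seqs:
        start = s.find(motif)
        while start != -1:
            counts[start] += 1
            start = s.find(motif, start + 1)
    return counts

def pearson(x, y):
    """Pearson correlation of two equal-length profiles; returns 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined here; 0.0 is a choice made for this sketch
    return cov / sqrt(vx * vy)

# Toy data (made up for illustration); both sets use sequences of the same length.
natural = ["GGTATAAAGG", "CGTATAAACC", "GGGTATAAAC"]
generated = ["GCTATAAAGC", "CCTATAAAGG", "TATAAAGGCC"]

nat_profile = motif_position_counts(natural, "TATAAA")
gen_profile = motif_position_counts(generated, "TATAAA")
print(nat_profile, gen_profile, round(pearson(nat_profile, gen_profile), 3))
```

Identical profiles give a correlation of 1, which matches the value of about 1.0 in the "Sample from Training Set" row.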

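DiscDiff achieves state-of-the-art results among diffusion models for both short and long DNA sequence generation, with better S-FID and motif correlation.
Absorb-Escape further improves generation quality, especially for long sequences, by using autoregressive correction to refine low-probability regions.
DiscDiff outperforms several baselines (such as D3PM, BitDiffusion, and DDSM) in unconditional generation at both dataset scales.
In conditional generation, Absorb-Escape improves how well motif trends are reproduced and yields balanced motif distributions (TATA-box and Initiator).
EPD-GenDNA is introduced as a large multi-species DNA generation dataset (160K sequences, 15 species) for benchmarking.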

Figure 2: Generation Task with EPD-GenDNA. (a) Dataset: The EPD-GenDNA dataset includes 160K unique sequences from 15 species and 30 million samples with associated metadata. (b) Generative Modelling: A probabilistic model $p_{\theta}(s)$ is trained to generate new DNA sequences. (c) Model Evaluatio


This review was generated by AI and reviewed by human editors.