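[Paper Summary] DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models
DiffuSeq introduces a classifier-free diffusion model for Seq2Seq text generation. It decodes in parallel (non-autoregressively), achieves strong quality with markedly high diversity, and connects diffusion models to autoregressive and iterative-NAR frameworks.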
Recently, diffusion models have emerged as a new paradigm for generative models. Despite the success in domains using continuous signals such as vision and audio, adapting diffusion models to natural language is under-explored due to the discrete nature of texts, especially for conditional generation. We tackle this challenge by proposing DiffuSeq: a diffusion model designed for sequence-to-sequence (Seq2Seq) text generation tasks. Upon extensive evaluation over a wide range of Seq2Seq tasks, we find DiffuSeq achieving comparable or even better performance than six established baselines, including a state-of-the-art model that is based on pre-trained language models. Apart from quality, an intriguing property of DiffuSeq is its high diversity during generation, which is desired in many Seq2Seq tasks. We further include a theoretical analysis revealing the connection between DiffuSeq and autoregressive/non-autoregressive models. Bringing together theoretical analysis and empirical evidence, we demonstrate the great potential of diffusion models in complex conditional language generation tasks. Code is available at https://github.com/Shark-NLP/DiffuSeq.
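Research Motivation and Goals
- Advance diffusion models for discrete, conditional text generation in Seq2Seq tasks.
- Develop a classifier-free diffusion model that conditions on the source sequence without relying on an external classifier.
- Enable non-autoregressive parallel decoding that improves diversity while maintaining quality.
- Establish theoretical connections between DiffuSeq and autoregressive, iterative-NAR, and fully-NAR models.
- Demonstrate empirical effectiveness across multiple Seq2Seq tasks.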
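Proposed Method
- Embed discrete text pairs (source and target) into a shared continuous space and apply a partial-noising forward process that perturbs only the target part.
- Model reverse denoising with a Transformer-based network that learns p_θ(z_{t−1} | z_t), without an auxiliary classifier (classifier-free).
- Use a unified embedding Emb(w^x ⊕ w^y) so that source and target representations are trained jointly.
- Derive the variational lower bound L_VLB and minimize a simplified objective that emphasizes reconstruction of y_0 and embedding consistency.
- Apply importance sampling over diffusion steps to stabilize training, and use Minimum Bayes Risk (MBR) decoding to improve final output quality.
- Relate DiffuSeq to autoregressive, iterative-NAR, and fully-NAR models, arguing that DiffuSeq extends iterative-NAR.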
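The partial-noising step in the first bullet can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the linear beta schedule, and the list-of-lists embedding layout are all assumptions chosen to keep the example dependency-free, and the schedule used in the paper may differ. The only point it demonstrates is that Gaussian noise is applied to the target positions of the concatenated sequence while the source positions stay clean and act as the condition.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return alpha_bar[t] for t = 0..num_steps, with alpha_bar[0] = 1.0.

    Assumption: a standard linear beta schedule, used only for illustration.
    """
    alpha_bar = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta))
    return alpha_bar


def partial_noise(z0, src_len, t, alpha_bar, rng):
    """Sample z_t from q(z_t | z_0), leaving the source positions untouched.

    z0: list of embedding vectors for the concatenated source + target sequence.
    src_len: number of leading positions that belong to the source.
    """
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    zt = []
    for pos, vec in enumerate(z0):
        if pos < src_len:
            # Source part is copied unchanged so it can condition the denoiser.
            zt.append(list(vec))
        else:
            # Target part follows the usual Gaussian forward process.
            zt.append([signal * v + noise * rng.gauss(0.0, 1.0) for v in vec])
    return zt


# Hypothetical toy input: two source positions, one target position.
rng = random.Random(0)
alpha_bar = linear_alpha_bar(10)
z0 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
zt = partial_noise(z0, src_len=2, t=10, alpha_bar=alpha_bar, rng=rng)
```

Because the source positions are never perturbed, the denoising network always sees a clean copy of the condition, which is what lets the model work without a separate classifier.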
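Experimental Results
Research Questions
- RQ1: Can diffusion models be adapted effectively to conditional Seq2Seq text generation without a classifier?
- RQ2: How does the partial-noising forward process affect conditional generation and the modeling of dependencies between the source w^x and the target w^y?
- RQ3: How does DiffuSeq relate to autoregressive, iterative-NAR, and fully-NAR models, and does it offer advantages in quality and diversity?
- RQ4: Does joint training of a shared embedding for w^x and w^y improve over decoupled or pre-extracted representations?
- RQ5: Are diffusion-based Seq2Seq models competitive on standard Seq2Seq tasks in both quality and diversity?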
Key Findings
| Task | Method | BLEU ↑ | R-L ↑ | BERTScore ↑ | dist-1 ↑ | selfB ↓ / div-4 ↑ | Len |
|---|---|---|---|---|---|---|---|
| Open Domain Dialogue | GRU-attention ⋄ | 0.0068 | 0.1054 | 0.4128 | 0.8998 | 0.8008/0.1824 | 4.46 |
| Open Domain Dialogue | Transformer-base ⋄ | 0.0189 | 0.1039 | 0.4781 | 0.7493 | 0.3698/0.6472 | 19.5 |
| Open Domain Dialogue | GPT2-base FT ∙ | 0.0108 | 0.1508 | 0.5279 | 0.9194 | 0.0182/0.9919 | 16.8 |
| Open Domain Dialogue | GPT2-large FT ∙ | 0.0125 | 0.1002 | 0.5293 | 0.9244 | 0.0213/0.9938 | 16.8 |
| Open Domain Dialogue | GPVAE-T5 ∙ | 0.0110 | 0.1009 | 0.4317 | 0.5625 | 0.3560/0.5551 | 20.1 |
| Open Domain Dialogue | NAR-LevT ‡ | 0.0158 | 0.0550 | 0.4760 | 0.9726 | 0.7103/0.1416 | 4.11 |
| Open Domain Dialogue | DiffuSeq (Ours) ‡ | 0.0139 | 0.1056 | 0.5131 | 0.9467 | 0.0144 / 0.9971 | 13.6 |
| Question Generation | GRU-attention ⋄ | 0.0651 | 0.2617 | 0.5222 | 0.7930 | 0.9999/0.3178 | 10.1 |
| Question Generation | Transformer-base ⋄ | 0.1663 | 0.3441 | 0.6307 | 0.9309 | 0.3265/0.7720 | 10.3 |
| Question Generation | GPT2-base FT ∙ | 0.0741 | 0.2714 | 0.6052 | 0.9602 | 0.1403 / 0.9216 | 10.0 |
| Question Generation | GPT2-large FT ∙ | 0.1110 | 0.3215 | 0.6346 | 0.9670 | 0.2910/0.8062 | 9.96 |
| Question Generation | GPVAE-T5 ∙ | 0.1251 | 0.3390 | 0.6308 | 0.9381 | 0.3567/0.7282 | 11.4 |
| Question Generation | NAR-LevT ‡ | 0.0930 | 0.2893 | 0.5491 | 0.8914 | 0.9830/0.4776 | 6.93 |
| Question Generation | DiffuSeq (Ours) ‡ | 0.1731 | 0.3665 | 0.6123 | 0.9056 | 0.2789 / 0.8103 | 11.5 |
| Text Simplification | GRU-attention ⋄ | 0.3256 | 0.5602 | 0.7871 | 0.8883 | 0.9998/0.3313 | 18.9 |
| Text Simplification | Transformer-base ⋄ | 0.2693 | 0.4907 | 0.7381 | 0.8886 | 0.6924/0.5095 | 18.5 |
| Text Simplification | GPT2-base FT ∙ | 0.3083 | 0.5461 | 0.8021 | 0.9439 | 0.5444/0.6047 | 16.1 |
| Text Simplification | GPT2-large FT ∙ | 0.2693 | 0.5111 | 0.7882 | 0.9464 | 0.6042/0.5876 | 15.4 |
| Text Simplification | GPVAE-T5 ∙ | 0.3392 | 0.5828 | 0.8166 | 0.9308 | 0.8147/0.4355 | 18.5 |
| Text Simplification | NAR-LevT ‡ | 0.2052 | 0.4402 | 0.7254 | 0.9715 | 0.9907/0.3271 | 8.31 |
| Text Simplification | DiffuSeq (Ours) ‡ | 0.3622 | 0.5849 | 0.8126 | 0.9264 | 0.4642 / 0.6604 | 17.7 |
| Paraphrase | GRU-attention ⋄ | 0.1894 | 0.5129 | 0.7763 | 0.9423 | 0.9958/0.3287 | 8.30 |
| Paraphrase | Transformer-base ⋄ | 0.2722 | 0.5748 | 0.8381 | 0.9748 | 0.4483/0.7345 | 11.2 |
| Paraphrase | GPT2-base FT ∙ | 0.1980 | 0.5212 | 0.8246 | 0.9798 | 0.5480/0.6245 | 9.67 |
| Paraphrase | GPT2-large FT ∙ | 0.2059 | 0.5415 | 0.8363 | 0.9819 | 0.7325/0.5020 | 9.53 |
| Paraphrase | GPVAE-T5 ∙ | 0.2409 | 0.5886 | 0.8466 | 0.9688 | 0.5604/0.6169 | 9.60 |
| Paraphrase | NAR-LevT ‡ | 0.2268 | 0.5795 | 0.8344 | 0.9790 | 0.9995/0.3329 | 8.85 |
| Paraphrase | DiffuSeq (Ours) ‡ | 0.2413 | 0.5880 | 0.8365 | 0.9807 | 0.2732 / 0.8641 | 11.2 |
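To make the diversity columns in the table above more concrete, here is a minimal sketch of two distinct n-gram ratios. It assumes whitespace tokenization and a plain "unique n-grams over total n-grams" definition; the paper's exact tokenizer and metric implementation may differ, and the function names are hypothetical.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_ratio(sentences, n):
    """Fraction of n-grams, pooled over all sentences, that are unique."""
    pooled = []
    for sentence in sentences:
        pooled.extend(ngrams(sentence.split(), n))
    if not pooled:
        return 0.0  # no n-grams of this order exist
    return len(set(pooled)) / len(pooled)


def dist_1(sentence):
    """Intra-sentence diversity: distinct unigrams over total unigrams."""
    return distinct_ratio([sentence], 1)


def div_4(samples):
    """Set-level diversity: distinct 4-grams over all 4-grams in the samples
    generated for one source sentence."""
    return distinct_ratio(samples, 4)
```

Under this definition, a set of identical samples scores low on div-4 while a set of samples that share no 4-grams scores 1.0, which matches the direction of the arrows in the table header.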
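- Across four Seq2Seq tasks, DiffuSeq reaches quality comparable to or better than six strong baselines, including a state-of-the-art model based on pre-trained language models.
- DiffuSeq achieves the lowest self-BLEU and highest div-4 on three of the four tasks (GPT2-base FT is more diverse on Question Generation) while staying competitive on BLEU, ROUGE-L, and BERTScore.
- The model shows strong sentence-level diversity; when that diversity is exploited, for example with a larger candidate set in MBR decoding, it can surpass autoregressive baselines.
- Joint training of the shared embedding for w^x and w^y matters for performance; decoupled training strategies degrade results.
- DiffuSeq provides a theoretical and empirical bridge between autoregressive, iterative-NAR, and diffusion approaches, establishing diffusion as a viable extension for conditional language generation.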
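The MBR decoding mentioned above can be sketched as follows: given several sampled candidates for one source, pick the one that agrees most with the rest. This is an illustrative sketch only. The unigram-F1 utility is a simple stand-in chosen to keep the example dependency-free, not the similarity function used in the paper, and `mbr_select` is a hypothetical name.

```python
from collections import Counter


def unigram_f1(hyp, ref):
    """Token-overlap F1 between two strings (stand-in utility, an assumption)."""
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=unigram_f1):
    """Return the candidate with the highest average utility against the others."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    if len(candidates) == 1:
        return candidates[0]
    best, best_score = None, float("-inf")
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / len(others)
        if score > best_score:  # ties keep the earliest candidate
            best, best_score = hyp, score
    return best


# Hypothetical candidate set sampled for a single source sentence.
samples = ["the cat sat on the mat", "the cat sat on a mat", "a dog ran away"]
chosen = mbr_select(samples)
```

A more diverse candidate pool gives the selection step more to choose from, which is one way to read the finding that larger candidate sets help DiffuSeq in particular.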
This summary was generated by AI and reviewed by human editors.