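[Paper Explained] ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech
ProDiff predicts clean data directly using a generator-based diffusion parameterization and applies knowledge distillation to halve the number of diffusion steps. It produces high-quality mel-spectrograms in two iterations and synthesizes speech about 24x faster than real time on a single GPU.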
Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performances in many generative tasks. However, the inherited iterative sampling process costs hinder their applications to text-to-speech deployment. Through the preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Unlike previous work estimating the gradient for data density, ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling. To tackle the model convergence challenge with decreased diffusion iterations, ProDiff reduces the data variance in the target side via knowledge distillation. Specifically, the denoising model uses the generated mel-spectrogram from an N-step DDIM teacher as the training target and distills the behavior into a new model with N/2 steps. As such, it allows the TTS model to make sharp predictions and further reduces the sampling time by orders of magnitude. Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms, while it maintains sample quality and diversity competitive with state-of-the-art models using hundreds of steps. ProDiff enables a sampling speed of 24x faster than real-time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech synthesis deployment for the first time. Our extensive ablation studies demonstrate that each design in ProDiff is effective, and we further show that ProDiff can be easily extended to the multi-speaker setting. Audio samples are available at https://ProDiff.github.io/
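Research Motivation and Objectives
- Evaluate diffusion parameterizations for TTS and identify the bottlenecks in sampling speed and quality.
- Propose ProDiff, a generator-based diffusion model that uses knowledge distillation to reduce the number of iterations.
- Show that ProDiff reaches high fidelity and preserves diversity while using far fewer sampling steps.
- Compare ProDiff against state-of-the-art TTS models on standard benchmarks and in ablation studies.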
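Proposed Method
- Compare gradient-based and generator-based diffusion parameterizations for TTS denoising.
- Introduce ProDiff with generator-based denoising, which avoids estimating the score-matching gradient.
- Use knowledge distillation from an N-step DDIM teacher to train an N/2-step student, reducing variance on the target side.
- Build ProDiff on the FastSpeech 2 architecture, with a spectrogram denoiser and a dedicated training loss that combines reconstruction, SSIM, and variance terms.
- Train with DDIM-based targets from a 4-step teacher, then distill into a 2-step student; use auxiliary losses (SSIM, duration/pitch/energy) to improve quality.
- At inference, predict x0 at each step and reconstruct x_{t-1} through the posterior, then synthesize the waveform with a vocoder.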
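The inference procedure in the last bullet can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the linear beta schedule and its endpoints, the `denoiser(x_t, t)` signature, and all function names are hypothetical and introduced here only for illustration. Text conditioning and the vocoder are omitted.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.06):
    # Assumed linear beta schedule; the endpoints are illustrative only.
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def posterior_step(x_t, x0_hat, t, betas, alphas, alpha_bars, rng):
    """Sample from the Gaussian posterior q(x_{t-1} | x_t, x0_hat).

    `t` is a 0-indexed step; at t == 0 the posterior collapses onto
    x0_hat, so the prediction is returned directly.
    """
    if t == 0:
        return x0_hat
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    # Standard DDPM posterior mean coefficients and variance.
    coef_x0 = np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
    var = betas[t] * (1.0 - ab_prev) / (1.0 - ab_t)
    mean = coef_x0 * x0_hat + coef_xt * x_t
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)

def sample(denoiser, shape, num_steps, rng):
    """Generator-based sampling: the network predicts x0 at every step."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(shape)  # start from Gaussian noise
    for t in reversed(range(num_steps)):
        x0_hat = denoiser(x, t)  # hypothetical x0-predicting network
        x = posterior_step(x, x0_hat, t, betas, alphas, alpha_bars, rng)
    return x
```

With `num_steps=2`, the loop calls the denoiser twice, which corresponds to the two-iteration setting reported for ProDiff.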
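Experimental Results
Research Questions
- RQ1: Can generator-based diffusion for TTS sample faster than gradient-based diffusion while maintaining or improving audio quality?
- RQ2: Does knowledge distillation from an N-step teacher to an N/2-step student stabilize training and accelerate inference without sacrificing diversity?
- RQ3: How does ProDiff compare with autoregressive and non-autoregressive TTS models on standard benchmarks in quality, speed, and diversity?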
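Key Findings
- ProDiff obtains high-quality mel-spectrograms with only 2 diffusion steps.
- At low step counts, the generator-based parameterization is more robust to sampling acceleration than the gradient-based one.
- Knowledge distillation from a 4-step teacher to a 2-step student reduces target variance and accelerates sampling by orders of magnitude.
- On the LJSpeech dataset, 2-step ProDiff matches or exceeds several baselines in perceptual quality and diversity, while sampling about 24x faster than real time on a single 2080Ti GPU.
- ProDiff maintains sample quality and diversity competitive with state-of-the-art models that use hundreds of diffusion steps, and it extends to the multi-speaker setting.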
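The step-halving distillation mentioned in the findings above can be illustrated with a small sketch. It follows one common formulation, the progressive-distillation recipe, in which two deterministic DDIM steps of the teacher are matched by a single student step. The summary does not spell out the exact target ProDiff uses, so treat this as an assumption. The cosine noise schedule, the continuous time variable `u` in (0, 1], the `teacher(x, u)` and `student(x, u)` signatures, and all function names are likewise hypothetical.

```python
import numpy as np

def signal(u):
    # Assumed cosine schedule: sqrt(alpha_bar) at continuous time u in [0, 1].
    return np.cos(0.5 * np.pi * u)

def noise(u):
    # Matching noise level: sqrt(1 - alpha_bar).
    return np.sin(0.5 * np.pi * u)

def ddim_step(x_t, x0_hat, u_t, u_s):
    """Deterministic DDIM update from time u_t to an earlier time u_s,
    given an x0 prediction (requires u_t > 0)."""
    eps_hat = (x_t - signal(u_t) * x0_hat) / noise(u_t)
    return signal(u_s) * x0_hat + noise(u_s) * eps_hat

def distill_target(teacher, x_t, u_t, student_steps):
    """x0 target that makes ONE student DDIM step land where TWO
    teacher DDIM steps land."""
    u_mid = u_t - 0.5 / student_steps
    u_s = u_t - 1.0 / student_steps
    x_mid = ddim_step(x_t, teacher(x_t, u_t), u_t, u_mid)
    x_s = ddim_step(x_mid, teacher(x_mid, u_mid), u_mid, u_s)
    # Invert a single DDIM step from u_t to u_s for the implied x0.
    r = noise(u_s) / noise(u_t)
    return (x_s - r * x_t) / (signal(u_s) - r * signal(u_t))

def distill_loss(student, teacher, x_t, u_t, student_steps):
    # Mean squared error between the student's x0 prediction and the target.
    target = distill_target(teacher, x_t, u_t, student_steps)
    return np.mean((student(x_t, u_t) - target) ** 2)
```

With `student_steps=2`, the student is evaluated at `u_t = 1.0` and `u_t = 0.5`, and each evaluation covers two steps of a 4-step teacher.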
This explainer was generated by AI and reviewed by a human editor.