QUICK REVIEW

[Paper Review] ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

Rongjie Huang, Zhou Zhao | arXiv (Cornell University) | 2022-07-13

Speech Recognition and Synthesis | Citations: 21

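One-Line Summary

ProDiff uses generator-based diffusion to predict clean data directly and applies knowledge distillation to halve the number of diffusion steps, producing high-quality mel-spectrograms in 2 iterations and synthesizing speech about 24x faster than real time on a single GPU.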

ABSTRACT

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performances in many generative tasks. However, the inherited iterative sampling process costs hinder their applications to text-to-speech deployment. Through the preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Unlike previous work estimating the gradient for data density, ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling. To tackle the model convergence challenge with decreased diffusion iterations, ProDiff reduces the data variance in the target side via knowledge distillation. Specifically, the denoising model uses the generated mel-spectrogram from an N-step DDIM teacher as the training target and distills the behavior into a new model with N/2 steps. As such, it allows the TTS model to make sharp predictions and further reduces the sampling time by orders of magnitude. Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms, while it maintains sample quality and diversity competitive with state-of-the-art models using hundreds of steps. ProDiff enables a sampling speed of 24x faster than real-time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech synthesis deployment for the first time. Our extensive ablation studies demonstrate that each design in ProDiff is effective, and we further show that ProDiff can be easily extended to the multi-speaker setting. Audio samples are available at https://ProDiff.github.io/

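Research Motivation and Objectives

Evaluate diffusion parameterizations for TTS and identify the bottlenecks in sampling speed and quality.
Propose ProDiff, a generator-based diffusion model that uses knowledge distillation to reduce the number of sampling iterations.
Show that ProDiff achieves high fidelity with far fewer sampling steps while preserving diversity.
Compare ProDiff against state-of-the-art TTS models on a standard benchmark and through ablation studies.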

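Proposed Method

Compares gradient-based and generator-based diffusion parameterizations for TTS denoising.
Introduces ProDiff, which denoises with a generator-based parameterization (direct prediction of clean data) instead of score-matching gradient estimation.
Uses knowledge distillation from an N-step DDIM teacher to train an N/2-step student, which reduces the variance on the target side.
Combines a FastSpeech 2-style architecture with a spectrogram denoiser and trains it with a loss that combines reconstruction, SSIM, and variance terms.
Trains on DDIM-generated targets from a 4-step teacher, distills into a 2-step student, and adds SSIM and duration/pitch/energy losses to improve quality.
At each step, predicts x_0, samples x_{t-1} from the posterior distribution, and after the final step synthesizes the waveform with a vocoder.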
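
The sampling loop in the last item can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard DDPM Gaussian posterior q(x_{t-1} | x_t, x_0), and the names `linear_schedule`, `posterior_step`, `sample`, and `toy_denoiser`, along with the beta values, are hypothetical. A real system would replace `toy_denoiser` with the trained spectrogram denoiser and pass the result to a vocoder.

```python
import math
import random


def linear_schedule(num_steps, beta_start=0.05, beta_end=0.5):
    """Return (betas, alpha_bars) for a linear schedule. Values are illustrative only."""
    step = (beta_end - beta_start) / max(num_steps - 1, 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def posterior_step(x_t, x0_hat, t, betas, alpha_bars, rng):
    """Sample x_{t-1} from q(x_{t-1} | x_t, x0_hat) for the 0-indexed step t."""
    if t == 0:
        # At the last step the posterior collapses onto the predicted clean data.
        return list(x0_hat)
    beta_t = betas[t]
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    coef_x0 = math.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    coef_xt = math.sqrt(1.0 - beta_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    std = math.sqrt(beta_t * (1.0 - ab_prev) / (1.0 - ab_t))
    return [
        coef_x0 * x0 + coef_xt * xt + std * rng.gauss(0.0, 1.0)
        for x0, xt in zip(x0_hat, x_t)
    ]


def sample(denoiser, dim, num_steps, seed=0):
    """Generator-based sampling: predict x0 directly, then step through the posterior."""
    rng = random.Random(seed)
    betas, alpha_bars = linear_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    for t in reversed(range(num_steps)):
        x0_hat = denoiser(x, t)  # the network outputs clean data, not a gradient
        x = posterior_step(x, x0_hat, t, betas, alpha_bars, rng)
    return x


def toy_denoiser(x_t, t):
    """Hypothetical stand-in for the trained denoiser: always predicts one fixed frame."""
    return [0.5, -0.25, 1.0]


mel_frame = sample(toy_denoiser, dim=3, num_steps=2)
print(mel_frame)
```

Because the network predicts x_0 at every step, the final output is whatever the denoiser predicts at the last step, which is one intuition for why this parameterization tolerates very small step counts.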

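Experimental Results

Research Questions

RQ1: Can generator-based diffusion speed up sampling for TTS relative to gradient-based diffusion while preserving or improving audio quality?
RQ2: Does knowledge distillation from an N-step teacher to an N/2-step student substantially improve training stability and inference speed without harming diversity?
RQ3: How does ProDiff compare with autoregressive and non-autoregressive TTS models on a standard benchmark in quality, speed, and diversity?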

Main Results

Method	MOS	MCD	STOI	PESQ	NDB	JS	RTF
GT	4.41 ± 0.06	/	/	/	/	/	/
GT(voc.)	4.25 ± 0.06	1.08	0.95	3.18	0.23	0.002	/
Tacotron 2	3.90 ± 0.07	5.30	0.18	1.14	0.88	0.022	/
FastSpeech 2	3.92 ± 0.05	4.06	0.23	0.99	0.79	0.021	0.01
GANSpeech	4.00 ± 0.05	4.02	0.21	0.96	0.73	0.104	0.02
Glow-TTS	4.01 ± 0.07	4.35	0.19	1.00	0.74	0.012	0.01
Grad-TTS (64 steps)	4.05 ± 0.06	3.36	0.19	1.48	0.57	0.023	0.19
DiffSpeech (128 steps)	4.09 ± 0.06	3.48	0.83	2.40	0.67	0.008	1.11
ProDiff (2 steps)	4.08 ± 0.07	3.15	0.85	2.55	0.69	0.012	0.04
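
The RTF column can be related to the "24x faster than real time" figure with a short calculation. The sketch below assumes the conventional definition of real-time factor (seconds of compute per second of synthesized audio), so the speedup over real time is 1 / RTF. The RTF values are copied from the table above; the helper name `realtime_speedup` is hypothetical.

```python
# RTF values copied from the results table (lower is faster).
rtf = {
    "Grad-TTS (64 steps)": 0.19,
    "DiffSpeech (128 steps)": 1.11,
    "ProDiff (2 steps)": 0.04,
}


def realtime_speedup(rtf_value):
    """Speedup over real time, assuming RTF = synthesis time / audio duration."""
    return 1.0 / rtf_value


speedups = {name: realtime_speedup(value) for name, value in rtf.items()}
for name, speedup in speedups.items():
    print(f"{name}: {speedup:.1f}x real time")
```

With the table's two-decimal RTF of 0.04, the computed speedup for ProDiff is about 25x, which is consistent with the reported figure of roughly 24x once rounding is taken into account. An RTF above 1, as for DiffSpeech with 128 steps, means synthesis is slower than real time.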

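ProDiff produces high-quality mel-spectrograms with only 2 diffusion steps.
Generator-based parameterization is more robust to sampling acceleration at low step counts than gradient-based parameterization.
Knowledge distillation from the 4-step teacher to the 2-step student reduces target variance and substantially accelerates sampling.
On LJSpeech, 2-step ProDiff matches or exceeds the baselines on several subjective-quality and diversity metrics while sampling about 24x faster than real time on a single 2080Ti GPU.
ProDiff maintains sample quality and diversity comparable to state-of-the-art models that use hundreds of steps, and it extends to the multi-speaker setting.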
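
To make the teacher-to-student distillation concrete, here is a minimal sketch of how a training target for a half-step student could be built from two deterministic DDIM steps of a teacher. The target formula follows a common progressive-distillation formulation and is an assumption here, since the review only states that the student learns from teacher-generated targets. `toy_teacher`, `ddim_step`, `distill_target`, and all numeric values are hypothetical.

```python
import math

CLEAN = [0.5, -0.25, 1.0]  # hypothetical "clean" mel frame


def toy_teacher(x, alpha_bar):
    """Hypothetical stand-in for a trained teacher that predicts x0 from a noisy input."""
    return [0.9 * c + 0.1 * v for c, v in zip(CLEAN, x)]


def ddim_step(x_t, x0_hat, ab_t, ab_s):
    """Deterministic DDIM move from noise level ab_t to ab_s, given an x0 estimate."""
    eps_hat = [
        (x - math.sqrt(ab_t) * x0) / math.sqrt(1.0 - ab_t)
        for x, x0 in zip(x_t, x0_hat)
    ]
    return [
        math.sqrt(ab_s) * x0 + math.sqrt(1.0 - ab_s) * e
        for x0, e in zip(x0_hat, eps_hat)
    ]


def distill_target(x_t, teacher, ab_t, ab_mid, ab_s):
    """Return (student x0 target, teacher output after two DDIM steps)."""
    # Two teacher steps: t -> mid -> s.
    x_mid = ddim_step(x_t, teacher(x_t, ab_t), ab_t, ab_mid)
    x_s = ddim_step(x_mid, teacher(x_mid, ab_mid), ab_mid, ab_s)
    # Solve for the x0 value that makes ONE DDIM step t -> s land on x_s.
    ratio = math.sqrt(1.0 - ab_s) / math.sqrt(1.0 - ab_t)
    denom = math.sqrt(ab_s) - ratio * math.sqrt(ab_t)
    target = [(xs - ratio * xt) / denom for xs, xt in zip(x_s, x_t)]
    return target, x_s


x_t = [1.2, -0.7, 0.3]
target, teacher_out = distill_target(x_t, toy_teacher, ab_t=0.2, ab_mid=0.6, ab_s=0.9)
student_out = ddim_step(x_t, target, 0.2, 0.9)
max_gap = max(abs(a - b) for a, b in zip(student_out, teacher_out))
print(max_gap)
```

A student that regresses onto `target` reproduces in one step what the teacher does in two. When the final noise level is fully clean (`ab_s=1.0`), the target reduces to the teacher's generated sample itself, which matches the description of using the teacher's mel-spectrogram as the training target.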

This review was generated by AI and reviewed by a human editor.