QUICK REVIEW

[Paper Review] Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram

Ryuichi Yamamoto, Eunwoo Song | arXiv (Cornell University) | October 25, 2019

Speech and Audio Processing | References: 28 | Citations: 48

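One-Line Summary

tldr: Parallel WaveGAN trains a non-autoregressive WaveNet without distillation, using multi-resolution STFT and adversarial losses. With 1.44 M parameters, it generates 24 kHz speech 28.68 times faster than real time on a single GPU, and its MOS is comparable to that of distillation-based methods.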

ABSTRACT

We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network. In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the realistic speech waveform. As our method does not require density distillation used in the conventional teacher-student framework, the entire model can be easily trained. Furthermore, our model is able to generate high-fidelity speech even with its compact architecture. In particular, the proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveform 28.68 times faster than real-time on a single GPU environment. Perceptual listening test results verify that our proposed method achieves 4.16 mean opinion score within a Transformer-based text-to-speech framework, which is comparative to the best distillation-based Parallel WaveNet system.

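Motivation and Goals

Goal: generate waveforms quickly and with high fidelity without relying on density distillation.
Motivation: the conventional teacher-student framework requires density distillation, which makes the overall model harder to train.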

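Proposed Method

Use a non-autoregressive WaveNet conditioned on acoustic features as the generator.
Train it with a combination of a waveform-domain adversarial loss and an auxiliary multi-resolution STFT loss.
Adopt the Least-Squares GAN formulation for stable training.
Apply STFT losses at several resolutions (different FFT sizes, window sizes, and frame shifts) to capture time-frequency characteristics.
Train the generator by jointly optimizing L_G = L_aux + lambda_adv * L_adv.
Compare against autoregressive WaveNet and ClariNet baselines, and evaluate by MOS in a TTS setting.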
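The generator objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes each single-resolution STFT loss is a spectral-convergence term plus a log-magnitude L1 term, and that the adversarial terms follow the usual least-squares GAN form. The resolution triples in `RESOLUTIONS`, the default `lambda_adv=4.0`, and all function names are assumptions made for this example, since the summary does not list the exact settings. Unlike typical framework STFT routines, the hypothetical `stft_magnitude` helper does no centering or edge padding.

```python
import numpy as np

def stft_magnitude(x, fft_size, win_length, hop):
    """Magnitude STFT with a Hann window; frames are zero-padded to fft_size."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack([x[i * hop:i * hop + win_length] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=fft_size, axis=1))
    return np.maximum(mag, 1e-7)  # floor avoids log(0)

def stft_loss(x, x_hat, fft_size, win_length, hop):
    """Single-resolution loss: spectral convergence + log-magnitude L1 (assumed form)."""
    s = stft_magnitude(x, fft_size, win_length, hop)
    s_hat = stft_magnitude(x_hat, fft_size, win_length, hop)
    spectral_convergence = np.linalg.norm(s - s_hat) / np.linalg.norm(s)
    log_magnitude = np.mean(np.abs(np.log(s) - np.log(s_hat)))
    return spectral_convergence + log_magnitude

# Assumed (FFT size, window length, frame shift) triples, in samples.
RESOLUTIONS = [(1024, 600, 120), (2048, 1200, 240), (512, 240, 50)]

def multi_resolution_stft_loss(x, x_hat, resolutions=RESOLUTIONS):
    """L_aux: average of the single-resolution losses."""
    return float(np.mean([stft_loss(x, x_hat, *r) for r in resolutions]))

def lsgan_generator_loss(d_fake):
    """L_adv: pushes discriminator scores on generated audio toward 1."""
    return float(np.mean((1.0 - d_fake) ** 2))

def lsgan_discriminator_loss(d_real, d_fake):
    """L_D: pushes scores on real audio toward 1 and on generated audio toward 0."""
    return float(np.mean((1.0 - d_real) ** 2) + np.mean(d_fake ** 2))

def generator_objective(x, x_hat, d_fake, lambda_adv=4.0):
    """L_G = L_aux + lambda_adv * L_adv (lambda_adv default is an assumption)."""
    return multi_resolution_stft_loss(x, x_hat) + lambda_adv * lsgan_generator_loss(d_fake)

# Toy usage: a 1-second 220 Hz tone at 24 kHz and a noisy copy of it.
rng = np.random.default_rng(0)
t = np.arange(24000) / 24000.0
x = np.sin(2 * np.pi * 220.0 * t)
x_hat = x + 0.05 * rng.standard_normal(x.shape)
print("L_aux(x, x)     =", multi_resolution_stft_loss(x, x))
print("L_aux(x, x_hat) =", multi_resolution_stft_loss(x, x_hat))
```

Averaging over several resolutions means the generator cannot satisfy the loss by fitting one particular time-frequency trade-off, which is the intuition behind using more than one STFT setting.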

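Experimental Results

Research Questions

RQ1: Can a GAN-based vocoder trained without distillation be competitive with distillation-based systems in perceptual quality?
RQ2: Does the multi-resolution STFT loss improve how well a parallel waveform generator learns the time-frequency characteristics of speech?
RQ3: Is training simpler and faster than the two-stage teacher-student framework while maintaining high fidelity?
RQ4: How well does Parallel WaveGAN perform as a vocoder within a Transformer-based TTS framework?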

Key Results

System	Model	KLD-based distillation	STFT loss	Adversarial loss	Number of layers	Model size (parameters)	Inference speed (× real time)	MOS (95% CI)
System 1	WaveNet	-	-	-	24	3.81 M	0.32×10^-2	3.61 ± 0.12
System 2	ClariNet	Yes	L_s^(1)	-	60	2.78 M	14.62	3.88 ± 0.11
System 3	ClariNet	Yes	L_s^(1)+L_s^(2)+L_s^(3)	-	60	2.78 M	14.62	4.21 ± 0.09
System 4	ClariNet	Yes	L_s^(1)+L_s^(2)+L_s^(3)	Yes	60	2.78 M	14.62	4.21 ± 0.09
System 5	Parallel WaveGAN	-	L_s^(1)	Yes	30	1.44 M	28.68	1.36 ± 0.07
System 6	Parallel WaveGAN	-	L_s^(1)+L_s^(2)+L_s^(3)	Yes	30	1.44 M	28.68	4.06 ± 0.10
System 7	Recording	-	-	-	-	-	-	4.46 ± 0.08
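To make the inference-speed column concrete, the short sketch below converts the table's speed figures (multiples of real time) into wall-clock synthesis time and samples generated per second at 24 kHz. The 10-second utterance length and the helper names are assumptions chosen for illustration; only the speed values come from the table.

```python
SAMPLE_RATE_HZ = 24_000

# Inference speed from the table, as multiples of real time.
speed_x_realtime = {
    "WaveNet": 0.32e-2,
    "ClariNet": 14.62,
    "Parallel WaveGAN": 28.68,
}

def synthesis_seconds(audio_seconds, speed):
    """Wall-clock time needed to generate `audio_seconds` of audio."""
    return audio_seconds / speed

def samples_per_second(speed, sample_rate=SAMPLE_RATE_HZ):
    """Waveform samples produced per wall-clock second."""
    return speed * sample_rate

for name, speed in speed_x_realtime.items():
    print(f"{name}: {synthesis_seconds(10.0, speed):.2f} s to generate 10 s of audio, "
          f"{samples_per_second(speed):,.0f} samples/s")
```

A speed below 1 (as for the autoregressive WaveNet) means synthesis is slower than playback, while values above 1 leave headroom for real-time use.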

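Parallel WaveGAN generates 24 kHz speech 28.68 times faster than real time on a single V100 GPU with 1.44 M parameters.
In analysis/synthesis, Parallel WaveGAN with the multi-resolution STFT loss and adversarial loss (System 6) reaches a MOS of 4.06, against 4.21 for ClariNet with the multi-resolution STFT loss (System 3).
Parallel WaveGAN trains in 2.8 days, faster than WaveNet (7.4 days) and ClariNet (12.7 days).
When integrated into a Transformer TTS model, Parallel WaveGAN reaches a MOS of 4.16, competitive with ClariNet-GAN (4.14) and ClariNet (4.00).
Systems trained with the multi-resolution STFT loss outperform both their single-resolution counterparts and the autoregressive WaveNet in perceptual quality.
The adversarial loss provides a robustness benefit in Transformer-based TTS, but its benefit in standalone analysis/synthesis is less clear.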
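The comparisons above reduce to simple arithmetic, sketched below using only numbers reported in this review. The interval-overlap check is an informal reading of the 95% confidence intervals, not a significance test, and the helper names are assumptions for this example.

```python
# Training time in days, as reported above.
training_days = {"WaveNet": 7.4, "ClariNet": 12.7, "Parallel WaveGAN": 2.8}

# (MOS, 95% CI half-width) from the analysis/synthesis table.
mos = {
    "ClariNet, multi-resolution STFT": (4.21, 0.09),
    "Parallel WaveGAN, multi-resolution STFT + adversarial": (4.06, 0.10),
    "Recording": (4.46, 0.08),
}

def training_speedup(baseline, model="Parallel WaveGAN"):
    """How many times faster `model` trains than `baseline`."""
    return training_days[baseline] / training_days[model]

def intervals_overlap(a, b):
    """True if the intervals mean ± half-width of a and b intersect."""
    (mean_a, half_a), (mean_b, half_b) = a, b
    return max(mean_a - half_a, mean_b - half_b) <= min(mean_a + half_a, mean_b + half_b)

for baseline in ("WaveNet", "ClariNet"):
    print(f"Training speedup vs {baseline}: {training_speedup(baseline):.2f}x")

pwg = mos["Parallel WaveGAN, multi-resolution STFT + adversarial"]
for name in ("ClariNet, multi-resolution STFT", "Recording"):
    print(f"CI overlaps with {name}: {intervals_overlap(pwg, mos[name])}")
```

The Parallel WaveGAN and ClariNet intervals overlap, which is consistent with describing the two as comparable, whereas neither interval reaches that of the recordings.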


This review was generated by AI and reviewed by a human editor.