QUICK REVIEW

[Paper Review] CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network

Vincent Wan, Chun-an Chan | arXiv (Cornell University) | 2019. 05. 17.

Speech Recognition and Synthesis | References: 29 | Citations: 51

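One-Line Summary

CHiVE introduces a linguistically driven, dynamic hierarchical conditional variational autoencoder that generates varied prosodic features and enables prosody transfer between sentences, improving naturalness over a non-hierarchical baseline.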

ABSTRACT

The prosodic aspects of speech signals produced by current text-to-speech systems are typically averaged over training material, and as such lack the variety and liveliness found in natural speech. To avoid monotony and averaged prosody contours, it is desirable to have a way of modeling the variation in the prosodic aspects of speech, so audio signals can be synthesized in multiple ways for a given text. We present a new, hierarchically structured conditional variational autoencoder to generate prosodic features (fundamental frequency, energy and duration) suitable for use with a vocoder or a generative model like WaveNet. At inference time, an embedding representing the prosody of a sentence may be sampled from the variational layer to allow for prosodic variation. To efficiently capture the hierarchical nature of the linguistic input (words, syllables and phones), both the encoder and decoder parts of the auto-encoder are hierarchical, in line with the linguistic structure, with layers being clocked dynamically at the respective rates. We show in our experiments that our dynamic hierarchical network outperforms a non-hierarchical state-of-the-art baseline, and, additionally, that prosody transfer across sentences is possible by employing the prosody embedding of one sentence to generate the speech signal of another.
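
The idea of layers "clocked dynamically at the respective rates" can be illustrated with a toy encoder. This is a minimal sketch, not the authors' implementation: the tanh cell stands in for whatever recurrent unit the paper uses, the sizes `FRAME_DIM` and `HIDDEN` are hypothetical, and resetting the lower-level state at each unit boundary is an assumption made for simplicity. What it shows is that each level updates only when the unit below it has finished, so the number of updates per level follows the linguistic structure of the utterance.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, HIDDEN = 3, 8  # hypothetical sizes, not taken from the paper


def make_cell(input_dim, hidden_dim):
    """Weights for a toy tanh recurrent cell (a stand-in for an LSTM/GRU)."""
    return rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))


def step(weights, state, inputs):
    """One recurrent update: new_state = tanh(W [state; inputs])."""
    return np.tanh(weights @ np.concatenate([state, inputs]))


frame_cell = make_cell(FRAME_DIM, HIDDEN)  # ticks once per frame
phone_cell = make_cell(HIDDEN, HIDDEN)     # ticks once per phone
syllable_cell = make_cell(HIDDEN, HIDDEN)  # ticks once per syllable


def encode(utterance):
    """Encode a nested utterance into one summary vector.

    utterance: list of syllables; each syllable is a list of phones;
    each phone is an array of shape (n_frames, FRAME_DIM).
    Returns the final syllable-rate state and the update count per level.
    """
    counts = {"frame": 0, "phone": 0, "syllable": 0}
    syllable_state = np.zeros(HIDDEN)
    for syllable in utterance:
        phone_state = np.zeros(HIDDEN)      # assumption: reset per syllable
        for phone in syllable:
            frame_state = np.zeros(HIDDEN)  # assumption: reset per phone
            for frame in phone:
                frame_state = step(frame_cell, frame_state, frame)
                counts["frame"] += 1
            # The phone-rate layer is clocked only at the end of a phone.
            phone_state = step(phone_cell, phone_state, frame_state)
            counts["phone"] += 1
        # The syllable-rate layer is clocked only at the end of a syllable.
        syllable_state = step(syllable_cell, syllable_state, phone_state)
        counts["syllable"] += 1
    return syllable_state, counts


# Two syllables: the first has phones of 4 and 2 frames, the second one phone of 3 frames.
toy_utterance = [
    [np.ones((4, FRAME_DIM)), np.ones((2, FRAME_DIM))],
    [np.ones((3, FRAME_DIM))],
]
summary, update_counts = encode(toy_utterance)
print(summary.shape, update_counts)
```

Because the loop bounds come from the utterance itself, a sentence with longer phones or more syllables changes how often each layer updates without any change to the model, which is the sense of "dynamic" assumed here.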

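Motivation and Objectives

Argues that per-utterance prosodic variation must be modeled to avoid the loss of variety caused by averaging in TTS.
Proposes a hierarchical VAE with dynamically clocked (clockwork-style) layers aligned to the linguistic structure (words, syllables, phones).
Learns a sentence-level prosody embedding for capturing and sampling prosodic variation.
Enables transferring the prosody of a reference sentence onto different text content.
Shows that the hierarchical structure produces more natural and expressive prosody than a flat (non-hierarchical) baseline.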

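Proposed Method

Proposes CHiVE, a clockwork hierarchical conditional variational autoencoder made up of an encoder, a variational layer, and a decoder.
Uses hierarchical RNNs at the frame, phone, and syllable levels in both the encoder and the decoder to reflect the linguistic structure.
Inserts a variational layer that outputs the mean and variance of the Gaussian from which the sentence prosody embedding is sampled.
Conditions the decoder on the linguistic features and the sampled sentence prosody embedding to predict duration, F0/c0, and energy-related features.
Trains with an L2 loss on duration and F0/c0 plus a KL divergence term on the variational layer.
At inference, either samples an embedding from the prior or encodes a sentence, and optionally conditions on the linguistic features of a different sentence to transfer prosody.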
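
The variational layer, the loss, and the inference-time embedding choices described above can be sketched in a few NumPy functions. This is an illustrative sketch under stated assumptions rather than the paper's code: a standard normal prior N(0, I) and a diagonal Gaussian posterior are assumed (the usual VAE setup), `kl_weight` is a hypothetical knob, and all function names are invented for this example.

```python
import numpy as np


def sample_embedding(mean, log_var, rng):
    """Reparameterized sample: z = mean + std * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mean, log_var):
    """Closed-form KL( N(mean, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mean**2 - 1.0 - log_var)


def training_loss(predicted, target, mean, log_var, kl_weight=1.0):
    """L2 reconstruction error on the prosodic features plus a weighted KL term.

    kl_weight is a hypothetical parameter, not a value from the paper.
    """
    l2 = np.sum((predicted - target) ** 2)
    return l2 + kl_weight * kl_to_standard_normal(mean, log_var)


def choose_embedding(mode, dim, rng=None, encoder_mean=None):
    """Pick the sentence prosody embedding given to the decoder at inference."""
    if mode == "zero":     # the mean of the assumed N(0, I) prior
        return np.zeros(dim)
    if mode == "random":   # a draw from the assumed N(0, I) prior
        return rng.standard_normal(dim)
    if mode == "encoded":  # encoder mean of a reference utterance
        return np.asarray(encoder_mean, dtype=float)
    raise ValueError(f"unknown embedding mode: {mode!r}")


rng = np.random.default_rng(0)
mean, log_var = np.array([1.0, 0.0]), np.zeros(2)
z = sample_embedding(mean, log_var, rng)
print(z.shape, kl_to_standard_normal(mean, log_var))
```

Under these assumptions, prosody transfer amounts to calling `choose_embedding("encoded", ...)` with the encoder mean of a reference sentence while feeding the decoder the linguistic features of a different sentence.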

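Experimental Results

Research Questions

RQ1: Can a dynamic hierarchical VAE capture meaningful per-utterance prosodic variation for TTS?
RQ2: Does the linguistically driven clockwork hierarchy improve prosody modeling over a non-hierarchical baseline?
RQ3: Is it possible to transfer the prosody of one sentence to another using the CHiVE latent space?
RQ4: How does the embedding type (zero, encoded, random) affect prosodic quality and naturalness?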

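Key Results

CHiVE's dynamic hierarchical model is significantly preferred over the non-hierarchical baseline in an AB comparison test (baseline preferred 292 times, CHiVE preferred 438 times; p = 3.91e-8).
In the MOS test, CHiVE achieves higher naturalness than the baseline: Baseline 4.01±0.11, CHiVE zero embedding 4.07±0.10, CHiVE encoded 4.25±0.10, real speech 4.67±0.07.
Using the encoder mean embedding reduces log F0 RMSE by 21% relative to the baseline on held-out data.
Prosody transfer is demonstrated by conditioning the decoder on the prosody embedding of a different sentence, which produces changes in the log F0 contour consistent with transfer.
The zero embedding gives reasonable results but is less expressive than the encoded one, while random embeddings are more varied but tend to yield less accurate F0 contours.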
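
To get a feel for how strong a 292 versus 438 preference split is, the counts can be run through an exact sign (binomial) test. This is only a sketch: the review does not say which test produced the reported p-value, so the result is not claimed to reproduce it, and discarding any "no preference" votes is an assumption. The helper name `binomial_upper_tail` is invented for this example.

```python
from math import comb


def binomial_upper_tail(successes, trials):
    """Exact P(X >= successes) for X ~ Binomial(trials, 0.5).

    Uses integer arithmetic for the sum, so there is no rounding error
    until the final division.
    """
    favourable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favourable / 2**trials


baseline_votes, chive_votes = 292, 438  # preference counts quoted in the results
trials = baseline_votes + chive_votes   # assumption: ties are excluded

# Null hypothesis: listeners are equally likely to prefer either system.
p_one_sided = binomial_upper_tail(chive_votes, trials)
p_two_sided = min(1.0, 2 * p_one_sided)  # valid because Binomial(n, 0.5) is symmetric

print(f"preference for CHiVE: {chive_votes / trials:.1%}")
print(f"one-sided p = {p_one_sided:.3g}, two-sided p = {p_two_sided:.3g}")
```

With 60% of the 730 decided votes going to CHiVE, the null hypothesis of no preference is rejected at any conventional significance level under this test.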


This review was generated by AI and reviewed by a human editor.