QUICK REVIEW

[Paper Review] Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis

Gustav Eje Henter, Jaime Lorenzo-Trueba | arXiv (Cornell University) | 2018-07-30

Speech Recognition and Synthesis | References: 79 | Citations: 51

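One-line summary

This paper investigates unsupervised methods for learning controllable output in speech synthesis using encoder-decoder and variational autoencoder frameworks, and connects existing heuristics to probabilistic latent-variable models and VQ-VAEs. It shows that these unsupervised methods can match or surpass supervised methods in emotional speech synthesis.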

ABSTRACT

Generating versatile and appropriate synthetic speech requires control over the output expression separate from the spoken text. Important non-textual speech variation is seldom annotated, in which case output control must be learned in an unsupervised fashion. In this paper, we perform an in-depth study of methods for unsupervised learning of control in statistical speech synthesis. For example, we show that popular unsupervised training heuristics can be interpreted as variational inference in certain autoencoder models. We additionally connect these models to VQ-VAEs, another, recently-proposed class of deep variational autoencoders, which we show can be derived from a very similar mathematical argument. The implications of these new probabilistic interpretations are discussed. We illustrate the utility of the various approaches with an application to acoustic modelling for emotional speech synthesis, where the unsupervised methods for learning expression control (without access to emotional labels) are found to give results that in many aspects match or surpass the previous best supervised approach.

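Research motivation and goals

Motivate controllable speech synthesis that goes beyond text annotations by learning from unannotated variation.
Establish a probabilistic interpretation of existing unsupervised control methods.
Connect commonly used heuristics to variational autoencoders and VQ-VAEs.
Evaluate unsupervised control methods against a supervised baseline on a large emotional speech database.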

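Proposed method

Frame the control problem as a latent-variable model of speech synthesis with text input.
Derive a lower bound using variational inference and interpret the training heuristics as approximate maximum-likelihood estimation.
Show the equivalence of, or connection between, DCC-like control and the VQ-VAE framework.
Discuss how prior information can be incorporated into unsupervised control methods.
Evaluate empirically on emotional speech and compare against a supervised system.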
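
The lower bound mentioned in the second step can be written out as a worked equation. This is a sketch of the standard variational bound for a conditional latent-variable model, not necessarily the paper's exact derivation. The symbols are assumptions introduced here: x for the acoustic features, t for the text-derived input, z for the latent control variable, \theta for the decoder parameters, and q for the approximate posterior. The prior p(z) is assumed not to depend on t.

```latex
\begin{align}
\log p_\theta(x \mid t)
  &= \log \int p_\theta(x \mid z, t)\, p(z)\, \mathrm{d}z \\
  &\geq \mathbb{E}_{q(z \mid x, t)}\!\left[ \log p_\theta(x \mid z, t) \right]
     - \mathrm{KL}\!\left( q(z \mid x, t) \,\|\, p(z) \right)
\end{align}
```

Maximising the right-hand side over both \theta and q is what makes training an approximate maximum-likelihood procedure: the bound becomes tight when q equals the true posterior p_\theta(z \mid x, t).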

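Experimental results

Research questions

RQ1: Can unsupervised learning of latent control variables produce controllable speech even without emotion labels?
RQ2: How do existing unsupervised control heuristics relate to variational inference and VQ-VAE principles?
RQ3: Do unsupervised methods match or outperform supervised models in expressive (emotional) speech synthesis?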

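Key results

Unsupervised control methods can be interpreted, via a variational lower bound, as approximate maximum-likelihood estimators.
A theoretical connection exists between common encoder-decoder approaches and the VQ-VAE framework.
Prior information can be incorporated into the heuristic unsupervised methods.
Experiments on a large emotional speech database show that unsupervised methods perform on par with or better than a competitive supervised system.
In acoustic modelling for emotional speech, the unsupervised approach matches or surpasses the previous best supervised method in many aspects.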
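
The connection to VQ-VAEs rests on vector quantisation, where a continuous encoding is replaced by its nearest entry in a learned codebook. The following is a minimal NumPy sketch of that nearest-codebook lookup as used in VQ-VAEs generally. It is not the paper's implementation: the function name `quantize`, the toy codebook, and the toy encodings are all hypothetical values chosen for illustration.

```python
import numpy as np


def quantize(z_e, codebook):
    """Map each encoding to its nearest codebook vector (squared Euclidean distance).

    z_e:      array of shape (N, D), one continuous encoding per utterance (hypothetical).
    codebook: array of shape (K, D), the K discrete control vectors (hypothetical).
    Returns the chosen indices, shape (N,), and the quantised vectors, shape (N, D).
    """
    # diffs[i, k, :] = z_e[i] - codebook[k], built by broadcasting
    diffs = z_e[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)   # shape (N, K)
    indices = np.argmin(dists, axis=1)    # nearest codebook entry per encoding
    return indices, codebook[indices]


# Toy example: 3 codebook vectors in 2-D and 4 utterance-level encodings.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z_e = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.6, 0.6]])

indices, z_q = quantize(z_e, codebook)
print(indices)  # [0 1 2 1]
```

At synthesis time, control in such a setup would amount to picking a codebook index by hand instead of inferring it from audio, which is one way a discrete latent can act as an expression switch.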


This review was generated by AI and reviewed by a human editor.