QUICK REVIEW

[Paper Review] AdaSpeech: Adaptive Text to Speech for Custom Voice

Mingjian Chen, Xu Tan | arXiv (Cornell University) | 2021-03-01

Speech Recognition and Synthesis | References: 31 | Citations: 79

One-Line Summary

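AdaSpeech adapts a source TTS model to a new voice with only a small amount of adaptation data by modeling acoustic conditions at both the utterance level and the phoneme level and by using conditional layer normalization.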

ABSTRACT

Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize personal voice for a target speaker using few speech data. Custom voice presents two unique challenges for TTS adaptation: 1) to support diverse customers, the adaptation model needs to handle diverse acoustic conditions that could be very different from source speech data, and 2) to support a large number of customers, the adaptation parameters need to be small enough for each target speaker to reduce memory usage while maintaining high voice quality. In this work, we propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices. We design several techniques in AdaSpeech to address the two challenges in custom voice: 1) To handle different acoustic conditions, we use two acoustic encoders to extract an utterance-level vector and a sequence of phoneme-level vectors from the target speech during training; in inference, we extract the utterance-level vector from a reference speech and use an acoustic predictor to predict the phoneme-level vectors. 2) To better trade off the adaptation parameters and voice quality, we introduce conditional layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune this part in addition to speaker embedding for adaptation. We pre-train the source TTS model on LibriTTS datasets and fine-tune it on VCTK and LJSpeech datasets (with different acoustic conditions from LibriTTS) with few adaptation data, e.g., 20 sentences, about 1 minute speech. Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker, which demonstrates its effectiveness for custom voice. Audio samples are available at https://speechresearch.github.io/adaspeech/.

Research Motivation and Goals

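Address the challenges of custom voice: (1) handle the diverse acoustic conditions of adaptation data across different speakers and recording environments; (2) make adaptation scale to a large user base by keeping the number of per-speaker parameters small; (3) achieve high naturalness and speaker similarity from only a few adaptation samples.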

Proposed Method

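Adopts FastSpeech 2 as the backbone and introduces two acoustic condition encoders, both used during pre-training and fine-tuning: an utterance-level encoder that captures global acoustic conditions and a phoneme-level encoder that captures local acoustic conditions.
At inference, the utterance-level condition is extracted from a reference speech, while the phoneme-level conditions are predicted by a phoneme-level acoustic predictor.
Introduces conditional layer normalization in the mel-spectrogram decoder: the scale and bias vectors are generated from the speaker embedding by a small conditioning network, so only a very small number of parameters need to be fine-tuned.
During adaptation, only the conditional layer normalization parameters and the speaker embedding are fine-tuned; all other components are frozen.
The model is pre-trained on LibriTTS and fine-tuned on VCTK and LJSpeech with limited adaptation data (e.g., about 20 sentences).
Vocoder: MelGAN synthesizes the waveform from the generated mel-spectrograms.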
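
The conditional layer normalization step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Python lists, a single hidden vector, and one linear layer each for the scale and the bias, all of which are assumptions made for illustration. The names `linear`, `conditional_layer_norm`, and `precompute_speaker_vectors` are hypothetical.

```python
import math

def linear(weight, bias, x):
    """Apply y = W x + b to one vector x (weight is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def conditional_layer_norm(hidden, speaker_emb, scale_net, bias_net, eps=1e-5):
    """Normalize `hidden`, then apply a speaker-dependent scale and bias.

    scale_net and bias_net are (weight, bias) pairs; treating each as a
    single linear layer is an assumption of this sketch.
    """
    mean = sum(hidden) / len(hidden)
    var = sum((h - mean) ** 2 for h in hidden) / len(hidden)
    normed = [(h - mean) / math.sqrt(var + eps) for h in hidden]
    gamma = linear(*scale_net, speaker_emb)  # scale vector from the speaker embedding
    beta = linear(*bias_net, speaker_emb)    # bias vector from the speaker embedding
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]

def precompute_speaker_vectors(speaker_emb, scale_net, bias_net):
    """Once the speaker embedding is fixed, scale and bias can be computed once."""
    return linear(*scale_net, speaker_emb), linear(*bias_net, speaker_emb)

# Example: with zero weights, unit scale bias, and zero shift bias, the
# layer reduces to ordinary layer normalization for any speaker embedding.
scale_net = ([[0.0] * 3 for _ in range(4)], [1.0] * 4)
bias_net = ([[0.0] * 3 for _ in range(4)], [0.0] * 4)
out = conditional_layer_norm([1.0, 2.0, 3.0, 4.0], [0.5, -0.2, 0.1],
                             scale_net, bias_net)
```

In this sketch the only speaker-dependent quantities are the speaker embedding and the two small networks, which mirrors the set of parameters the review says is fine-tuned during adaptation.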

Experimental Results

Research Questions

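RQ1: How can a TTS model be adapted to a new voice under diverse acoustic conditions when adaptation data is limited?
RQ2: Does modeling acoustic conditions at multiple resolutions improve the quality of cross-domain voice adaptation?
RQ3: Does conditional layer normalization enable high-quality adaptation with far fewer parameters than full fine-tuning?
RQ4: What is the trade-off between adaptation data size and voice quality in AdaSpeech?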

Key Results

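When adapting from a LibriTTS-trained source model to VCTK and LJSpeech, AdaSpeech achieves higher adaptation quality than the baselines in average MOS and SMOS, using about 4.9K speaker-specific parameters and about 1.2M shared parameters.
It outperforms the baseline that fine-tunes only the speaker embedding, and it compares favorably with the baseline that fine-tunes the entire decoder while using far fewer adaptation parameters.
Ablations confirm the contribution of each component: removing the utterance-level acoustic condition, removing the phoneme-level acoustic condition, or dropping conditional layer normalization each degrades voice quality.
In cross-domain adaptation (LibriTTS -> LJSpeech or VCTK), the MOS/SMOS gaps are larger, which highlights the problem of acoustic mismatch between domains.
The adaptation pipeline (pre-training, fine-tuning the conditional layer normalization (CLN) parameters and the speaker embedding, and inference with predicted phoneme-level vectors) enables practical deployment with low memory overhead.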
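
As a rough sanity check on the parameter counts of this magnitude, the arithmetic can be sketched as below. The decoder shape is an assumption that this review does not state: 4 decoder blocks with 2 layer normalizations each plus 1 final layer normalization, a hidden size of 256, a 256-dimensional speaker embedding, and one linear layer with bias for each scale and each bias vector. The function name `cln_param_counts` is hypothetical, and the result is not presented as the authors' own accounting.

```python
def cln_param_counts(num_blocks=4, norms_per_block=2, final_norm=True,
                     hidden=256, emb=256):
    """Count parameters under the assumed decoder shape (all defaults are assumptions)."""
    num_norms = num_blocks * norms_per_block + (1 if final_norm else 0)
    # Fine-tuned: two linear layers (scale, bias) per norm, each with an
    # emb x hidden weight and a hidden-sized bias, plus the speaker embedding.
    finetuned = num_norms * 2 * (emb * hidden + hidden) + emb
    # Stored per speaker: one scale and one bias vector per norm, computed
    # once from the fixed speaker embedding, plus the embedding itself.
    stored = num_norms * 2 * hidden + emb
    return finetuned, stored

finetuned, stored = cln_param_counts()
print(finetuned, stored)  # 1184512 4864
```

Under these assumptions the counts come to roughly 1.2M parameters touched during fine-tuning and roughly 4.9K values kept per speaker, which is the same order as the figures quoted above.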

This review was generated by AI and reviewed by a human editor.