[Paper Review] Sample Efficient Adaptive Text-to-Speech
This paper introduces meta-learning based strategies to adapt a multi-speaker WaveNet TTS model to new speakers with very little data, achieving high naturalness and speaker similarity via three adaptation methods: embedding fine-tuning, full-model fine-tuning, and an embedding encoder approach.
We present a meta-learning approach for adaptive text-to-speech (TTS) with few data. During training, we learn a multi-speaker model using a shared conditional WaveNet core and independent learned embeddings for each speaker. The aim of training is not to produce a neural network with fixed weights, which is then deployed as a TTS system. Instead, the aim is to produce a network that requires few data at deployment time to rapidly adapt to new speakers. We introduce and benchmark three strategies: (i) learning the speaker embedding while keeping the WaveNet core fixed, (ii) fine-tuning the entire architecture with stochastic gradient descent, and (iii) predicting the speaker embedding with a trained neural network encoder. The experiments show that these approaches are successful at adapting the multi-speaker neural network to new speakers, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
Motivation & Objective
- Address the challenge of rapid, high-quality TTS adaptation to new speakers when only limited data is available.
- Develop a meta-learning framework that learns a speaker-conditional WaveNet prior rather than a fixed final model.
- Explore three adaptation strategies to tailor the model to new voices with few examples.
Proposed method
- Extend WaveNet with a learned embedding for each speaker in a large multi-speaker model.
- Three adaptation strategies: (i) SEA-Emb — fine-tune only the speaker embedding with core WaveNet fixed, (ii) SEA-All — fine-tune both embedding and full model, (iii) SEA-Enc — train an encoder to predict the new speaker embedding from demonstration data.
- Normalize the fundamental frequency (f0) feature to reduce leakage of speaker identity through pitch.
- Use two held-out adaptation datasets (LibriSpeech and VCTK) to evaluate few-shot adaptation under different data regimes.
- Compare against prior few-shot TTS methods and report both naturalness (MOS) and speaker similarity (MOS and TI-SV EER).
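The embedding-only strategy (SEA-Emb) can be illustrated with a minimal sketch. This is not the paper's implementation: a toy linear model stands in for the WaveNet core, and the names `adapt_embedding`, `core_w`, `feats`, and `targets` are hypothetical. The point it shows is only that the shared weights stay frozen while a fresh speaker embedding is fitted by gradient descent on a small adaptation set.

```python
import numpy as np

def adapt_embedding(core_w, feats, targets, emb_dim, steps=500, lr=0.1, seed=0):
    """Fit a new speaker embedding while the shared core weights stay frozen.

    Toy stand-in for SEA-Emb (an assumption, not the paper's model):
        prediction = feats @ w_feat + emb @ w_emb
    where core_w = concat(w_feat, w_emb) plays the role of the fixed core.
    """
    rng = np.random.default_rng(seed)
    emb = rng.normal(scale=0.01, size=emb_dim)  # new speaker: small random init
    n_feat = feats.shape[1]
    w_feat, w_emb = core_w[:n_feat], core_w[n_feat:]  # views only, never updated
    for _ in range(steps):
        pred = feats @ w_feat + emb @ w_emb
        err = pred - targets
        # Gradient of the mean squared error with respect to the embedding only.
        grad_emb = 2.0 * err.mean() * w_emb
        emb -= lr * grad_emb
    return emb
```

Under the same toy assumptions, a SEA-All-style variant would start from the embedding returned here and then also update `core_w`, which is where the larger risk of overfitting on small adaptation sets comes from.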
Experimental results
Research questions
- RQ1: Can a multi-speaker WaveNet trained with a shared core and per-speaker embeddings be rapidly adapted to unseen speakers with only seconds to minutes of data?
- RQ2: How do non-parametric (SEA-Emb, SEA-All) and parametric (SEA-Enc) adaptation strategies compare in terms of naturalness and speaker similarity?
- RQ3: What is the impact of adaptation data size on the quality and speaker-discriminability of generated voices?
- RQ4: Does the adapted model generalize across datasets recorded under different conditions (LibriSpeech vs. VCTK)?
Key findings
- All three adaptation approaches enable high-quality speech for new speakers using only seconds to minutes of adaptation data.
- SEA-All (full-model fine-tuning after embedding optimization) delivers the strongest performance among the three methods across datasets and data regimes.
- SEA-Emb updates far fewer parameters, so it adapts quickly and is less prone to overfitting, while SEA-All tends to give the best naturalness and speaker similarity once enough adaptation data is available.
- SEA-Enc provides a fast, transcript-independent adaptation pathway but can be biased by encoder capacity, generally performing worse on naturalness and speaker similarity than non-parametric methods in the reported settings.
- Qualitative analyses show generated voices cluster by speaker in the TI-SV embedding space and can approach real utterances in speaker verification tasks, especially on LibriSpeech with sufficient adaptation data.
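For readers unfamiliar with the equal error rate (EER) used in the speaker verification results, the following is a minimal sketch of how it can be computed from similarity scores. The function name and the score arrays are hypothetical, and the threshold sweep is a simple approximation rather than the evaluation code used in the paper.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    genuine_scores: similarity scores for same-speaker trials.
    impostor_scores: similarity scores for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(impostor >= t)  # false acceptance rate
        frr = np.mean(genuine < t)    # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```

A lower EER means the verification model separates same-speaker and different-speaker trials more cleanly. With only a handful of trials the sweep is coarse, so a real evaluation would use many trials or interpolate between thresholds.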
This review was created by AI and reviewed by human editors.