[Paper Explained] SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
SpeechT5 presents a unified encoder-decoder pre-training framework that learns cross-modal representations for both speech and text using a shared model with modal-specific pre/post-nets, enabling diverse spoken language tasks including ASR, TTS, ST, VC, SE, and SID.
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. We release our code and model at https://github.com/microsoft/SpeechT5.
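The abstract's "shared backbone plus six modal-specific pre/post-nets" layout can be sketched as a small routing table. This is a minimal illustration, not the released implementation: the net names, `TASK_MODALITIES`, `NetSelection`, and `select_nets` are hypothetical, and treating SID as a speech-to-text task (the speaker label emitted as a token) is an assumption of this sketch.

```python
from dataclasses import dataclass, astuple

# Input/output modality per task. The SID entry (speaker label produced as
# a text token) is an assumption of this sketch, not stated in the summary.
TASK_MODALITIES = {
    "asr": ("speech", "text"),
    "tts": ("text", "speech"),
    "st": ("speech", "text"),
    "vc": ("speech", "speech"),
    "se": ("speech", "speech"),
    "sid": ("speech", "text"),
}


@dataclass(frozen=True)
class NetSelection:
    """Hypothetical names of the three nets wrapped around the shared backbone."""
    encoder_prenet: str
    decoder_prenet: str
    decoder_postnet: str


def select_nets(task: str) -> NetSelection:
    """Pick pre/post-nets from the task's source and target modality."""
    src, tgt = TASK_MODALITIES[task.lower()]
    return NetSelection(
        encoder_prenet=f"{src}_encoder_prenet",
        decoder_prenet=f"{tgt}_decoder_prenet",
        decoder_postnet=f"{tgt}_decoder_postnet",
    )


def all_net_names() -> list:
    """Distinct nets needed to cover every task in TASK_MODALITIES."""
    return sorted({name for task in TASK_MODALITIES for name in astuple(select_nets(task))})


if __name__ == "__main__":
    for task in TASK_MODALITIES:
        print(task, select_nets(task))
    print(len(all_net_names()), "distinct pre/post-nets")
```

Under these assumptions, two modalities times three roles (encoder pre-net, decoder pre-net, decoder post-net) give the six nets the abstract mentions, while the encoder-decoder in the middle is shared by all tasks.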
Research Motivation and Objectives
- Motivate and design a unified encoder-decoder pre-training framework for spoken language processing that leverages both unlabeled speech and text data.
- Develop cross-modal alignment between speech and text via a shared codebook and mixed latent representations.
- Demonstrate the effectiveness of SpeechT5 across a wide range of downstream tasks through extensive experiments.
- Provide ablations to validate the contribution of the joint pre-training and cross-modal components.
Proposed Method
- Adopt a unified encoder-decoder backbone with six modal-specific pre/post-nets to handle speech/text inputs and outputs.
- Pre-train with a denoising sequence-to-sequence objective on unlabeled speech and text data, including bidirectional masked prediction and a seq2seq reconstruction loss for speech.
- Introduce cross-modal vector quantization with a shared codebook to align acoustic and textual representations and a diversity loss to encourage code usage.
- Fine-tune the encoder-decoder backbone for downstream tasks by attaching the appropriate pre/post-nets (e.g., for ASR, TTS, ST, VC, SE, SID).
- Use relative position embeddings in self-attention, a wav2vec 2.0-style speech pre-net, and a vocoder to generate waveforms for speech-output tasks.
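The cross-modal vector quantization step in the list above can be illustrated with a toy, pure-Python sketch: encoder states are mapped to their nearest entry in a shared codebook, a random subset of states is swapped for those latent units, and a diversity term rewards even code usage. Everything here is an assumption for illustration: the function names, the squared-distance logits, and the `mix_prob` parameter are hypothetical, and the paper's actual quantizer and mixing ratio are not given in this summary.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code(state, codebook):
    """Index of the codebook vector closest to `state`."""
    return min(range(len(codebook)), key=lambda k: sq_dist(state, codebook[k]))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_with_latent_units(states, codebook, mix_prob, rng):
    """Replace each state by its nearest code with probability `mix_prob`.

    Returns the mixed sequence and a per-step flag marking replaced positions.
    """
    mixed, replaced = [], []
    for state in states:
        swap = rng.random() < mix_prob
        mixed.append(list(codebook[nearest_code(state, codebook)]) if swap else list(state))
        replaced.append(swap)
    return mixed, replaced


def diversity_loss(states, codebook):
    """(1/K) * sum_k p_k * log(p_k), with p_k the time-averaged code probability.

    Code probabilities come from a softmax over negative squared distances
    (an assumption of this sketch). The value is lowest when usage is uniform.
    """
    num_codes = len(codebook)
    avg = [0.0] * num_codes
    for state in states:
        probs = softmax([-sq_dist(state, code) for code in codebook])
        for k, p in enumerate(probs):
            avg[k] += p / len(states)
    return sum(p * math.log(p) for p in avg if p > 0) / num_codes


if __name__ == "__main__":
    rng = random.Random(0)
    codebook = [[0.0, 0.0], [1.0, 1.0]]
    states = [[0.1, 0.0], [0.9, 1.0], [0.2, 0.1]]
    print(mix_with_latent_units(states, codebook, mix_prob=0.5, rng=rng))
    print(diversity_loss(states, codebook))
```

Because speech and text states would both be looked up in the same `codebook`, the replaced positions expose the decoder to modality-independent units, which is the alignment idea the method list describes. Minimizing the diversity term pushes the averaged code distribution toward uniform, discouraging collapse onto a few codes.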
Experimental Results
Research Questions
- RQ1: Can a single unified encoder-decoder model pre-trained on unlabeled speech and text effectively support a broad set of spoken language processing tasks?
- RQ2: Does cross-modal vector quantization improve alignment and performance for cross-modal tasks such as ASR and TTS?
- RQ3: What is the impact of joint speech-text pre-training vs. single-modal pre-training on downstream spoken language tasks?
- RQ4: How does SpeechT5 perform relative to state-of-the-art baselines across ASR, TTS, ST, VC, SE, and SID?
Key Findings
| Model | LM | dev-clean WER | dev-other WER | test-clean WER | test-other WER |
|---|---|---|---|---|---|
| wav2vec 2.0 Base | - | 6.1 | 13.5 | 6.1 | 13.3 |
| HuBERT Base | - | 5.5 | 13.1 | 5.8 | 13.3 |
| Baseline (w/o CTC) | - | 5.8 | 12.3 | 6.2 | 12.3 |
| Baseline | - | 4.9 | 11.7 | 5.0 | 11.9 |
| SpeechT5 (w/o CTC) | - | 5.4 | 10.7 | 5.8 | 10.7 |
| SpeechT5 | - | 4.3 | 10.3 | 4.4 | 10.4 |
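- SpeechT5 outperforms wav2vec 2.0 Base and HuBERT Base on ASR (Table 1): in the rows listed above it reaches 4.4/10.4 WER on test-clean/test-other, versus 6.1/13.3 for wav2vec 2.0 Base and 5.8/13.3 for HuBERT Base. It also achieves lower WERs than both with LM fusion.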
- SpeechT5 achieves strong TTS naturalness and MOS, with CMOS gains over a baseline model.
- SpeechT5 yields improvements on ST, e.g., EN-DE and EN-FR BLEU scores over several baselines (Table 4).
- SpeechT5 surpasses prior methods on VC and SE tasks, showing competitive Mel-Cepstral Distortion and WER metrics (Table 2, Table 5).
- On SID, SpeechT5 achieves state-of-the-art accuracy (96.49%) on VoxCeleb1 (Table 6).
- Ablation studies show that removing any pre-training component degrades ASR, VC, and SID performance, with speech pre-training and joint pre-training being especially impactful.
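To make the ASR comparison concrete, the short script below computes relative WER reductions from the rows of the table above. The WER values are copied from that table; the helper names `relative_reduction` and `reductions_vs` are hypothetical and only serve this illustration.

```python
# WER values (%) copied from the ASR table above.
WER = {
    "wav2vec 2.0 Base": {"dev-clean": 6.1, "dev-other": 13.5, "test-clean": 6.1, "test-other": 13.3},
    "HuBERT Base": {"dev-clean": 5.5, "dev-other": 13.1, "test-clean": 5.8, "test-other": 13.3},
    "Baseline": {"dev-clean": 4.9, "dev-other": 11.7, "test-clean": 5.0, "test-other": 11.9},
    "SpeechT5": {"dev-clean": 4.3, "dev-other": 10.3, "test-clean": 4.4, "test-other": 10.4},
}


def relative_reduction(reference: float, system: float) -> float:
    """Relative WER reduction of `system` over `reference`, in percent."""
    return 100.0 * (reference - system) / reference


def reductions_vs(reference_model: str, system_model: str = "SpeechT5") -> dict:
    """Per-split relative WER reduction of one table row over another."""
    ref, sys_ = WER[reference_model], WER[system_model]
    return {split: relative_reduction(ref[split], sys_[split]) for split in ref}


if __name__ == "__main__":
    for model in ("wav2vec 2.0 Base", "HuBERT Base", "Baseline"):
        summary = ", ".join(f"{k}: {v:.1f}%" for k, v in reductions_vs(model).items())
        print(f"SpeechT5 vs {model}: {summary}")
```

Against HuBERT Base, for instance, this works out to roughly a 24% relative reduction on test-clean and about 22% on test-other.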
This explainer was generated by AI and reviewed by a human editor.