
[Paper Review] SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

Junyi Ao, Rui Wang|arXiv (Cornell University)|Oct 14, 2021
Speech Recognition and Synthesis | Cited by 30
ONE-SENTENCE SUMMARY

SpeechT5 presents a unified encoder-decoder pre-training framework that learns cross-modal representations for both speech and text using a shared model with modal-specific pre/post-nets, enabling diverse spoken language tasks including ASR, TTS, ST, VC, SE, and SID.

ABSTRACT

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. We release our code and model at https://github.com/microsoft/SpeechT5.
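The data flow described in the abstract (pre-net, shared encoder-decoder, post-net) can be sketched as a routing table. This is only an illustration: the module names are hypothetical, the split of the six nets into two encoder pre-nets, two decoder pre-nets, and two decoder post-nets is an assumption about how "six modal-specific pre/post-nets" breaks down, and treating SID as speech-to-text is likewise an assumption.

```python
from dataclasses import dataclass

# (input modality, output modality) per downstream task.
# Illustrative reading of the task list in the abstract; the summary itself
# does not spell this mapping out.
TASK_MODALITIES = {
    "ASR": ("speech", "text"),
    "TTS": ("text", "speech"),
    "ST": ("speech", "text"),
    "VC": ("speech", "speech"),
    "SE": ("speech", "speech"),
    "SID": ("speech", "text"),  # assumption: speaker label emitted as a text token
}


@dataclass(frozen=True)
class Route:
    """Names of the modal-specific nets placed around the shared backbone."""
    encoder_prenet: str
    decoder_prenet: str
    decoder_postnet: str


def route_for(task: str) -> Route:
    """Pick pre/post-nets from the task's input and output modalities."""
    src, tgt = TASK_MODALITIES[task]
    return Route(
        encoder_prenet=f"{src}_encoder_prenet",
        decoder_prenet=f"{tgt}_decoder_prenet",
        decoder_postnet=f"{tgt}_decoder_postnet",
    )


def all_modules() -> list:
    """Collect every distinct modal-specific net used across all tasks."""
    modules = set()
    for task in TASK_MODALITIES:
        route = route_for(task)
        modules.update([route.encoder_prenet, route.decoder_prenet, route.decoder_postnet])
    return sorted(modules)


if __name__ == "__main__":
    for task in TASK_MODALITIES:
        print(task, route_for(task))
    print(len(all_modules()), "modal-specific nets in total")
```

Under these assumptions the six tasks share one backbone and differ only in which of the six nets are attached.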

RESEARCH MOTIVATION AND OBJECTIVES

  • Motivate and design a unified encoder-decoder pre-training framework for spoken language processing that leverages both unlabeled speech and text data.
  • Develop cross-modal alignment between speech and text via a shared codebook and mixed latent representations.
  • Demonstrate the effectiveness of SpeechT5 across a wide range of downstream tasks through extensive experiments.
  • Provide ablations to validate the contribution of the joint pre-training and cross-modal components.

PROPOSED METHOD

  • Adopt a unified encoder-decoder backbone with six modal-specific pre/post-nets to handle speech/text inputs and outputs.
  • Pre-train with a denoising sequence-to-sequence objective on unlabeled speech and text data, including bidirectional masked prediction and a seq2seq reconstruction loss for speech.
  • Introduce cross-modal vector quantization with a shared codebook to align acoustic and textual representations and a diversity loss to encourage code usage.
  • Fine-tune the encoder-decoder backbone for downstream tasks by attaching the appropriate pre/post-nets (e.g., for ASR, TTS, ST, VC, SE, SID).
  • Use relative position embeddings in self-attention, a wav2vec 2.0-style speech pre-net, and a vocoder for waveform generation in speech generation tasks.
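The cross-modal vector quantization step listed above can be sketched in a few lines. This is a simplified illustration, not the released implementation. It makes three assumptions: nearest-neighbour lookup by squared L2 distance, a hypothetical `mix_prob` parameter for the random replacement rate, and an entropy-style diversity term computed from hard code assignments. A trainable version would need a differentiable estimate of code usage.

```python
import math
import random


def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec in squared L2 distance."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
    return dists.index(min(dists))


def quantize_and_mix(states, codebook, mix_prob, rng):
    """Replace each encoder state with its nearest code with probability mix_prob.

    Speech and text states go through the same codebook, so both modalities
    are pulled toward a shared set of latent units.
    """
    indices = [nearest_code(state, codebook) for state in states]
    mixed = [
        list(codebook[idx]) if rng.random() < mix_prob else list(state)
        for state, idx in zip(states, indices)
    ]
    return mixed, indices


def diversity_loss(indices, num_codes):
    """(1/K) * sum_k p_k log p_k over code usage frequencies p_k.

    The value is 0 when a single code is always chosen and reaches its
    minimum when all codes are used equally often.
    """
    counts = [0] * num_codes
    for idx in indices:
        counts[idx] += 1
    total = len(indices)
    return sum((c / total) * math.log(c / total) for c in counts if c) / num_codes


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0]]          # toy shared codebook, K = 2
    speech_states = [[0.1, -0.1], [0.9, 1.2]]    # toy encoder outputs for speech
    text_states = [[0.2, 0.1], [1.1, 0.8]]       # toy encoder outputs for text
    rng = random.Random(0)
    mixed, idx = quantize_and_mix(speech_states + text_states, codebook, 0.5, rng)
    print(idx, mixed, diversity_loss(idx, len(codebook)))
```

In this toy setup the speech and text states that lie near the same code are assigned to the same latent unit, which is the alignment effect the shared codebook is meant to encourage.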

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

  • RQ1: Can a single unified encoder-decoder model pre-trained on unlabeled speech and text effectively support a broad set of spoken language processing tasks?
  • RQ2: Does cross-modal vector quantization improve alignment and performance for cross-modal tasks such as ASR and TTS?
  • RQ3: What is the impact of joint speech-text pre-training vs. single-modal pre-training on downstream spoken language tasks?
  • RQ4: How does SpeechT5 perform relative to state-of-the-art baselines across ASR, TTS, ST, VC, SE, and SID?

KEY FINDINGS

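| Model | LM | dev-clean WER | dev-other WER | test-clean WER | test-other WER |
|---|---|---|---|---|---|
| wav2vec 2.0 Base | - | 6.1 | 13.5 | 6.1 | 13.3 |
| HuBERT Base | - | 5.5 | 13.1 | 5.8 | 13.3 |
| Baseline (w/o CTC) | - | 5.8 | 12.3 | 6.2 | 12.3 |
| Baseline | - | 4.9 | 11.7 | 5.0 | 11.9 |
| SpeechT5 (w/o CTC) | - | 5.4 | 10.7 | 5.8 | 10.7 |
| SpeechT5 | - | 4.3 | 10.3 | 4.4 | 10.4 |
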
  • SpeechT5 achieves lower WERs than wav2vec 2.0 Base and HuBERT Base on ASR across all four evaluation sets, and the advantage is also reported with LM fusion (Table 1).
  • SpeechT5 achieves strong TTS naturalness and MOS, with CMOS gains over a baseline model.
  • SpeechT5 yields improvements on ST, e.g., EN-DE and EN-FR BLEU scores over several baselines (Table 4).
  • SpeechT5 surpasses prior methods on VC and SE tasks, showing competitive Mel-Cepstral Distortion and WER metrics (Table 2, Table 5).
  • On SID, SpeechT5 achieves state-of-the-art accuracy (96.49%) on VoxCeleb1 (Table 6).
  • Ablation studies show that removing any pre-training component degrades ASR, VC, and SID performance, with speech pre-training and joint pre-training being especially impactful.
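To put the ASR rows of Table 1 in perspective, the absolute WERs can be converted into relative reductions. The snippet below is a small worked calculation using only numbers quoted in the table above. The helper names are hypothetical, and the percentages are derived here rather than reported in the paper.

```python
# WERs (%) copied from the Table 1 rows quoted in this review.
WER = {
    "wav2vec 2.0 Base": {"dev-clean": 6.1, "dev-other": 13.5, "test-clean": 6.1, "test-other": 13.3},
    "HuBERT Base": {"dev-clean": 5.5, "dev-other": 13.1, "test-clean": 5.8, "test-other": 13.3},
    "Baseline": {"dev-clean": 4.9, "dev-other": 11.7, "test-clean": 5.0, "test-other": 11.9},
    "SpeechT5": {"dev-clean": 4.3, "dev-other": 10.3, "test-clean": 4.4, "test-other": 10.4},
}


def relative_reduction(reference_wer: float, system_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (reference - system) / reference."""
    return 100.0 * (reference_wer - system_wer) / reference_wer


def reductions_vs(reference: str, system: str = "SpeechT5") -> dict:
    """Per-split relative WER reduction of `system` over `reference`."""
    return {
        split: relative_reduction(WER[reference][split], WER[system][split])
        for split in WER[system]
    }


if __name__ == "__main__":
    for reference in ("wav2vec 2.0 Base", "HuBERT Base", "Baseline"):
        formatted = {k: f"{v:.1f}%" for k, v in reductions_vs(reference).items()}
        print(f"SpeechT5 vs {reference}: {formatted}")
```

On test-clean, for example, going from 5.8 (HuBERT Base) to 4.4 is a relative reduction of about 24%, while the gain over the non-pre-trained Baseline (5.0 to 4.4) is 12%.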


This review was generated by AI and reviewed by human editors.