QUICK REVIEW

[Paper Review] SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

Junyi Ao, Rui Wang|arXiv (Cornell University)|Oct 14, 2021

Speech Recognition and Synthesis | Cited by 30

One-Sentence Summary

SpeechT5 presents a unified encoder-decoder pre-training framework that learns cross-modal representations for both speech and text using a shared model with modal-specific pre/post-nets, enabling diverse spoken language tasks including ASR, TTS, ST, VC, SE, and SID.

ABSTRACT

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. We release our code and model at https://github.com/microsoft/SpeechT5.
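
The pre-net / shared backbone / post-net layout described in the abstract can be pictured as a small routing table. The sketch below is illustrative only: `TaskRoute`, `TASK_ROUTES`, and `nets_for` are hypothetical names, and the per-task routing (in particular, treating SID as a speech-to-text problem whose output is a speaker label) is an assumption rather than something stated in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRoute:
    encoder_input: str   # modality handled by the encoder pre-net
    decoder_output: str  # modality handled by the decoder pre-net and post-net


# Assumed routing for the six downstream tasks named in the abstract.
# SID is assumed to emit its speaker label on the text side.
TASK_ROUTES = {
    "ASR": TaskRoute("speech", "text"),
    "TTS": TaskRoute("text", "speech"),
    "ST": TaskRoute("speech", "text"),
    "VC": TaskRoute("speech", "speech"),
    "SE": TaskRoute("speech", "speech"),
    "SID": TaskRoute("speech", "text"),
}


def nets_for(task):
    """Return the three modal-specific nets wrapped around the shared backbone."""
    route = TASK_ROUTES[task]
    return [
        f"{route.encoder_input}-encoder pre-net",
        f"{route.decoder_output}-decoder pre-net",
        f"{route.decoder_output}-decoder post-net",
    ]


def all_nets():
    """Collect every distinct pre/post-net used by at least one task."""
    return sorted({net for task in TASK_ROUTES for net in nets_for(task)})


if __name__ == "__main__":
    for task in TASK_ROUTES:
        print(task, "->", nets_for(task))
    print(len(all_nets()), "distinct pre/post-nets")
```

Under these assumptions the union over all tasks comes to six distinct nets, which is consistent with the "six modal-specific (speech/text) pre/post-nets" in the abstract; only the thin input and output layers change per task while the encoder-decoder is shared.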

Research Motivation and Objectives

Design a unified encoder-decoder pre-training framework for spoken language processing that leverages both unlabeled speech and unlabeled text data.
Develop cross-modal alignment between speech and text via a shared codebook and mixed latent representations.
Demonstrate the effectiveness of SpeechT5 across a wide range of downstream tasks through extensive experiments.
Provide ablations to validate the contribution of the joint pre-training and cross-modal components.

Proposed Method

Adopt a unified encoder-decoder backbone with six modal-specific pre/post-nets to handle speech/text inputs and outputs.
Pre-train with a denoising sequence-to-sequence objective on unlabeled speech and text data, including bidirectional masked prediction and a seq2seq reconstruction loss for speech.
Introduce cross-modal vector quantization with a shared codebook to align acoustic and textual representations and a diversity loss to encourage code usage.
Fine-tune the encoder-decoder backbone for downstream tasks by attaching the appropriate pre/post-nets (e.g., for ASR, TTS, ST, VC, SE, SID).
Use relative position embeddings in self-attention, a wav2vec 2.0-like speech pre-net, and a vocoder to produce waveforms in speech generation tasks.
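
The cross-modal vector quantization step in the list above can be sketched in a few lines. This is a minimal, dependency-free illustration under stated assumptions, not the authors' implementation: the function names, the `mix_prob` parameter, and the toy codebook are hypothetical, nearest-neighbour lookup uses squared L2 distance, and the diversity term is computed from hard code assignments as a simplification.

```python
import math
import random


def nearest_code(state, codebook):
    """Index of the codebook vector closest to `state` (squared L2 distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(state, codebook[k]))


def mix_with_latent_units(states, codebook, mix_prob, rng):
    """Randomly swap encoder states for their quantized codebook vectors.

    Speech and text states share one codebook, so both modalities are
    pulled toward the same set of latent units.
    """
    mixed, chosen = [], []
    for state in states:
        k = nearest_code(state, codebook)
        chosen.append(k)
        # Assumption: each position is replaced independently with prob. mix_prob.
        mixed.append(list(codebook[k]) if rng.random() < mix_prob else list(state))
    return mixed, chosen


def diversity_loss(chosen, num_codes):
    """Negative entropy of code usage divided by codebook size.

    The value is lowest when all codes are used equally often, so minimizing
    it discourages collapse onto a few codes.
    """
    counts = [0] * num_codes
    for k in chosen:
        counts[k] += 1
    probs = [c / len(chosen) for c in counts]
    return sum(p * math.log(p) for p in probs if p > 0) / num_codes


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0]]                            # toy shared codebook
    states = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.1], [1.1, 0.8]]     # toy encoder states
    mixed, chosen = mix_with_latent_units(states, codebook, 0.5, random.Random(0))
    print("codes:", chosen)
    print("mixed:", mixed)
    print("diversity loss:", diversity_loss(chosen, len(codebook)))
```

In a real model the codebook would be learned and the usage probabilities would typically come from a differentiable (soft) assignment; the hard counts here only show why a diversity term rewards spreading states across codes.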

Experimental Results

Research Questions

RQ1: Can a single unified encoder-decoder model pre-trained on unlabeled speech and text effectively support a broad set of spoken language processing tasks?
RQ2: Does cross-modal vector quantization improve alignment and performance for cross-modal tasks such as ASR and TTS?
RQ3: What is the impact of joint speech-text pre-training vs. single-modal pre-training on downstream spoken language tasks?
RQ4: How does SpeechT5 perform relative to state-of-the-art baselines across ASR, TTS, ST, VC, SE, and SID?

Key Findings

Model	LM	dev-clean WER	dev-other WER	test-clean WER	test-other WER
wav2vec 2.0 Base	-	6.1	13.5	6.1	13.3
HuBERT Base	-	5.5	13.1	5.8	13.3
Baseline (w/o CTC)	-	5.8	12.3	6.2	12.3
Baseline	-	4.9	11.7	5.0	11.9
SpeechT5 (w/o CTC)	-	5.4	10.7	5.8	10.7
SpeechT5	-	4.3	10.3	4.4	10.4
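
To make the gaps in the table easier to read, the snippet below computes relative WER reductions from the numbers above. The figures it produces are derived here by plain arithmetic and are not values reported in the paper; `relative_reduction` and `compare` are hypothetical helper names.

```python
# WERs (%) copied from the table above, in the order of SPLITS.
SPLITS = ("dev-clean", "dev-other", "test-clean", "test-other")
WER = {
    "wav2vec 2.0 Base": (6.1, 13.5, 6.1, 13.3),
    "HuBERT Base": (5.5, 13.1, 5.8, 13.3),
    "Baseline": (4.9, 11.7, 5.0, 11.9),
    "SpeechT5": (4.3, 10.3, 4.4, 10.4),
}


def relative_reduction(baseline, system):
    """Relative WER reduction of `system` over `baseline`, in percent."""
    return 100.0 * (baseline - system) / baseline


def compare(baseline_name, system_name="SpeechT5"):
    """Per-split relative WER reduction, rounded to one decimal place."""
    return {
        split: round(relative_reduction(b, s), 1)
        for split, b, s in zip(SPLITS, WER[baseline_name], WER[system_name])
    }


if __name__ == "__main__":
    for name in ("wav2vec 2.0 Base", "HuBERT Base", "Baseline"):
        print(name, compare(name))
```

For example, against HuBERT Base the reduction on test-clean is (5.8 - 4.4) / 5.8, or roughly 24%, and the reductions on the other three splits are each a little over 21%.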

SpeechT5 outperforms wav2vec 2.0 Base and HuBERT Base on ASR, with lower WERs on all four LibriSpeech splits in the table above (e.g., 4.4 vs. 6.1 and 5.8 on test-clean); lower WERs are also reported with LM fusion (Table 1).
SpeechT5 achieves strong TTS naturalness and MOS, with CMOS gains over a baseline model.
SpeechT5 yields improvements on ST, e.g., EN-DE and EN-FR BLEU scores over several baselines (Table 4).
SpeechT5 surpasses prior methods on VC and SE tasks, showing competitive Mel-Cepstral Distortion and WER metrics (Table 2, Table 5).
On SID, SpeechT5 achieves state-of-the-art accuracy (96.49%) on VoxCeleb1 (Table 6).
Ablation studies show that removing any pre-training component degrades ASR, VC, and SID performance, with speech pre-training and joint pre-training being especially impactful.

This review was generated by AI and reviewed by human editors.