QUICK REVIEW

[Paper Review] SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

Junyi Ao, Rui Wang|arXiv (Cornell University)|Oct 14, 2021

Speech Recognition and Synthesis|Cited by 30

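One-Line Summary

SpeechT5 presents a unified-modal encoder-decoder pre-training framework: a shared model with modality-specific pre/post-nets learns cross-modal speech and text representations and supports a range of spoken language tasks, including automatic speech recognition (ASR), speech synthesis (TTS), speech translation (ST), voice conversion (VC), speech enhancement (SE), and speaker identification (SID).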

ABSTRACT

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. We release our code and model at https://github.com/microsoft/SpeechT5.
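
The architecture in the abstract (one shared encoder-decoder plus six modality-specific pre/post-nets) can be pictured as a routing table from task to nets. The sketch below is illustrative only: the net names, the `select_nets` helper, and the per-task modality assignments are assumptions made for this example and are not taken from the released code. In particular, treating the SID output as a text-side label is an assumption.

```python
# Hypothetical routing table: (input modality, output modality) per task.
TASK_MODALITIES = {
    "ASR": ("speech", "text"),
    "TTS": ("text", "speech"),
    "ST": ("speech", "text"),
    "VC": ("speech", "speech"),
    "SE": ("speech", "speech"),
    "SID": ("speech", "text"),  # assumption: speaker label emitted on the text side
}


def select_nets(task):
    """Pick the pre/post-nets to attach around the shared encoder-decoder."""
    src, tgt = TASK_MODALITIES[task]
    return {
        "encoder_pre_net": f"{src}_encoder_pre_net",
        "decoder_pre_net": f"{tgt}_decoder_pre_net",
        "decoder_post_net": f"{tgt}_decoder_post_net",
    }


def all_nets():
    """Collect every distinct net used across the tasks."""
    nets = set()
    for task in TASK_MODALITIES:
        nets.update(select_nets(task).values())
    return sorted(nets)


if __name__ == "__main__":
    for task in TASK_MODALITIES:
        print(task, select_nets(task))
    print(len(all_nets()), "distinct pre/post-nets")
```

Under these assumptions, two encoder pre-nets, two decoder pre-nets, and two decoder post-nets cover all six tasks, which matches the count of six nets stated in the abstract.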

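Motivation and Objectives

Motivate and design a unified encoder-decoder pre-training framework for spoken language processing that leverages both unlabeled speech and unlabeled text data.
Develop cross-modal alignment between speech and text through a shared codebook and mixed latent representations.
Demonstrate the effectiveness of SpeechT5 through extensive experiments.
Provide ablations that verify the contributions of joint pre-training and of the cross-modal components.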

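Proposed Method

Adopt a unified encoder-decoder backbone with six modality-specific pre/post-nets to handle speech and text inputs and outputs.
Pre-train on unlabeled speech and text data with denoising sequence-to-sequence objectives, including bidirectional masked prediction and a seq2seq reconstruction loss for speech.
Introduce cross-modal vector quantization with a shared codebook to align acoustic and textual representations, together with a diversity loss that encourages code usage.
Fine-tune the encoder-decoder backbone for downstream tasks by attaching the appropriate pre/post-nets (e.g., ASR, TTS, ST, VC, SE, SID).
Use relative position embeddings in self-attention, a wav2vec 2.0-style speech pre-net, and a vocoder for waveform generation in generation tasks.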
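
The cross-modal vector quantization step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. It assumes nearest-neighbour lookup by squared L2 distance, a per-state replacement probability `mix_prob` (the default of 0.1 is illustrative), and a diversity term of the form (1/K) Σ p_k log p_k in which p_k is the average soft assignment to code k. The softmax over negative distances used to obtain p_k, and all function names, are assumptions of this sketch.

```python
import math
import random


def nearest_code(state, codebook):
    """Return (index, vector) of the codebook entry closest to `state` (squared L2)."""
    dists = [sum((s - c) ** 2 for s, c in zip(state, code)) for code in codebook]
    idx = min(range(len(codebook)), key=dists.__getitem__)
    return idx, codebook[idx]


def mix_with_latent_units(states, codebook, mix_prob=0.1, rng=None):
    """Replace each encoder state by its quantized code with probability `mix_prob`."""
    rng = rng or random.Random(0)
    mixed = []
    for state in states:
        if rng.random() < mix_prob:
            _, code = nearest_code(state, codebook)
            mixed.append(list(code))
        else:
            mixed.append(list(state))
    return mixed


def diversity_loss(states, codebook):
    """Compute (1/K) * sum_k p_k * log(p_k) over the average soft code assignment.

    Soft assignments come from a softmax over negative squared distances
    (an assumption of this sketch). The value is lowest when all codes are
    used equally, so minimizing it encourages broad codebook usage.
    """
    num_codes = len(codebook)
    avg = [0.0] * num_codes
    for state in states:
        logits = [-sum((s - c) ** 2 for s, c in zip(state, code)) for code in codebook]
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        for k in range(num_codes):
            avg[k] += exps[k] / total / len(states)
    return sum(p * math.log(p) for p in avg if p > 0) / num_codes


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0]]          # shared by speech and text states
    states = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.7]]
    print(mix_with_latent_units(states, codebook, mix_prob=0.5))
    print(diversity_loss(states, codebook))
```

Because speech and text states are quantized against the same codebook, the mixed sequence passed to the decoder contains entries from a space common to both modalities, which is the alignment mechanism the method description refers to.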

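Experimental Results

Research Questions

RQ1: Can a single unified encoder-decoder model pre-trained on unlabeled speech and text effectively support a wide range of spoken language processing tasks?
RQ2: Does cross-modal vector quantization improve alignment and performance on cross-modal tasks such as ASR and TTS?
RQ3: How much do joint speech-text pre-training and single-modality pre-training each affect downstream tasks?
RQ4: How does SpeechT5 compare with state-of-the-art baselines on ASR, TTS, ST, VC, SE, and SID?

Key Findings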

Model	LM	dev-clean WER	dev-other WER	test-clean WER	test-other WER
wav2vec 2.0 Base	-	6.1	13.5	6.1	13.3
HuBERT Base	-	5.5	13.1	5.8	13.3
Baseline (w/o CTC)	-	5.8	12.3	6.2	12.3
Baseline	-	4.9	11.7	5.0	11.9
SpeechT5 (w/o CTC)	-	5.4	10.7	5.8	10.7
SpeechT5	-	4.3	10.3	4.4	10.4
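
To put the gaps in the table in proportion, the relative WER reduction can be computed directly from the reported numbers. The snippet below is a small worked calculation. It uses only values copied from the last two columns of the table (test-clean and test-other), and the helper name `relative_reduction` is introduced here for illustration.

```python
# WER values (%) copied from the table above: (test-clean, test-other).
WER = {
    "wav2vec 2.0 Base": (6.1, 13.3),
    "HuBERT Base": (5.8, 13.3),
    "Baseline": (5.0, 11.9),
    "SpeechT5": (4.4, 10.4),
}


def relative_reduction(reference, system):
    """Relative WER reduction of `system` against `reference`, in percent."""
    return 100.0 * (reference - system) / reference


if __name__ == "__main__":
    clean, other = WER["SpeechT5"]
    for name, (ref_clean, ref_other) in WER.items():
        if name == "SpeechT5":
            continue
        print(
            f"vs {name}: "
            f"test-clean {relative_reduction(ref_clean, clean):.1f}%, "
            f"test-other {relative_reduction(ref_other, other):.1f}%"
        )
```

Relative reduction is a convenient summary here because the absolute WER levels on the clean and other sets differ by roughly a factor of two.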

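SpeechT5 outperforms wav2vec 2.0 Base and HuBERT Base on ASR with LM fusion, achieving lower WER in the reported experiments (Table 1).
SpeechT5 achieves higher naturalness and MOS on TTS, with a CMOS gain over the baseline model.
SpeechT5 improves ST, with EN-DE and EN-FR BLEU scores above several baselines (Table 4).
SpeechT5 outperforms prior methods on the VC and SE tasks and is competitive on the Mel-Cepstral Distortion and WER metrics (Table 2, Table 5).
On SID, it achieves state-of-the-art accuracy (96.49%) on VoxCeleb1 (Table 6).
Ablation studies show that removing any pre-training component degrades ASR, VC, and SID performance, with speech pre-training and joint pre-training having the largest impact.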

This review was generated by AI and checked by a human editor.