QUICK REVIEW

[Paper Review] SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training

Ankur Bapna, Yu-An Chung | arXiv (Cornell University) | Oct 20, 2021

Topic Modeling | References: 60 | Citations: 50

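One-Line Summary

SLAM trains a single encoder that is jointly pre-trained on both speech and text by combining self-supervised learning with alignment losses, aiming to improve speech translation while examining cross-modal interference and capacity constraints.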

ABSTRACT

Unsupervised pre-training is now the predominant approach for both text and speech understanding. Self-attention models pre-trained on large amounts of unannotated data have been hugely successful when fine-tuned on downstream tasks from a variety of domains and languages. This paper takes the universality of unsupervised language pre-training one step further, by unifying speech and text pre-training within a single model. We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech. To further align our model representations across modalities, we leverage alignment losses, specifically Translation Language Modeling (TLM) and Speech Text Matching (STM) that make use of supervised speech-text recognition data. We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST 2 speech translation, by around 1 BLEU compared to single-modality pre-trained models, while retaining close to SotA performance on LibriSpeech and SpeechStew ASR tasks. On four GLUE tasks and text-normalization, we observe evidence of capacity limitations and interference between the two modalities, leading to degraded performance compared to an equivalent text-only model, while still being competitive with BERT. Through extensive empirical analysis we also demonstrate the importance of the choice of objective function for speech pre-training, and the beneficial effect of adding additional supervised signals on the quality of the learned representations.

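Motivation and Objectives

Motivate universal self-supervised pre-training that spans modalities (speech and text).
Investigate whether a single encoder can learn strong representations for both modalities.
Evaluate how alignment losses (TLM and STM) affect cross-modal transfer and downstream tasks.
Characterize the interference and capacity limits that arise when two high-resource modalities are modeled simultaneously.
Provide design guidelines and empirical insights for multimodal pre-training.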

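Proposed Method

Proposes a single Conformer-based architecture comprising a speech encoder, a text encoder, and a shared multimodal encoder.
Pre-trains with four objectives: SpanBERT (text MLM), w2v-BERT (speech), Translation Language Modeling (TLM) on paired data, and Speech-Text Matching (STM) on paired/non-paired data.
Adopts multi-stage pre-training: self-supervised learning on unpaired data comes first, after which alignment losses are added and training continues on both unpaired and paired data.
Uses aggressive masking on paired data to encourage cross-modal feature learning.
Fine-tunes on downstream tasks (speech translation on CoVoST 2; ASR on LibriSpeech and SpeechStew; GLUE tasks) and analyzes capacity and interference.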
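
The multi-stage schedule described above can be pictured as a switch on which objectives contribute to the loss at a given step. The following is only a minimal sketch: the objective names mirror the four objectives listed in this review, while the function names, the `stage1_steps` boundary, and the equal weighting of losses are hypothetical choices made for illustration and are not taken from the paper.

```python
from typing import Dict, List

# Objective names follow the review. The stage boundary and the equal
# weighting below are illustrative assumptions, not values from the paper.
STAGE1_OBJECTIVES = ["text_mlm", "w2v_bert"]          # self-supervised, unpaired data
STAGE2_OBJECTIVES = STAGE1_OBJECTIVES + ["tlm", "stm"]  # alignment losses added


def active_objectives(step: int, stage1_steps: int) -> List[str]:
    """Return the objectives that contribute to the loss at this step."""
    if step < stage1_steps:
        return list(STAGE1_OBJECTIVES)
    return list(STAGE2_OBJECTIVES)


def total_loss(losses: Dict[str, float], step: int, stage1_steps: int) -> float:
    """Sum the per-objective losses active at this step (equal weights assumed)."""
    return sum(losses[name] for name in active_objectives(step, stage1_steps))


# Hypothetical per-objective loss values, used only to show the switch.
example_losses = {"text_mlm": 2.0, "w2v_bert": 3.0, "tlm": 1.5, "stm": 0.5}
stage1_total = total_loss(example_losses, step=0, stage1_steps=1000)
stage2_total = total_loss(example_losses, step=1000, stage1_steps=1000)
```

In this sketch the alignment losses are simply absent from the sum during the first stage and join it afterwards; a real training loop would also need to decide how paired and unpaired batches are mixed, which this example does not attempt to model.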

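Experimental Results

Research Questions

RQ1: Can a single encoder learn effective representations for both speech and text when pre-trained jointly?
RQ2: Do alignment losses (TLM and STM) improve cross-modal alignment and downstream performance over self-supervised learning alone?
RQ3: What are the benefits and limitations (interference, capacity) of multimodal pre-training on speech translation, ASR, and text understanding tasks?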

Key Findings

#	Model	# Params	Text data	En-De	En-Ca	En-Ar	En-Tr	Avg
1	wav2vec-2.0 (Wang et al., 2021b)	300M	-	23.8	32.4	17.4	15.4	22.3
2	wav2vec-2.0 + LM (Wang et al., 2021b)	-	-	24.9	34.0	18.0	16.7	23.4
3	w2v-conformer	600M	-	27.1	33.1	18.8	15.6	23.7
4	w2v-bert	600M	-	27.4	33.9	19.0	15.9	24.1
5	w2v-conformer + bert	600M	mC4-En	25.4	30.5	18.5	15.2	22.4
6	w2v-bert + bert (SLAM)	600M	mC4-En	26.9	33.1	18.1	16.1	23.5
7	SLAM-TLM	600M	mC4-En	27.5	33.4	18.9	16.6	24.1
8	SLAM-TLM-STM	600M	mC4-En	27.2	33.3	18.5	16.8	24.0
9	SLAM-TLM-STM → w2v-bert	600M	mC4-En	27.1	34.2	21.2	17.5	25.0
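
As a quick sanity check on the table above, the reported averages can be recomputed from the four per-language scores, and the gain of the best SLAM variant over the speech-only w2v-bert baseline can be read off directly. This is a small sketch: the scores are copied from the table, while the helper names and the 0.06 tolerance (chosen to allow for rounding to one decimal place) are assumptions made for this example.

```python
# Scores copied from the table: (En-De, En-Ca, En-Ar, En-Tr, reported average).
ROWS = {
    "wav2vec-2.0": (23.8, 32.4, 17.4, 15.4, 22.3),
    "wav2vec-2.0 + LM": (24.9, 34.0, 18.0, 16.7, 23.4),
    "w2v-conformer": (27.1, 33.1, 18.8, 15.6, 23.7),
    "w2v-bert": (27.4, 33.9, 19.0, 15.9, 24.1),
    "w2v-conformer + bert": (25.4, 30.5, 18.5, 15.2, 22.4),
    "w2v-bert + bert (SLAM)": (26.9, 33.1, 18.1, 16.1, 23.5),
    "SLAM-TLM": (27.5, 33.4, 18.9, 16.6, 24.1),
    "SLAM-TLM-STM": (27.2, 33.3, 18.5, 16.8, 24.0),
    "SLAM-TLM-STM -> w2v-bert": (27.1, 34.2, 21.2, 17.5, 25.0),
}


def mean_score(scores):
    """Arithmetic mean of the per-language scores."""
    return sum(scores) / len(scores)


def check_averages(rows, tol=0.06):
    """Return the names of rows whose reported average is more than tol
    away from the mean recomputed from the four language pairs."""
    bad = []
    for name, (*per_lang, reported) in rows.items():
        if abs(mean_score(per_lang) - reported) > tol:
            bad.append(name)
    return bad


mismatches = check_averages(ROWS)

# Gain of the last row over the speech-only w2v-bert baseline (row 4).
gain = mean_score(ROWS["SLAM-TLM-STM -> w2v-bert"][:4]) - mean_score(ROWS["w2v-bert"][:4])
```

Under these assumptions, every reported average agrees with the recomputed mean to within rounding, and the recomputed gain of the last row over w2v-bert is a little under one point, which is consistent with the "around 1 BLEU" improvement stated in the abstract.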

Joint SLAM pre-training yields improvements on CoVoST 2 speech translation (~1 BLEU) over single-modality pre-training.
SLAM achieves competitive performance on LibriSpeech ASR and SpeechStew ASR tasks compared to state-of-the-art mono-modal models.
In GLUE tasks and text normalization, cross-modal interference reduces performance relative to text-only models, indicating capacity limits when modeling two high-resource modalities.
Alignment losses (TLM and STM) improve cross-modal representation alignment and can bridge much of the performance gap caused by interference.
Continuing pre-training on speech data after joint multimodal pre-training yields additional gains for speech translation, demonstrating cross-modal transfer benefits.
Text performance stays competitive with BERT but trails an equivalent text-only model, which points to capacity constraints in a unified model.

This review was generated by AI and checked by a human editor.