QUICK REVIEW

[Paper Review] Bridging the Modality Gap for Speech-to-Text Translation

Yuchen Liu, Junnan Zhu | arXiv (Cornell University) | Oct 28, 2020

Speech Recognition and Synthesis | References: 38 | Citations: 39

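One-Line Summary

The STAST model decouples the speech translation encoder, introduces a shrink mechanism, integrates a text-based MT model into a shared latent space, and applies cross-modal adaptation to bridge the speech-text modality gap, achieving state-of-the-art results on English-French and English-German ST tasks.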

ABSTRACT

End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way. Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously, which ignores the speech-and-text modality differences and makes the encoder overloaded, leading to great difficulty in learning such a model. To address these issues, we propose a Speech-to-Text Adaptation for Speech Translation (STAST) model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text. Specifically, we decouple the speech translation encoder into three parts and introduce a shrink mechanism to match the length of speech representation with that of the corresponding text transcription. To obtain better semantic representation, we completely integrate a text-based translation model into the STAST so that two tasks can be trained in the same latent space. Furthermore, we introduce a cross-modal adaptation method to close the distance between speech and text representation. Experimental results on English-French and English-German speech translation corpora have shown that our model significantly outperforms strong baselines, and achieves the new state-of-the-art performance.

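Motivation and Objectives

Improve end-to-end speech-to-text translation by addressing the modality gap between speech and text.
Decouple the speech translation encoder into specialized components to lighten its learning burden.
Leverage a text-based MT model inside the ST model to align semantic representations.
Narrow the gap between speech and text representations with cross-modal adaptation.
Demonstrate state-of-the-art performance on English-French and English-German ST benchmarks.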

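Proposed Method

Decouple the ST encoder into three parts: an acoustic encoder, a shrink mechanism, and a semantic encoder.
Predict the source-language transcription with a CTC module on top of the acoustic encoder, and apply the shrink mechanism so that the length of the speech representation matches the length of the transcription.
Integrate a complete MT model by sharing the semantic encoder and decoder and by aligning the CTC, embedding, and MT output spaces.
Apply cross-modal adaptation, using a sequence-level or word-level MSE-based loss, to align the speech and text representations.
Train in a multi-task setting on a combination of the CTC loss, ST loss, MT loss, and cross-modal adaptation loss.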
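
The shrink and adaptation steps above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the summary does not spell out the exact shrinking rule, so the code assumes that frames labelled blank by the CTC module are dropped and that runs of the same label are averaged. The pooling used for the sequence-level loss, the blank index `BLANK = 0`, the loss weights, and all function names (`shrink`, `sequence_level_loss`, `word_level_loss`, `total_loss`) are likewise assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label


def shrink(frames, ctc_labels, blank=BLANK):
    """Collapse frame vectors to roughly one vector per predicted token.

    frames: list of equal-length float vectors, one per acoustic frame.
    ctc_labels: per-frame argmax label from the CTC module.
    Assumed rule: drop blank frames, average each run of a repeated label.
    """
    shrunk = []
    position = 0
    for label, run in groupby(ctc_labels):
        run_length = len(list(run))
        segment = frames[position:position + run_length]
        position += run_length
        if label == blank:
            continue  # blank frames carry no token
        dim = len(segment[0])
        shrunk.append(
            [sum(vec[d] for vec in segment) / run_length for d in range(dim)]
        )
    return shrunk


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mean_vector(vectors):
    """Average a non-empty sequence of vectors into a single vector."""
    n = len(vectors)
    return [sum(vec[d] for vec in vectors) / n for d in range(len(vectors[0]))]


def sequence_level_loss(speech, text):
    """MSE between mean-pooled representations; lengths may differ."""
    return mse(mean_vector(speech), mean_vector(text))


def word_level_loss(speech, text):
    """Position-wise MSE; needs equal lengths after shrinking."""
    if len(speech) != len(text):
        raise ValueError("word-level adaptation needs equal lengths")
    return sum(mse(s, t) for s, t in zip(speech, text)) / len(speech)


def total_loss(ctc, st, mt, adapt, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four training losses (weights are placeholders)."""
    return sum(w * loss for w, loss in zip(weights, (ctc, st, mt, adapt)))
```

With five frames labelled `[4, 4, 0, 7, 0]`, `shrink` returns two vectors: the average of the first two frames and the fourth frame unchanged. The shrunk sequence then has the same length as a two-token transcription, which is what the word-level loss in this sketch requires, while the sequence-level variant tolerates a length mismatch because it pools both sides first.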

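Experimental Results

Research Questions

RQ1: Does decoupling the ST encoder, so that acoustic and semantic representation learning are separated, make training easier?
RQ2: Does the shrink mechanism effectively resolve the length mismatch between speech frames and text tokens?
RQ3: Does integrating MT in a shared latent space improve the semantic representations used for ST?
RQ4: Does cross-modal adaptation further reduce the distance between modalities and improve translation accuracy?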

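Key Findings

STAST achieves state-of-the-art or competitive BLEU on Augmented LibriSpeech En-Fr, outperforming several end-to-end baselines.
On MuST-C En-De, STAST outperforms all previous end-to-end models in the base setting and remains strong when SpecAugment is used.
Ablation experiments show that every proposed component (shrink mechanism, semantic encoder, cross-modal adaptation, multi-task learning, CTC loss) contributes to the performance gain.
Sequence-level cross-modal adaptation generally performs slightly better than word-level adaptation.
STAST is robust to data scale and benefits from additional ASR data and, to a certain extent, from additional MT data.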

This review was generated by AI and checked by a human editor.