QUICK REVIEW

[Paper Review] SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

Seamless Communication, Loïc Barrault|arXiv (Cornell University)|Aug 22, 2023

Natural Language Processing Techniques|Cited by 10

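One-Line Summary

SeamlessM4T is a single model that performs speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation plus ASR for up to 100 languages. It was trained on 1M hours of open speech audio and 406k hours of combined aligned data.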

ABSTRACT

What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication

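Motivation and Objectives

Advance speech translation by building a single multitask, multimodal model that covers both speech and text as input and as output.
Extend language coverage and translation directions beyond English-centric systems.
Create and use large-scale aligned multimodal data (SeamlessAlign) to train and evaluate the model.
Provide a robust evaluation that includes human evaluation and safety metrics (toxicity and bias).
Release open-source models, data, and tools to enable reproducibility and further research.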

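Proposed Method

Pre-train self-supervised speech representations with w2v-BERT 2.0 on 1M hours of open speech audio.
Build SeamlessAlign, a multimodal corpus of automatically aligned speech translations totaling over 470k hours.
Combine filtered SeamlessAlign with human-labeled and pseudo-labeled data to train a multitask model covering S2ST, S2TT, ASR, T2TT, and T2ST in the 100-eng and eng-35 directions.
Train SeamlessM4T-Large (2.3B parameters) and SeamlessM4T-Medium (1.2B parameters), aiming to outperform the prior SOTA and cascaded systems.
Evaluate with BLASER 2.0 for modality-agnostic quality estimation, alongside standard metrics (BLEU, chrF++, WER) and human judgments.
Evaluate robustness to noise and speaker variation, and measure added toxicity and gender bias to assess translation safety.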
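
Of the standard metrics listed above, WER is the simplest to compute by hand. The following is a minimal sketch, not the paper's evaluation code: it assumes whitespace tokenization and skips the text normalization that real ASR evaluations usually apply. The function name `word_error_rate` is chosen here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: whitespace tokenization, no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix handled so far
    # and the first j hypothesis words (starts with the empty prefix).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on mat")` counts one deleted word against six reference words, giving 1/6. Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.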

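Experimental Results

Research Questions

RQ1: How well can a single model perform multiple translation modalities (S2ST, S2TT, T2ST, T2TT, and ASR) for 100 source languages?
RQ2: Can a unified multimodal model outperform cascaded systems on standard benchmarks (S2ST, S2TT) and achieve strong results in both English-centric and non-English-centric directions?
RQ3: How does combining large-scale automatically mined data (SeamlessAlign) with human-labeled and pseudo-labeled data affect translation quality?
RQ4: How robust is the model to background noise and speaker variation, and how does it fare on the safety metrics of toxicity and gender bias?
RQ5: How practical and reproducible are the open-sourced models, data, and tools for broad research use?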
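
The task abbreviations in the research questions can be laid out as a small lookup table. This is an illustrative sketch only: the `Task` class, the `TASKS` table, and the `tasks_for` helper are hypothetical names, not part of the released SeamlessM4T code. The task names themselves come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    source: str       # input modality: "speech" or "text"
    target: str       # output modality: "speech" or "text"
    translates: bool  # False when the output stays in the input language


# Hypothetical table; only the five task names are taken from the paper.
TASKS = {
    "S2ST": Task("speech-to-speech translation", "speech", "speech", True),
    "S2TT": Task("speech-to-text translation", "speech", "text", True),
    "T2ST": Task("text-to-speech translation", "text", "speech", True),
    "T2TT": Task("text-to-text translation", "text", "text", True),
    "ASR": Task("automatic speech recognition", "speech", "text", False),
}


def tasks_for(source: str, target: str) -> list[str]:
    """Return the abbreviations of all tasks with the given modalities."""
    return [
        abbr
        for abbr, task in TASKS.items()
        if task.source == source and task.target == target
    ]
```

Under this layout, S2TT and ASR share the same modalities (speech in, text out) and differ only in whether the output language changes.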

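Key Findings

SeamlessM4T-Large improves direct S2TT into English on FLEURS by 20% BLEU (a relative gain) over the previous SOTA.
In the from-English direction, SeamlessM4T-Large improves X2T/S2TT scores on CoVoST 2 by 2.8 BLEU over the prior SOTA and is on par with cascaded systems on FLEURS.
In S2ST, SeamlessM4T-Large outperforms strong 3-stage cascaded models by 2.6 ASR-BLEU points, and on CVSS it outperforms a 2-stage cascaded model by 8.5 ASR-BLEU points.
In XSTS human evaluation of the English-to-XX directions, outputs consistently score above 4 out of 5 across 24 languages; in the into-English directions, the model improves over Whisper-Large-v2 for several languages.
Compared with the current SOTA model, SeamlessM4T-Large improves robustness to background noise by 38% and to speaker variation by 49%.
Across conditions, added toxicity is reduced by 26% to 63% compared with state-of-the-art models. Gender bias is documented and compared against prior models.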
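
The findings above mix two kinds of numbers: relative improvements (such as "20% BLEU") and absolute differences in points (such as "2.6 ASR-BLEU points"). A minimal sketch of the distinction follows. The function names are illustrative, and the scores in the usage note are made-up values, not results from the paper.

```python
def absolute_gain(baseline: float, system: float) -> float:
    """Difference in metric points (e.g. BLEU points)."""
    return system - baseline


def relative_gain(baseline: float, system: float) -> float:
    """Improvement as a fraction of the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (system - baseline) / baseline
```

With a hypothetical baseline of 10.0 BLEU and a system score of 12.0 BLEU, the absolute gain is 2.0 points while the relative gain is 0.2, that is, 20%. The same relative gain corresponds to a different number of points for a different baseline, so the two figures are not interchangeable.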

This review was generated by AI and checked by a human editor.
