QUICK REVIEW

[Paper Review] Seamless: Multilingual Expressive and Streaming Speech Translation

Seamless Communication, Loïc Barrault|arXiv (Cornell University)|Dec 8, 2023

Topic Modeling | Citations: 39

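Summary in brief

This paper introduces SeamlessM4T v2, SeamlessExpressive, and SeamlessStreaming, which together enable end-to-end speech translation that is multilingual, expressive, and streaming. The models, data, and safety tools are publicly released.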

ABSTRACT

Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model, SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at https://github.com/facebookresearch/seamless_communication

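Motivation and objectives

Address the need for speech translation that is natural, expressive, and streaming-capable while preserving vocal style and prosody across many languages.
Develop and strengthen foundational multilingual, multimodal models such as SeamlessM4T v2 to support expressive, streaming speech-to-speech translation (S2ST).
Introduce two specialized models, SeamlessExpressive for preserving vocal style and SeamlessStreaming for low-latency many-to-many translation.
Provide a comprehensive automatic and human evaluation pipeline covering expressivity, robustness, latency, and semantics.
Advance responsible AI practices, including red-teaming, toxicity and bias evaluation, and a watermarking mechanism, and release the tools publicly.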

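Proposed method

Upgrade SeamlessM4T with UnitY2 for efficient unit prediction and upsampling.
Pretrain a broad multilingual, multimodal model (SeamlessM4T v2) on large-scale unlabeled data, then fine-tune it on automatically aligned pairs for low-resource languages.
Develop SeamlessExpressive, which preserves vocal style and prosody across several languages (English, French, German, Italian, Mandarin, Spanish).
Develop SeamlessStreaming with Efficient Monotonic Multihead Attention (EMMA) to deliver low-latency many-to-many streaming translation (speech to speech/text).
Create a new automatic expressivity metric (AutoPCP rhythm evaluation), and apply and adapt human evaluation metrics (MOS, XSTS, PCP) to assess expressivity and semantics.
Implement a comprehensive responsible AI toolkit that includes red-teaming, toxicity mitigation, gender bias evaluation, and an inaudible watermarking mechanism (SeamlessWM).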
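
The streaming item above depends on deciding, at each step, whether to read more source audio or to emit the next target token. The sketch below is a toy read/write loop in that spirit, not EMMA itself: the function name `simultaneous_decode`, the `p_write` callback, and the 0.5 threshold are all hypothetical, and a real system would obtain the write probability from a learned monotonic attention module rather than a hand-written rule.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, int]  # ("READ", source_index) or ("WRITE", target_index)


def simultaneous_decode(
    num_source_chunks: int,
    max_target_tokens: int,
    p_write: Callable[[int, int], float],
    threshold: float = 0.5,
) -> List[Action]:
    """Toy hard-monotonic policy (illustrative only, not the paper's method).

    p_write(written, read) is an assumed callback returning the probability
    of emitting target token `written` after `read` source chunks were seen.
    """
    actions: List[Action] = []
    read = 0      # source chunks consumed so far
    written = 0   # target tokens emitted so far
    while written < max_target_tokens:
        source_finished = read >= num_source_chunks
        # Write when the source is exhausted, or when at least one chunk
        # has been read and the policy is confident enough.
        if source_finished or (read > 0 and p_write(written, read) >= threshold):
            actions.append(("WRITE", written))
            written += 1
        else:
            actions.append(("READ", read))
            read += 1
    return actions


def lag_two_policy(written: int, read: int) -> float:
    """Stand-in for a learned policy: write once the source leads by 2 chunks."""
    return 1.0 if read - written >= 2 else 0.0


if __name__ == "__main__":
    for action in simultaneous_decode(4, 4, lag_two_policy):
        print(action)
```

With the stand-in policy, the loop interleaves reads and writes instead of waiting for the full utterance, which is the behavior that lowers latency. Swapping `lag_two_policy` for a model-driven probability changes when each write fires without changing the loop.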

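Experimental results

Research questions

RQ1: How can a single multilingual model support expressive, streaming-capable cross-lingual speech translation at scale?
RQ2: Can expressive S2ST preserve rhythm, pauses, and vocal style across languages while maintaining semantic fidelity?
RQ3: What low-latency streaming strategies are effective for real-time multilingual S2ST?
RQ4: How can safety issues, bias, and misuse be detected and mitigated in multilingual expressive S2ST systems?
RQ5: Which evaluation protocols best capture expressivity, robustness, and latency in real-world use cases?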

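Key findings

SeamlessM4T v2 achieves state-of-the-art semantic accuracy on speech translation and text translation tasks across roughly 100 languages.
SeamlessExpressive enables translation that preserves vocal style and prosody, including speech rate and pauses, across six languages.
SeamlessStreaming uses EMMA to provide low-latency many-to-many streaming translation with both speech-to-speech and speech-to-text output.
The unified system, Seamless, combines expressivity and streaming to enable real-time expressive cross-lingual communication.
A comprehensive automatic and human evaluation suite, with custom metrics and red-teaming, reports on performance, safety, and bias.
All models, data, and tools are publicly released, including the watermarking detector.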
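
The finding about preserving speech rate and pauses can be made concrete with a small hand-rolled proxy. This is only an illustrative sketch and is not AutoPCP or any metric from the paper: the helper names, the voiced-segment representation, and the assumption that syllable counts and segment boundaries are already available (for example from a forced aligner or voice activity detector) are all hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of one voiced span


def voiced_duration(segments: List[Segment]) -> float:
    """Total seconds of speech across all voiced spans."""
    return sum(end - start for start, end in segments)


def speech_rate(num_syllables: int, segments: List[Segment]) -> float:
    """Syllables per second of voiced audio (pauses excluded)."""
    voiced = voiced_duration(segments)
    if voiced <= 0:
        raise ValueError("segments must contain some voiced audio")
    return num_syllables / voiced


def pause_ratio(segments: List[Segment], total_duration: float) -> float:
    """Fraction of the utterance that is silence."""
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")
    return (total_duration - voiced_duration(segments)) / total_duration


def rate_similarity(source_rate: float, target_rate: float) -> float:
    """Symmetric ratio in (0, 1]; 1.0 means identical speech rates."""
    return min(source_rate, target_rate) / max(source_rate, target_rate)


if __name__ == "__main__":
    # Made-up timings for a source utterance and its translation.
    src_segments = [(0.0, 1.0), (1.5, 2.5)]
    tgt_segments = [(0.0, 0.8), (1.2, 2.0)]
    src_rate = speech_rate(8, src_segments)
    tgt_rate = speech_rate(8, tgt_segments)
    print(rate_similarity(src_rate, tgt_rate))
    print(pause_ratio(src_segments, 2.5), pause_ratio(tgt_segments, 2.0))
```

A translation that keeps the speaker's pacing would score close to 1.0 on `rate_similarity` and show a pause ratio near the source's. Syllable-based rates are not directly comparable across every language pair, so numbers like these are best read as rough indicators.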

This review was generated by AI and checked by a human editor.