QUICK REVIEW

[Paper Review] Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion

Yexing Du, Youcheng Pan|arXiv (Cornell University)|Feb 25, 2026

Natural Language Processing Techniques | Citations: 0

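One-Line Summary

Introduces a Speech-guided Machine Translation (SMT) framework that fuses synthetic speech from a TTS model with text input in a Multimodal Large Language Model (MLLM), iteratively improves translation through a self-evolution mechanism, and achieves state-of-the-art results on Multi30K and FLORES-200.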

ABSTRACT

Multimodal Large Language Models (MLLMs) have achieved notable success in enhancing translation performance by integrating multimodal information. However, existing research primarily focuses on image-guided methods, whose applicability is constrained by the scarcity of multilingual image-text pairs. The speech modality overcomes this limitation due to its natural alignment with text and the abundance of existing speech datasets, which enable scalable language coverage. In this paper, we propose a Speech-guided Machine Translation (SMT) framework that integrates speech and text as fused inputs into an MLLM to improve translation quality. To mitigate reliance on low-resource data, we introduce a Self-Evolution Mechanism. The core components of this framework include a text-to-speech model, responsible for generating synthetic speech, and an MLLM capable of classifying synthetic speech samples and iteratively optimizing itself using positive samples. Experimental results demonstrate that our framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, achieving new state-of-the-art results. Furthermore, on general machine translation datasets, particularly the FLORES-200, it achieves average state-of-the-art performance in 108 translation directions. Ablation studies on CoVoST-2 confirm that differences between synthetic and authentic speech have negligible impact on translation quality. The code and models are released at https://github.com/yxduir/LLM-SRT.

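Motivation and Objectives

Leverage speech as a scalable multilingual modality to advance multimodal translation beyond image-based approaches.
Propose a Speech-guided Machine Translation framework that integrates a TTS generator with an MLLM.
Introduce a self-evolution mechanism that automatically synthesizes data and iteratively improves translation quality.
Pre-train the MLLM in stages (ASR, S2TT, SMT) to bridge speech and text.
Demonstrate scalability and strong performance across 28 languages on multilingual MT benchmarks.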

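Proposed Method

Uses a frozen Whisper-family speech encoder and a trainable adapter (Q-Former + MLP) as the input path into the MLLM.
Adopts a three-stage MLLM pre-training pipeline: ASR, speech-to-text translation (S2TT), and speech-guided machine translation (SMT).
Incorporates a TTS model (CosyVoice2) that generates synthetic speech paired with text for data augmentation.
Implements a self-evolution loop that continually improves translation using positive samples (S2TT/SMT scores) through experience acquisition, refinement, update, and evaluation.
Evaluates on Multi30K, FLORES-200, and WMT24++ using BLEU, spBLEU, and COMET; runs ablations on CoVoST-2.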
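The four phases of the self-evolution loop (experience acquisition, refinement, update, evaluation) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `Sample`, `select_positive`, and `self_evolve`, the callback parameters, and the threshold-based rule for deciding which samples count as positive are all hypothetical. The review only states that positive samples are chosen using S2TT/SMT scores.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    source_text: str
    reference: str
    speech: str  # stand-in for a synthetic waveform produced by a TTS model


def select_positive(
    samples: List[Sample],
    score_fn: Callable[[Sample], float],
    threshold: float,
) -> List[Sample]:
    """Keep samples whose score meets the threshold.

    Assumption: a simple threshold rule. The paper's actual criterion
    for a positive sample may differ.
    """
    return [s for s in samples if score_fn(s) >= threshold]


def self_evolve(
    texts: List[Tuple[str, str]],
    tts_fn: Callable[[str], str],
    score_fn: Callable[[Sample], float],
    update_fn: Callable[[List[Sample]], None],
    evaluate_fn: Callable[[], float],
    threshold: float,
    max_rounds: int,
) -> List[float]:
    """Run a fixed number of rounds and return the evaluation score per round."""
    history: List[float] = []
    for _ in range(max_rounds):
        # 1. Experience acquisition: synthesize speech for each source text.
        samples = [Sample(src, ref, tts_fn(src)) for src, ref in texts]
        # 2. Refinement: keep only the samples judged positive.
        positives = select_positive(samples, score_fn, threshold)
        # 3. Update: hand the positives to the (hypothetical) training step.
        update_fn(positives)
        # 4. Evaluation: record a held-out metric for this round.
        history.append(evaluate_fn())
    return history
```

In a real setup, `tts_fn` would wrap a TTS model such as CosyVoice2, `score_fn` would compute a translation-quality score for the speech-guided output, and `update_fn` would fine-tune the MLLM. Passing them in as callbacks keeps the loop itself independent of any particular model.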

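Experimental Results

Research Questions

RQ1: Can integrating the speech modality with text enable multilingual MT that surpasses image-based methods?
RQ2: How effective is synthetic speech (TTS) for training and continual improvement in SMT?
RQ3: How do authentic and synthetic speech affect translation quality?
RQ4: How well does the SMT approach scale across many languages and directions (28 languages, 108 FLORES-200 directions)?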

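Key Findings

The SMT framework achieves a new state of the art in multilingual MT, outperforming text-only models and image-based MMT models.
On FLORES-200, its average MT performance across all 108 translation directions is state of the art, surpassing larger language models.
The CoVoST-2 ablation shows almost no meaningful difference in translation quality between authentic and synthetic speech.
Self-evolution rounds yield notable gains for low-resource languages (khm, lao, mya), with the largest gains in the early rounds.
Manual evaluation suggests that the speech modality tends to reduce under-translation by aligning attention and providing prosodic cues.
SMT-9B has roughly 1/67 the parameters of large text-only models, yet achieves superior performance by exploiting cross-modal information.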

This review was generated by AI and checked by a human editor.