QUICK REVIEW

[Paper Review] Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion

Yexing Du, Youcheng Pan|arXiv (Cornell University)|2026. 02. 25.

Natural Language Processing Techniques | Citations: 0

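One-line summary

The proposed Speech-guided Machine Translation (SMT) framework fuses synthetic speech generated by a TTS model with text input in a multimodal large language model (MLLM), and adds a Self-Evolution mechanism that iteratively refines translations, achieving state-of-the-art results on Multi30K and FLORES-200.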

ABSTRACT

Multimodal Large Language Models (MLLMs) have achieved notable success in enhancing translation performance by integrating multimodal information. However, existing research primarily focuses on image-guided methods, whose applicability is constrained by the scarcity of multilingual image-text pairs. The speech modality overcomes this limitation due to its natural alignment with text and the abundance of existing speech datasets, which enable scalable language coverage. In this paper, we propose a Speech-guided Machine Translation (SMT) framework that integrates speech and text as fused inputs into an MLLM to improve translation quality. To mitigate reliance on low-resource data, we introduce a Self-Evolution Mechanism. The core components of this framework include a text-to-speech model, responsible for generating synthetic speech, and an MLLM capable of classifying synthetic speech samples and iteratively optimizing itself using positive samples. Experimental results demonstrate that our framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, achieving new state-of-the-art results. Furthermore, on general machine translation datasets, particularly the FLORES-200, it achieves average state-of-the-art performance in 108 translation directions. Ablation studies on CoVoST-2 confirm that differences between synthetic and authentic speech have negligible impact on translation quality. The code and models are released at https://github.com/yxduir/LLM-SRT.

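Motivation and objectives

Explore speech as a scalable, multilingual modality for multimodal translation, moving beyond image-guided approaches.
Propose a Speech-guided Machine Translation framework that pairs a TTS generator with an MLLM.
Introduce a Self-Evolution mechanism that synthesizes data autonomously and iteratively improves translation quality.
Pre-train the MLLM in stages (ASR, S2TT, SMT) to bridge the gap between speech and text.
Demonstrate scalability and strong performance across 28 languages on multilingual MT benchmarks.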

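Proposed method

A frozen Whisper-based speech encoder feeds the MLLM through a trainable adapter (Q-Former + MLP).
The MLLM is pre-trained with a three-stage pipeline: ASR, speech-to-text translation (S2TT), and speech-guided machine translation (SMT).
A TTS model (CosyVoice2) generates text-aligned synthetic speech for data augmentation.
A Self-Evolution loop of experience acquisition, refinement, update, and evaluation continually improves translation using positive samples (S2TT/SMT scores).
Multi30K, FLORES-200, and WMT24++ are evaluated with BLEU, spBLEU, and COMET, and ablations are run on CoVoST-2.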
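
The four phases of the Self-Evolution loop (experience acquisition, refinement, update, evaluation) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. Every name here (`Sample`, `self_evolve`, `tts`, `s2tt`, `smt`, `score`, `is_positive`, `update`) is hypothetical. The review does not specify the rule that marks a sample as positive, so the caller passes it in as `is_positive`. The toy stand-ins at the bottom exist only so the sketch runs end to end.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    source: str     # source-language text
    reference: str  # reference translation
    speech: str     # stand-in for synthetic audio (a real system would hold a waveform)


def self_evolve(
    pairs: List[Tuple[str, str]],
    tts: Callable[[str], str],
    s2tt: Callable[[str], str],
    smt: Callable[[str, str], str],
    score: Callable[[str, str], float],
    is_positive: Callable[[float, float], bool],
    update: Callable[[List[Sample]], None],
    rounds: int = 3,
) -> List[Tuple[int, float]]:
    """Hypothetical loop; returns (positives kept, mean SMT score) per round."""
    history: List[Tuple[int, float]] = []
    if not pairs:
        return history
    for _ in range(rounds):
        # 1) Experience acquisition: synthesize speech for every text pair.
        samples = [Sample(src, ref, tts(src)) for src, ref in pairs]
        # 2) Refinement: score the speech-only (S2TT) and fused (SMT) outputs,
        #    then keep the samples that the caller's rule marks as positive.
        positives = []
        for s in samples:
            s2tt_score = score(s2tt(s.speech), s.reference)
            smt_score = score(smt(s.speech, s.source), s.reference)
            if is_positive(s2tt_score, smt_score):
                positives.append(s)
        # 3) Update: train on the positive samples only.
        update(positives)
        # 4) Evaluation: re-score after the update (a real setup would use
        #    held-out data rather than the training pairs).
        total = sum(score(smt(s.speech, s.source), s.reference) for s in samples)
        history.append((len(positives), total / len(samples)))
    return history


# Toy stand-ins (assumptions for illustration only, not real models).
state = {"skill": 0}


def toy_tts(text: str) -> str:
    return f"<audio:{text}>"


def toy_s2tt(speech: str) -> str:
    return "x"


def toy_smt(speech: str, text: str) -> str:
    # The toy "model" translates correctly only after at least one update.
    return text.upper() if state["skill"] > 0 else text


def toy_score(hypothesis: str, reference: str) -> float:
    return 1.0 if hypothesis == reference else 0.0


def toy_update(positives: List[Sample]) -> None:
    state["skill"] += len(positives)


history = self_evolve(
    [("hi", "HI"), ("ok", "OK")],
    toy_tts, toy_s2tt, toy_smt, toy_score,
    lambda s2tt_score, smt_score: smt_score >= s2tt_score,
    toy_update,
    rounds=2,
)
print(history)
```

Passing the selection rule as a parameter keeps the sketch neutral about how positives are chosen. In a real pipeline, `score` would be a metric such as BLEU or COMET and `update` a fine-tuning step.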

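Experimental results

Research questions

RQ1: Does fusing the speech modality with text improve multilingual MT beyond image-guided methods?
RQ2: How effective is synthetic speech (TTS) for training and continually improving the MLLM in SMT?
RQ3: How do real and synthetic speech differ in their effect on translation quality?
RQ4: How well does the SMT approach scale across many languages and directions (28 languages, 108 FLORES-200 directions)?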

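Key findings

The SMT framework sets a new state of the art on Multi30K, outperforming text-only and image-guided MMT models.
On FLORES-200, SMT reaches state-of-the-art average MT performance across 108 translation directions and outperforms larger language models.
In ablation experiments on CoVoST-2, the difference in translation quality between real and synthetic speech is negligible.
Self-Evolution rounds yield clear gains on low-resource languages (khm, lao, mya), with the largest improvements in the early rounds.
Manual evaluation indicates that the speech modality helps reduce under-translation by aligning attention and providing prosodic cues.
SMT-9B is only about 1/67 the size of a large text-only model, yet achieves superior performance by exploiting cross-modal information.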

This review was generated by AI and reviewed by a human editor.