QUICK REVIEW

[Paper Review] SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

Seamless Communication, Loïc Barrault | arXiv (Cornell University) | 2023. 08. 22.

Natural Language Processing Techniques · Citations: 10

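One-Line Summary

SeamlessM4T is a single unified model that performs speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation, as well as ASR, for up to 100 languages. It was trained on 1M hours of open speech data and 406k hours of combined aligned data.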

ABSTRACT

What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication

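Motivation and Objectives

Advance speech translation by building a single multitask, multimodal model that handles both speech and text as input and output.
Extend language coverage and translation directions beyond English-centric systems.
Create and use large-scale aligned multimodal data (SeamlessAlign) to train and evaluate the model.
Provide robust evaluation, including human evaluation and safety metrics such as toxicity and bias.
Release open-source models, data, and tools to enable reproducibility and further research.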

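Proposed Method

Pretrain self-supervised speech representations with w2v-BERT 2.0 on 1 million hours of open speech audio.
Build SeamlessAlign, a multimodal corpus of automatically aligned speech translations totaling more than 470k hours.
Combine the filtered SeamlessAlign data with human-labeled and pseudo-labeled data to train a multitask model covering S2ST, S2TT, ASR, T2TT, and T2ST for the 100-eng and eng-35 directions.
Train SeamlessM4T-Large (2.3B parameters) and SeamlessM4T-Medium (1.2B parameters), which outperform the previous SOTA and cascaded systems.
Evaluate with BLASER 2.0 for modality-agnostic quality estimation, alongside standard metrics (BLEU, chrF++, WER) and human judgments.
Evaluate robustness to background noise and speaker variation, and measure added toxicity and gender bias to assess translation safety.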
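
Of the standard metrics listed above, WER is the simplest to make concrete. The following is a minimal sketch, assuming plain whitespace tokenization and no text normalization; it is not the paper's evaluation code, and the function name `word_error_rate` is hypothetical. It computes the word-level edit distance between a reference and a hypothesis and divides it by the reference length.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Assumption: tokens are whitespace-separated and compared verbatim
    (no casing or punctuation normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from any benchmark.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference. Reported scores usually depend on the normalization applied before scoring, which this sketch leaves out.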

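Experimental Results

Research Questions

RQ1: How well can a single model perform S2ST, S2TT, T2ST, T2TT, and ASR across 100 source languages?
RQ2: Does the unified multimodal model outperform cascaded systems on standard benchmarks (S2ST, S2TT) and achieve strong performance in both English-centric and non-English directions?
RQ3: How does combining large-scale automatically mined data (SeamlessAlign) with human-labeled and pseudo-labeled data affect translation quality?
RQ4: How robust is the model to background noise and speaker variation, and how does it perform on the toxicity and gender-bias safety metrics?
RQ5: Does open-sourcing the model, data, and tools make broad research use practical and reproducible?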

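Key Results

On FLEURS, SeamlessM4T-Large improves direct S2TT into English by 20% BLEU (a relative improvement) over the previous SOTA.
For from-English directions, SeamlessM4T-Large improves S2TT on CoVoST 2 by 2.8 BLEU over the previous SOTA and performs comparably to cascaded systems on FLEURS.
In S2ST, SeamlessM4T-Large outperforms a strong 3-stage cascaded model by 2.6 ASR-BLEU points on FLEURS and a 2-stage cascaded model by 8.5 ASR-BLEU points on CVSS.
In XSTS human evaluation of from-English directions, outputs for 24 languages consistently score above 4 out of 5; in into-English directions, improvements over Whisper-Large-v2 are observed for several languages.
In FLEURS-based robustness evaluation, SeamlessM4T-Large is 38% more robust to background noise and 49% more robust to speaker variation.
Added toxicity is reduced by 26% to 63%, depending on the condition, relative to the current state-of-the-art model; gender-bias effects are documented and are comparable to those of existing models.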
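
The results above mix two kinds of figures: absolute gains in metric points (for example, 2.8 BLEU or 2.6 ASR-BLEU) and relative gains in percent (for example, 20% BLEU). The sketch below shows how the two relate. The function names are hypothetical, and the scores in the example are invented for illustration; they are not taken from the paper.

```python
def absolute_gain(baseline: float, new: float) -> float:
    """Gain in metric points, e.g. BLEU points."""
    return new - baseline


def relative_gain_percent(baseline: float, new: float) -> float:
    """Gain as a percentage of the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline must be positive to compute a relative gain")
    return 100.0 * (new - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical scores, chosen only to make the arithmetic easy to follow.
    baseline_bleu, new_bleu = 20.0, 24.0
    print(absolute_gain(baseline_bleu, new_bleu))          # gain in BLEU points
    print(relative_gain_percent(baseline_bleu, new_bleu))  # gain in percent
```

With these made-up scores, a 4-point absolute gain corresponds to a 20% relative gain. The same percentage would mean a different number of points against a different baseline, which is why the two kinds of figures should not be read interchangeably.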


This review was generated by AI and reviewed by a human editor.
