QUICK REVIEW

[Paper Review] No Language Left Behind: Scaling Human-Centered Machine Translation

NLLB Team, Marta R. Costa-jussà | arXiv (Cornell University) | 2022. 07. 11.

Natural Language Processing Techniques | Citations: 360

One-line summary

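This paper trains a large-scale, human-centered MT system covering 200 languages using a Sparsely Gated Mixture of Experts model, novel data mining techniques, and comprehensive human and safety evaluation, and it achieves a 44% BLEU improvement relative to the previous state of the art.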

ABSTRACT

Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system. Finally, we open source all contributions described in this work, accessible at https://github.com/facebookresearch/fairseq/tree/nllb.

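Research motivation and goals

The authors establish the need for low-resource language translation and its social impact, drawing on exploratory interviews with native speakers.
The authors develop datasets and models aimed at narrowing the performance gap between low-resource and high-resource languages.
They propose a conditional compute model based on a Sparsely Gated Mixture of Experts.
They train on thousands of translation tasks while mitigating overfitting.
They evaluate translation quality and safety with a human-translated benchmark and a toxicity benchmark, both spanning Flores-200.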

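Proposed method

Propose a conditional compute model based on a Sparsely Gated Mixture of Experts (MoE).
Train on data obtained with novel data mining techniques tailored to low-resource languages.
Introduce architectural and training improvements to counteract overfitting across thousands of tasks.
Evaluate over 40,000 translation directions using the human-translated Flores-200 benchmark.
Combine human evaluation with a novel toxicity benchmark covering all languages in Flores-200.
Open source all contributions for reuse by the community.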
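To make the conditional compute idea concrete, here is a minimal sketch of sparsely gated routing for a single token. It is an illustration under stated assumptions, not the NLLB implementation: the top-2 choice, the renormalization of gate weights, the toy "experts" that only scale their input, and every name (`top_k_gate`, `moe_layer`, `make_expert`) are hypothetical and do not come from the review or the abstract.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(token, gate_weights, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Assumption: gate scores are a linear projection of the token followed
    by a softmax, and the k selected weights are renormalized to sum to 1.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, gate_weights, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    routes = top_k_gate(token, gate_weights, k)
    out = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)  # the other experts are never evaluated
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, routes


def make_expert(scale):
    """Toy stand-in for a feed-forward expert: scales its input."""
    return lambda token: [scale * x for x in token]


random.seed(0)
dim, num_experts = 4, 8
gate_weights = [[random.uniform(-1, 1) for _ in range(dim)]
                for _ in range(num_experts)]
experts = [make_expert(s) for s in range(1, num_experts + 1)]
token = [0.5, -1.0, 0.25, 2.0]

output, routes = moe_layer(token, gate_weights, experts)
print("selected experts and weights:", routes)
print("layer output:", output)
```

The point of the sketch is that compute per token depends on `k`, not on the total number of experts, which is what lets model capacity grow without a proportional increase in per-token cost.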

Experimental results

Research questions

RQ1: Can a universal translation system be scaled to 200 languages while still maintaining high quality and safety?
RQ2,

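Key results

Achieves a 44% BLEU improvement relative to the previous state of the art.
Evaluates over 40,000 translation directions with the human-translated Flores-200 benchmark.
Assesses translation safety with a toxicity benchmark covering all languages in Flores-200.
Shows that architectural and training improvements allow the model to handle thousands of tasks effectively.
Open sources all data, models, and methods, enabling reproducibility and broader adoption.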
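The two headline numbers above are simple arithmetic, which the following sketch spells out. It assumes that a translation direction is an ordered pair of distinct languages and that "44% relative" means (new - baseline) / baseline. The BLEU scores 10.0 and 14.4 are hypothetical values chosen for illustration, not results from the paper, and the function names are illustrative as well.

```python
def translation_directions(num_languages):
    """Ordered (source, target) pairs with source != target."""
    return num_languages * (num_languages - 1)


def min_languages_for(directions):
    """Smallest language count whose direction count exceeds `directions`."""
    n = 2
    while translation_directions(n) <= directions:
        n += 1
    return n


def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, as a fraction."""
    return (new_score - baseline_score) / baseline_score


print(translation_directions(200))   # directions for exactly 200 languages
print(min_languages_for(40_000))     # languages needed to exceed 40,000 directions

# Hypothetical scores, for illustration only.
baseline_bleu, new_bleu = 10.0, 14.4
print(f"{relative_improvement(new_bleu, baseline_bleu):.0%}")
```

Under this counting, exactly 200 languages would yield 39,800 directions, so a figure above 40,000 suggests the benchmark counts slightly more than 200 language varieties. Likewise, a relative gain is not an absolute gain in BLEU points: in the hypothetical example a 4.4-point increase corresponds to 44% only because the baseline is 10.0.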


This review was generated by AI and reviewed by a human editor.