[Paper Review] Beyond English-Centric Multilingual Machine Translation
The paper builds a true Many-to-Many translation model (M2M-100) for 100 languages without pivoting through English, leveraging large-scale data mining, backtranslation, and a mix of dense and sparse parameters to achieve strong non-English translation performance.
Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.
Motivation & Objective
- Address the English-centric bias in multilingual MT by enabling direct translation between non-English language pairs.
- Create a large-scale, 100-language parallel dataset (7.5B sentences, 2200 directions) using multilingual data mining and backtranslation.
- Investigate model scaling via dense capacity and language-specific sparse parameters to handle quadratic data growth.
- Propose a bridge-language data mining strategy to efficiently mine useful bitext without exhaustively covering all language pairs.
- Evaluate the resulting M2M-100 model against bilingual baselines and WMT-style benchmarks to demonstrate competitive performance.
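The bridge-language idea can be illustrated with a small sketch. It assumes that languages inside a group are mined against each other, that bridge languages connect the groups, and that every language is still mined against English. The function name, the toy groups, and the choice of bridges are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def bridge_mining_pairs(groups, bridges, english="en"):
    """Return the set of unordered language pairs selected for mining.

    groups:  dict mapping a group name to a list of language codes
    bridges: dict mapping a group name to its chosen bridge languages
    """
    pairs = set()
    # 1. Mine every pair of languages that share a group.
    for langs in groups.values():
        pairs.update(frozenset(p) for p in combinations(langs, 2))
    # 2. Connect the groups by mining all bridge languages against each other.
    all_bridges = [b for bs in bridges.values() for b in bs]
    pairs.update(frozenset(p) for p in combinations(all_bridges, 2))
    # 3. Keep mining every language against English.
    for langs in groups.values():
        for lang in langs:
            if lang != english:
                pairs.add(frozenset((lang, english)))
    return pairs


# Toy grouping, for illustration only (not the paper's 14 groups).
groups = {
    "germanic": ["en", "de", "nl", "sv"],
    "romance": ["fr", "es", "it", "pt"],
    "slavic": ["ru", "pl", "cs", "uk"],
}
bridges = {"germanic": ["de"], "romance": ["fr", "es"], "slavic": ["ru"]}

selected = bridge_mining_pairs(groups, bridges)
n_langs = sum(len(v) for v in groups.values())
exhaustive = n_langs * (n_langs - 1) // 2
print(f"{len(selected)} of {exhaustive} possible pairs selected")
```

In this toy setup a pair such as Dutch-Polish is never mined directly, since neither language is a bridge, while German-Russian is mined because both are bridges.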
Proposed method
- Use a Transformer-based seq2seq architecture with 12 encoder and 12 decoder layers and 1.2B parameters as a base model, trained with label smoothing and LayerDrop for stabilization.
- Adopt SentencePiece subword segmentation with a multilingual dictionary of 128k tokens balanced across languages using temperature sampling.
- Construct a Many-to-Many parallel dataset for 100 languages via bridge-language mining, grouping languages into 14 clusters and using 26 bridge languages, plus mining against English.
- Mine parallel data from CCMatrix/CCAligned with pipelines built on LASER embeddings and FAISS indexing, followed by post-filtering and language-specific checks.
- Augment the mined data with backtranslation for 100 directions whose BLEU scores fall between 2 and 10, sampling 50M monolingual sentences per target language and tagging the backtranslated data.
- Incorporate a hybrid dense-sparse parameter strategy (mixture-of-experts) with language-specific routing to scale to 15.4B parameters while maintaining trainability on hundreds of GPUs.
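The temperature sampling used to balance the vocabulary across languages can be sketched as follows. This is a minimal illustration assuming the common formulation in which each language's data share is raised to the power 1/T and renormalized. The sentence counts and the temperature values are hypothetical and do not come from the paper.

```python
def temperature_sampling_probs(sentence_counts, temperature):
    """Turn per-language sentence counts into sampling probabilities.

    temperature = 1 reproduces the raw data proportions;
    larger temperatures flatten the distribution toward uniform.
    """
    total = sum(sentence_counts.values())
    # Raise each language's data share to the power 1/T.
    scaled = {
        lang: (count / total) ** (1.0 / temperature)
        for lang, count in sentence_counts.items()
    }
    # Renormalize so the probabilities sum to 1.
    norm = sum(scaled.values())
    return {lang: value / norm for lang, value in scaled.items()}


# Hypothetical corpus sizes for a high-, mid-, and low-resource language.
counts = {"en": 1_000_000, "fr": 100_000, "is": 1_000}

for t in (1.0, 5.0):
    probs = temperature_sampling_probs(counts, t)
    print(t, {lang: round(p, 3) for lang, p in probs.items()})
```

Raising the temperature gives low-resource languages a larger share of the sampled text while keeping the ordering of languages by data size.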
Experimental results
Research questions
- RQ1: Can a true Many-to-Many MT system directly translate between any pair among 100 languages without English pivoting and achieve competitive performance?
- RQ2: How does bridge-language based mining compare to English-centric mining in terms of data efficiency and translation quality across language directions?
- RQ3: What is the impact of dense scaling and language-specific sparse parameters on model capacity and translation quality in a 100-language setting?
- RQ4: Does backtranslation consistently improve translation quality across diverse language directions in a Many-to-Many setup?
Key findings
- Training on non-English-centric data yields gains of more than 10 BLEU when translating directly between non-English directions, compared with English-centric baselines.
- Bridge-language mining with 14 language groups and 26 bridge languages yields more parallel data (5–10x) than English-centric mining, improving coverage for mid- and low-resource languages.
- Backtranslation consistently improves BLEU across directions, particularly for low-performance pairs, when added to the Many-to-Many training data.
- With combined dense and sparse scaling, M2M-100 reaches up to 15.4B parameters while remaining efficient to train, supporting direct translation across the 100x100 language directions.
- On standard benchmarks, the Many-to-Many model is competitive with the best single systems of WMT, despite covering a vastly larger set of directions.
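The backtranslation setup described in this review (directions with BLEU between 2 and 10, with the synthetic data tagged) can be sketched as follows. The tag string, the function names, the inclusive bounds, and the BLEU values are all hypothetical; the review does not specify them.

```python
BT_TAG = "<BT>"  # hypothetical marker; the actual token is not given in this review


def select_bt_directions(bleu_by_direction, low=2.0, high=10.0):
    """Keep the directions whose current BLEU falls in [low, high]."""
    return sorted(
        direction
        for direction, bleu in bleu_by_direction.items()
        if low <= bleu <= high
    )


def tag_backtranslated(synthetic_source, real_target):
    """Prefix the machine-generated source side so it can be told apart from mined data."""
    return f"{BT_TAG} {synthetic_source}", real_target


# Made-up BLEU scores keyed by (source, target) language codes.
current_bleu = {
    ("is", "fr"): 4.1,
    ("de", "fr"): 28.3,
    ("zu", "sw"): 0.7,
    ("ne", "hi"): 9.5,
}

chosen = select_bt_directions(current_bleu)
print(chosen)
print(tag_backtranslated("synthetic source sentence", "real target sentence"))
```

Directions that already score well, or that score too low for the model to produce usable synthetic data, fall outside the window and are skipped.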
This review was created by AI and reviewed by human editors.