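[Paper Review] Massively Multilingual Neural Machine Translation
This paper scales English-centric multilingual NMT to 102 languages, training a single Transformer model that translates to and from English across 204 directions. The model improves over bilingual baselines, especially in low-resource settings, while showing some trade-offs when the target language is not English.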
Multilingual neural machine translation (NMT) enables training a single model that supports translation from multiple source languages into multiple target languages. In this paper, we push the limits of multilingual NMT in terms of number of languages being used. We perform extensive experiments in training massively multilingual NMT models, translating up to 102 languages to and from English within a single model. We explore different setups for training such models and analyze the trade-offs between translation quality and various modeling decisions. We report results on the publicly available TED talks multilingual corpus where we show that massively multilingual many-to-many models are effective in low resource settings, outperforming the previous state-of-the-art while supporting up to 59 languages. Our experiments on a large-scale dataset with 102 languages to and from English and up to one million examples per direction also show promising results, surpassing strong bilingual baselines and encouraging future work on massively multilingual NMT.
Motivation and Objectives
- Show that an English-centric, massively multilingual NMT model can scale to a large number of languages and translation directions.
- Evaluate translation quality across low-resource and high-resource settings on TED Talks and a large internal dataset.
- Analyze how the number of languages involved affects performance and generalization, including zero-shot translation.
- Compare many-to-many models to many-to-one and bilingual baselines under identical training conditions.
- Identify practical trade-offs and future directions for massively multilingual NMT.
Proposed Method
- Use the Transformer base architecture (6-layer encoder and decoder, model dimension 512, feed-forward hidden dimension 2048, 8 attention heads) with dropout and inverse square-root learning rate scheduling.
- Train English-centric many-to-many models on 116 directions (58 languages to/from English) using joint subword segmentation (32k vocab) and heterogeneous batching.
- Compare against bilingual baselines and prior multilingual approaches under identical conditions.
- Evaluate on the TED Talks multilingual corpus (59 languages: English plus 58 others, 116 directions) and on a large in-house 103-language corpus (English plus 102 others, 204 directions) with up to 1,000,000 examples per direction.
- Investigate effects of training set size, resource level, and model capacity on translation quality and zero-shot performance.
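Two details of the setup above can be sketched in a few lines: how the direction count of an English-centric model follows from the number of non-English languages, and what an inverse square-root learning rate schedule looks like. This is a minimal illustration, not the authors' implementation. The function names and the placeholder language codes are hypothetical, and the warmup length of 4000 steps is an assumption borrowed from the original Transformer recipe; this review does not state the value used in the paper.

```python
def english_centric_directions(languages, hub="en"):
    """Return (source, target) pairs for every language to and from the hub."""
    others = [lang for lang in languages if lang != hub]
    to_hub = [(lang, hub) for lang in others]
    from_hub = [(hub, lang) for lang in others]
    return to_hub + from_hub


def inverse_sqrt_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup, then decay proportional to 1 / sqrt(step).

    d_model=512 matches the Transformer base setting listed above;
    warmup_steps=4000 is an assumed value, not taken from the paper.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Placeholder language codes; only the counts matter here.
ted_languages = ["en"] + [f"lang{i}" for i in range(58)]
large_languages = ["en"] + [f"lang{i}" for i in range(102)]

print(len(english_centric_directions(ted_languages)))    # 116
print(len(english_centric_directions(large_languages)))  # 204
print(inverse_sqrt_lr(4000))  # learning rate at the end of warmup (the peak)
```

Because every non-English language contributes exactly two directions, the model never sees a direct non-English pair during training, which is why translation between such pairs counts as zero-shot.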
Experimental Results
Research Questions
- RQ1: How well can a single NMT model scale to support translation across a very large set of languages and directions?
- RQ2: Does a massively multilingual many-to-many setup outperform bilingual and many-to-one baselines in both low-resource and high-resource settings?
- RQ3: How does increasing the number of languages involved affect translation quality and zero-shot generalization?
- RQ4: What are the trade-offs between model capacity, number of tasks, and data size in massively multilingual NMT?
- RQ5: Can multilingual training improve zero-shot translation and cross-language transfer?
Key Findings
- Massively multilingual many-to-many models outperform bilingual baselines and many-to-one models on X-to-English directions in the low-resource TED setting.
- In X-to-English, the many-to-many model improves by 1.82 BLEU on average over the best many-to-one models of Neubig & Hu (2018), and by 2.44 BLEU over the paper's own many-to-one model, on the four low-resource pairs.
- On the 103-language high-resource setup, both many-to-one and many-to-many models beat the bilingual baselines on average when translating to English, with the many-to-one model often performing best except for some language pairs (e.g., German-to-English from the German-English development set).
- When translating from English to other languages, the one-to-many model generally outperforms the many-to-many setup under the same conditions.
- Zero-shot and multilinguality analyses show a trade-off: increasing languages can improve zero-shot performance but may reduce supervised performance for some pairs under fixed capacity; mid-range subsets (e.g., 50-to-50) balance generalization and accuracy.
- Zero-shot improvements emerge with more languages, but gains vary by language pair and dataset size.
This review was generated by AI and checked by a human editor.