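[Paper Explained] Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information
This paper proposes mRASP, a universal pre-training method for multilingual NMT that uses random aligned substitution to align cross-lingual representations. After fine-tuning on downstream language pairs, it yields significant gains across 42 translation directions, including exotic translation directions.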
We investigate the following question for machine translation (MT): can we develop a single universal MT model to serve as the common seed and obtain derivative and improved models on arbitrary language pairs? We propose mRASP, an approach to pre-train a universal multilingual neural machine translation model. The key idea in mRASP is its novel technique of random aligned substitution, which brings words and phrases with similar meanings across multiple languages closer in the representation space. We pre-train an mRASP model on 32 language pairs jointly with only public datasets. The model is then fine-tuned on downstream language pairs to obtain specialized MT models. We carry out extensive experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource settings, as well as transfer to exotic language pairs. Experimental results demonstrate that mRASP achieves significant performance improvements compared to training directly on those target pairs. This is the first work to verify that multiple low-resource language pairs can be utilized to improve rich-resource MT. Surprisingly, mRASP is even able to improve translation quality on exotic languages that never occur in the pre-training corpus. Code, data, and pre-trained models are available at https://github.com/linzehui/mRASP.
Research Motivation and Objectives
- Aim to develop a single universal pre-trained MT seed that can be fine-tuned to any language pair.
- Address limitations of existing MT pre-training objectives by aligning multilingual representations.
- Leverage alignment information to bridge semantic gaps across languages.
- Demonstrate performance gains across extremely low, low, medium, and rich resource settings as well as exotic translation scenarios.
- Provide open access to code, data, and pre-trained models for reproducibility and reuse.
Proposed Method
- Use a Transformer-based architecture (6-layer encoder and 6-layer decoder, 1,024 model dimension, 16 attention heads, GeLU activations, learned positional embeddings).
- Pre-train on 32 language pairs with English as the anchor language using the PC32 parallel corpus (197M sentence pairs).
- Introduce Random Aligned Substitution (RAS): randomly substitute a source word with its aligned translation in other languages using MUSE dictionaries, creating code-switched examples that share cross-language semantic space.
- Train with the standard translation loss across all language pairs, adding language tokens that indicate the source and target languages.
- Maintain the same architecture and training objective across pre-training and downstream fine-tuning to enable effective transfer and alignment.
- Fine-tune the pre-trained model on downstream language pairs; optionally combine with back-translation to further boost performance.
- Share one BPE vocabulary (32k merge operations) across all languages, over-sampling under-represented languages so that the vocabulary is balanced.
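The Random Aligned Substitution and language-token steps listed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `TOY_DICT` is a hypothetical stand-in for the MUSE dictionaries, the `<en>`-style token format and the function names are assumptions, and the substitution probability is left as a free parameter.

```python
import random

# Hypothetical toy dictionary standing in for MUSE bilingual dictionaries:
# each English word maps to translations keyed by language code.
TOY_DICT = {
    "love": {"fr": "aime", "de": "liebe"},
    "singing": {"fr": "chanter", "zh": "唱歌"},
}


def random_aligned_substitution(tokens, dictionary, prob, rng):
    """Replace each token that has a dictionary entry, with probability
    `prob`, by its translation in a randomly chosen language."""
    out = []
    for tok in tokens:
        translations = dictionary.get(tok)
        if translations and rng.random() < prob:
            lang = rng.choice(sorted(translations))  # sorted for determinism
            out.append(translations[lang])
        else:
            out.append(tok)
    return out


def add_language_token(tokens, lang):
    """Prepend a language token (assumed format: <lang>)."""
    return [f"<{lang}>"] + tokens


def make_training_pair(src, tgt, src_lang, tgt_lang, dictionary, prob, rng):
    """Build one code-switched training example: the source side is
    perturbed by RAS, the target side is left untouched."""
    noisy_src = random_aligned_substitution(src, dictionary, prob, rng)
    return (add_language_token(noisy_src, src_lang),
            add_language_token(tgt, tgt_lang))


if __name__ == "__main__":
    rng = random.Random(0)
    src = ["i", "love", "singing"]
    tgt = ["j'", "aime", "chanter"]
    print(make_training_pair(src, tgt, "en", "fr", TOY_DICT, 0.5, rng))
```

Because the substituted words sit in the same context as the words they replace, the encoder sees words with similar meanings from different languages in interchangeable positions, which is the intuition behind RAS pulling their representations closer together.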
Experimental Results
Research Questions
- RQ1: Can a single universal multilingual pre-trained MT model serve as an effective seed for arbitrary language pairs after fine-tuning?
- RQ2: Does Random Aligned Substitution (RAS) effectively bridge semantic representations across languages to improve translation quality?
- RQ3: How does mRASP perform across extremely low, low, medium, and rich resource settings and in exotic translation scenarios?
- RQ4: What is the relative contribution of pre-training versus fine-tuning to final MT performance?
- RQ5: Is mRASP beneficial for exotic translation directions, where the language pair, or even both of its languages, never appeared in pre-training?
Key Findings
- mRASP yields significant improvements over directly trained bilingual models across extremely low to rich resource settings and for exotic translations.
- In extremely low-resource settings (e.g., fewer than 100k training sentence pairs), gains reach up to +22 BLEU points; large gains are also seen in medium- and rich-resource settings (e.g., En–Fr, En–Zh).
- Alignment-aware pre-training (RAS) bridges cross-language semantic space, increasing cross-language word similarity and improving translation quality.
- Pre-training with mRASP followed by fine-tuning consistently outperforms NA-mRASP (without RAS) and direct training; back-translation can provide additional boosts (~2 BLEU points).
- Compared to other pre-training models like mBART and XLM, mRASP achieves competitive or superior results on several language pairs; it remains effective even when exotic languages are involved.
- Exotic translation experiments show mRASP benefits across four categories (exotic pair, exotic full, exotic source, exotic target), with substantial gains even when neither language appears in pre-training.
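The four exotic categories can be made concrete with a small classifier. This sketch assumes one reading of the category names, which the summary above does not spell out: "exotic pair" means both languages were seen in pre-training but never paired with each other, "exotic source" and "exotic target" mean that only that side is unseen, and "exotic full" means neither language was seen. The language set and pair set below are toy values, not the actual PC32 inventory.

```python
# Toy pre-training inventory (hypothetical, English-anchored).
PRETRAIN_LANGS = {"en", "fr", "zh"}
PRETRAIN_PAIRS = {frozenset(("en", "fr")), frozenset(("en", "zh"))}


def exotic_category(src, tgt, pretrain_langs, pretrain_pairs):
    """Classify a translation direction relative to the pre-training data,
    under the category definitions assumed in the lead-in."""
    src_seen = src in pretrain_langs
    tgt_seen = tgt in pretrain_langs
    if src_seen and tgt_seen:
        if frozenset((src, tgt)) in pretrain_pairs:
            return "seen pair"
        return "exotic pair"    # both languages seen, but never together
    if src_seen:
        return "exotic target"  # only the target language is unseen
    if tgt_seen:
        return "exotic source"  # only the source language is unseen
    return "exotic full"        # neither language is seen


if __name__ == "__main__":
    for src, tgt in [("en", "fr"), ("fr", "zh"), ("nl", "en"),
                     ("en", "nl"), ("nl", "pt")]:
        label = exotic_category(src, tgt, PRETRAIN_LANGS, PRETRAIN_PAIRS)
        print(f"{src}->{tgt}: {label}")
```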
This walkthrough was generated by AI and reviewed by a human editor.