[Paper Review] Cross-lingual Language Model Pretraining
The paper introduces cross-lingual language models (XLM) with unsupervised (CLM/MLM) and supervised (TLM) pretraining to learn multilingual representations, achieving state-of-the-art results in cross-lingual classification and both unsupervised and supervised machine translation.
Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.
Motivation & Objective
- Demonstrate that cross-lingual pretraining improves multilingual sentence representations.
- Propose unsupervised cross-lingual objectives (CLM, MLM) for monolingual data.
- Introduce a supervised cross-lingual objective (TLM) leveraging parallel data.
- Show state-of-the-art performance on XNLI, unsupervised MT, and supervised MT.
- Highlight benefits for low-resource languages and cross-lingual embeddings.
Proposed method
- Use a shared subword vocabulary learned by Byte Pair Encoding across N languages.
- Train Transformer language models with CLM on monolingual data to predict a word from previous context.
- Train MLM by masking 15% of tokens and predicting them from the surrounding context, using text streams of multiple sentences as input rather than sentence pairs.
- Introduce Translation Language Modeling (TLM), concatenating parallel sentences and masking tokens so the model can attend to both source and target contexts to align representations.
- Fine-tune pretrained XLMs on cross-lingual classification tasks by adding a linear classifier on the first hidden state and training on English NLI data while evaluating in 15 languages.
- Evaluate unsupervised MT by initializing encoder/decoder with various pretraining schemes (EMB, CLM, MLM) and training with denoising auto-encoding and back-translation.
- Evaluate supervised MT by pretraining with CLM/MLM and training on WMT’16 Romanian-English.
- Demonstrate improved perplexities for low-resource language modeling when mixing related language data.
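The MLM and TLM steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `[MASK]` and `</s>` strings, and the toy vocabulary are assumptions made for illustration. The 80/10/10 split among selected tokens (mask, random token, unchanged) follows the BERT-style recipe that the paper adopts, and the restart of position indices for the target sentence follows the paper's description of TLM. For simplicity, the sketch does not exclude separator tokens from masking.

```python
import random

MASK = "[MASK]"  # assumed mask symbol, for illustration only


def mlm_mask(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted, targets); targets[i] is None where nothing is predicted."""
    if rng is None:
        rng = random.Random()
    corrupted = list(tokens)
    targets = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        targets[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, targets


def tlm_input(src_tokens, tgt_tokens, src_lang, tgt_lang, sep="</s>"):
    """Concatenate a translation pair into one TLM input.

    Positions restart at 0 for the target sentence, and every token
    carries the language id of the sentence it belongs to.
    """
    src = [sep] + list(src_tokens) + [sep]
    tgt = [sep] + list(tgt_tokens) + [sep]
    tokens = src + tgt
    positions = list(range(len(src))) + list(range(len(tgt)))
    langs = [src_lang] * len(src) + [tgt_lang] * len(tgt)
    return tokens, positions, langs


if __name__ == "__main__":
    tokens, positions, langs = tlm_input(
        ["the", "cat", "sleeps"], ["le", "chat", "dort"], "en", "fr"
    )
    # Masking the concatenated pair lets a masked English token be
    # predicted from its French translation, and vice versa.
    corrupted, targets = mlm_mask(tokens, vocab=["dog", "chien"], rng=random.Random(0))
    print(corrupted, targets, positions, langs, sep="\n")
```

Because masking is applied to the concatenated pair, a token masked in one language can be recovered by attending to the other language's sentence, which is what encourages aligned representations.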
Experimental results
Research questions
- RQ1: Can unsupervised cross-lingual objectives (CLM, MLM) produce transferable multilingual representations without parallel data?
- RQ2: Does incorporating a supervised cross-lingual objective (TLM) leveraging parallel data improve cross-lingual transfer?
- RQ3: How do XLM pretraining methods affect cross-lingual classification (XNLI) and machine translation (unsupervised and supervised)?
- RQ4: What is the impact of cross-lingual pretraining on low-resource languages and cross-lingual word embeddings?
Key findings
- Unsupervised MLM pretraining alone achieves strong cross-lingual classification performance, and adding the supervised TLM objective (MLM+TLM) provides a substantial further boost.
- On XNLI, MLM+TLM raises average zero-shot accuracy by 4.9% absolute over the previous state of the art (Artetxe and Schwenk, 2018).
- Unsupervised MT benefits significantly from MLM pretraining, reaching 34.3 BLEU on WMT’16 German-English (surpassing prior state of the art by >9 BLEU).
- Supervised MT benefits from pretraining, with Romanian-English reaching 38.5 BLEU, surpassing prior SOTA by >4 BLEU.
- Cross-lingual pretraining lowers Nepali language-model perplexity when Hindi/English data is added (e.g., training on Nepali+Hindi yields a perplexity of 115.6 vs 157.2 for Nepali alone).
- XLM embeddings outperform MUSE and Concat on the SemEval’17 cross-lingual word similarity task and place word translation pairs closer together (higher cosine similarity, lower L2 distance).
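To put the Nepali perplexity figures in context, the short sketch below shows how perplexity relates to per-token log-probabilities and computes the relative reduction implied by the two numbers reported above. The function names are hypothetical, and the snippet only re-expresses the reported values; it does not reproduce the paper's experiment.

```python
import math


def perplexity(token_log_probs):
    """Perplexity is the exponential of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Reported figures: Nepali alone (157.2) vs Nepali+Hindi (115.6).
print(f"{relative_reduction(157.2, 115.6):.1%}")  # prints 26.5%
```

Since lower perplexity means the model assigns higher probability to held-out text, adding Hindi data corresponds to roughly a quarter less perplexity on Nepali.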
This review was created by AI and reviewed by human editors.