QUICK REVIEW

[Paper Review] XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models

Davis Liang, Hila Gonen|arXiv (Cornell University)|Jan 25, 2023

Topic Modeling | 10 citations

TL;DR

XLM-V introduces a 1M-token multilingual vocabulary with clustered language-specific capacities to overcome the vocabulary bottleneck, achieving consistent gains over XLM-R across diverse multilingual tasks, especially for low-resource languages.

ABSTRACT

Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), to named entity recognition (WikiAnn). XLM-V is particularly effective on low-resource language tasks and outperforms XLM-R by 11.2% and 5.8% absolute on MasakhaNER and Americas NLI, respectively.

Motivation & Objective

Motivate and address the vocabulary bottleneck in massively multilingual models by expanding vocabulary capacity per language cluster.
Develop a scalable method to construct large multilingual vocabularies that de-emphasize cross-language token sharing where lexical overlap is low.
Pretrain and evaluate a multilingual model with a 1M token vocabulary to assess performance gains across multiple tasks and languages.

Proposed method

Train per-language SentencePiece vocabularies, using the unigram language model (ULM) algorithm, on CC100-derived data.
Represent each language by a language fingerprint using unigram log probabilities from per-language vocabularies.
Cluster languages with K-Means on these lexical fingerprints to form language clusters that limit cross-cluster token sharing.
Allocate per-cluster vocabulary capacity using an assignment informed by average log probability (ALP), scaled to a target total (e.g., 1M tokens).
Train per-cluster SPMs and combine cluster vocabularies into a single multilingual vocabulary.
Pretrain a 12-layer transformer with a masked language modeling (MLM) objective on CC100 (1.5M iterations, 1M-token vocabulary) without approximate softmax tricks; evaluate via cross-lingual transfer.
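
The clustering and capacity steps in the list above can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the four toy fingerprints, the `FLOOR` value, the helper names (`sq_dist`, `kmeans`, `scale_capacities`), and the per-cluster `demand` numbers are all hypothetical, and simple proportional scaling stands in for the ALP-based assignment, which is not reproduced here.

```python
FLOOR = -20.0  # assumed log probability for tokens a language never uses

# Hypothetical fingerprints: one unigram log probability per shared token axis.
languages = ["en", "hi", "de", "mr"]
fingerprints = [
    [-2.0, -3.0, FLOOR, FLOOR],  # en
    [FLOOR, FLOOR, -2.5, -3.0],  # hi
    [-2.5, -2.8, FLOOR, FLOOR],  # de
    [FLOOR, FLOOR, -2.2, -3.5],  # mr
]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    """Tiny deterministic K-Means: the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def scale_capacities(demand, total):
    """Scale per-cluster demands proportionally so they sum to `total`."""
    demand_sum = sum(demand.values())
    scaled = {c: int(total * d / demand_sum) for c, d in demand.items()}
    # Hand any rounding remainder to the largest cluster.
    biggest = max(scaled, key=scaled.get)
    scaled[biggest] += total - sum(scaled.values())
    return scaled

labels = kmeans(fingerprints, k=2)
clusters = {}
for lang, lab in zip(languages, labels):
    clusters.setdefault(lab, []).append(lang)

# Made-up per-cluster demands, scaled to a 1M-token budget.
demand = {0: 30_000, 1: 10_000}
capacities = scale_capacities(demand, total=1_000_000)
print(clusters, capacities)
```

In this toy setup the two Latin-script languages end up in one cluster and the two Devanagari-script languages in the other, so each cluster could then train its own SentencePiece model within its share of the budget.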

Figure 1: Similar to Chung et al. (2020), we leverage the per-language SentencePiece vocabularies as a "lexical fingerprint" for clustering. However, instead of binary vectors, we use unigram log probabilities.
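
To illustrate the difference the caption describes, the sketch below builds both kinds of fingerprint for two hypothetical languages that share the same tokens but use them at different frequencies. The vocabularies, the log-probability values, the `floor` default, and the function names are assumptions made for illustration only.

```python
import math

def binary_fingerprint(vocab, shared_tokens):
    """1.0 if the language's vocabulary contains the token, else 0.0."""
    return [1.0 if tok in vocab else 0.0 for tok in shared_tokens]

def logprob_fingerprint(vocab, shared_tokens, floor=-20.0):
    """Unigram log probability per token; `floor` (assumed) marks absent tokens."""
    return [vocab.get(tok, floor) for tok in shared_tokens]

# Hypothetical per-language vocabularies mapping token -> unigram log probability.
lang_a = {"the": -2.0, "ing": -4.0}
lang_b = {"the": -6.0, "ing": -3.0}
shared_tokens = ["the", "ing", "xyz"]

bin_a = binary_fingerprint(lang_a, shared_tokens)
bin_b = binary_fingerprint(lang_b, shared_tokens)
lp_a = logprob_fingerprint(lang_a, shared_tokens)
lp_b = logprob_fingerprint(lang_b, shared_tokens)

# Binary vectors cannot tell the two languages apart; log probabilities can.
binary_distance = math.dist(bin_a, bin_b)
logprob_distance = math.dist(lp_a, lp_b)
print(binary_distance, logprob_distance)
```

Because both toy languages contain exactly the same tokens, their binary fingerprints coincide, while the log-probability fingerprints remain separated by the frequency differences.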

Experimental results

Research questions

RQ1: Can a larger, language-aware multilingual vocabulary improve cross-lingual transfer and task performance across diverse languages?
RQ2: Do language-aware vocabulary allocations reduce over-tokenization and improve low-resource language performance?
RQ3: What are the trade-offs in training speed and model capacity when using a 1M-token vocabulary compared to 250K?
RQ4: Is there a Zipf-like ceiling where increasing vocabulary beyond 1M yields diminishing returns or degradation?
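
As rough context for the capacity side of RQ3, the arithmetic below compares embedding-table sizes for the two vocabulary sizes. The hidden size of 768 is an assumption (a common base-model width); this review does not state the model's hidden dimension, and `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size, hidden_size):
    """Number of parameters in a vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size

HIDDEN_SIZE = 768  # assumption: not stated in this review

small = embedding_params(250_000, HIDDEN_SIZE)
large = embedding_params(1_000_000, HIDDEN_SIZE)
ratio = large / small
print(f"250K vocab: {small:,} parameters")
print(f"1M vocab:   {large:,} parameters")
print(f"ratio:      {ratio:.1f}x")
```

Under this assumption the embedding table grows fourfold, and the MLM output layer must also score every masked position against all one million entries when no approximate softmax is used.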

Key findings

XLM-V outperforms XLM-R on all tested multilingual tasks (XNLI, MLQA, XQuAD, TyDiQA, WikiAnn) in cross-lingual transfer, with an average gain of about 3.5 points.
XLM-V achieves substantial gains on low-resource languages, e.g., +4.7% accuracy on Swahili and +2.9% on Urdu for XNLI; MasakhaNER shows +11.2% absolute F1.
XLM-V delivers zero-shot improvements on Americas NLI, with notable gains on Quechua and Guaraní (18.2% and 17.2% absolute, respectively).
Tokenization with the 1M vocabulary yields shorter outputs and semantically meaningful segments (e.g., Chinese sentence segmentation into meaningful units).
Expanding beyond 1M tokens can degrade downstream performance, indicating a Zipf ceiling where most content is already covered and tail tokens contribute little useful signal.
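
The effect of vocabulary size on tokenization length can be illustrated with a toy example. The sketch below uses greedy longest-match segmentation, which is a simplification swapped in for illustration and is not how SentencePiece's unigram model segments text; both vocabularies and the `greedy_tokenize` helper are hypothetical.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation with single-character fallback."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

text = "tokenization"
small_vocab = {"to", "ken", "iz", "ation"}        # hypothetical small vocabulary
large_vocab = small_vocab | {"token", "ization"}  # hypothetical larger vocabulary

small_tokens = greedy_tokenize(text, small_vocab)
large_tokens = greedy_tokenize(text, large_vocab)
print(small_tokens)
print(large_tokens)
```

With the larger toy vocabulary the same word is covered by fewer, longer pieces, which mirrors the shorter tokenizations reported above.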

Figure 2: We compare the performance of the same model trained with different sentencepiece vocabularies. The models are all trained for 300K iterations with a batch size of 2,048 on the CC100 corpus.

This review was created by AI and reviewed by human editors.