QUICK REVIEW

[Paper Review] XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models

Davis Liang, Hila Gonen|arXiv (Cornell University)|Jan 25, 2023

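Topic Modeling | Cited by 10

One-Sentence Summary

XLM-V introduces a 1M-token multilingual vocabulary with clustered, language-specific capacity to overcome the vocabulary bottleneck, achieving consistent gains over XLM-R across diverse multilingual tasks, especially for low-resource languages.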

ABSTRACT

Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), to named entity recognition (WikiAnn). XLM-V is particularly effective on low-resource language tasks and outperforms XLM-R by 11.2% and 5.8% absolute on MasakhaNER and Americas NLI, respectively.

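Motivation and Objectives

Motivate and address the vocabulary bottleneck in large multilingual models by expanding the vocabulary capacity of each language cluster.
Develop a scalable method for building large multilingual vocabularies that de-emphasizes cross-lingual token sharing when lexical overlap is low.
Pretrain and evaluate a multilingual model with a 1M-token vocabulary to measure performance gains across tasks and languages.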

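Proposed Method

Train a SentencePiece vocabulary based on the unigram language model (ULM) for each language on CC100-derived data.
Represent each language by a lexical fingerprint built from the unigram log probabilities in its vocabulary.
Cluster the languages with K-Means over these lexical fingerprints, forming language clusters that limit token sharing across clusters.
Allocate vocabulary capacity to each cluster using ALP-guided capacity allocation (scaled to a target total, e.g., 1M).
Train a SentencePiece model (SPM) for each cluster and combine the cluster vocabularies into a single multilingual vocabulary.
Pretrain a 12-layer Transformer on CC100 with the MLM objective (1.5M iterations, 1M-token vocabulary) without approximate-softmax tricks; evaluate via cross-lingual transfer.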
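
The capacity-allocation step can be sketched in a few lines of Python. This sketch assumes that ALP stands for average log probability, i.e. the mean over sentences of the summed unigram log probability of each sentence's pieces; the digest does not spell the definition out. The greedy allocator, the function names, and the toy ALP curves are illustrative assumptions, not the paper's exact procedure.

```python
import math


def average_log_probability(tokenized_sentences, unigram_logprob):
    """ALP (assumed definition): mean over sentences of the summed
    unigram log probability of the pieces in each sentence."""
    total = 0.0
    for pieces in tokenized_sentences:
        total += sum(unigram_logprob[piece] for piece in pieces)
    return total / len(tokenized_sentences)


def allocate_capacity(alp_curves, budget, step):
    """Hypothetical greedy allocator (an assumption, not the paper's algorithm).

    alp_curves[cluster] maps a candidate vocabulary size (a multiple of
    `step`) to the ALP measured with a vocabulary of that size.
    Every cluster starts with `step` slots; each further `step` goes to
    the cluster whose ALP improves the most.
    """
    sizes = {cluster: step for cluster in alp_curves}
    used = step * len(alp_curves)
    while used + step <= budget:
        best, best_gain = None, -math.inf
        for cluster, curve in alp_curves.items():
            next_size = sizes[cluster] + step
            if next_size in curve:
                gain = curve[next_size] - curve[sizes[cluster]]
                if gain > best_gain:
                    best, best_gain = cluster, gain
        if best is None:  # no cluster has a larger measured size left
            break
        sizes[best] += step
        used += step
    return sizes


# Toy ALP curves (made-up numbers): cluster "a" benefits more from extra capacity.
toy_curves = {
    "a": {10: -50.0, 20: -40.0, 30: -38.0},
    "b": {10: -30.0, 20: -29.0, 30: -28.5},
}
print(allocate_capacity(toy_curves, budget=40, step=10))
```

The design intuition is that a higher (less negative) ALP means the vocabulary explains the cluster's text with fewer, more probable pieces, so capacity flows to the clusters where extra tokens help most.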

Figure 1: Similar to Chung et al. (2020), we also leverage the per-language sentencepiece vocabularies as a “lexical fingerprint” for clustering. However, instead of using binary vectors, we use the unigram log probability instead.
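
A minimal sketch of the fingerprint-and-cluster idea follows, using only the Python standard library. Each language becomes a vector over the union of all per-language vocabularies, holding the unigram log probability of each token. The value used for tokens missing from a language (`missing_logprob`), the farthest-point initialization, the helper names, and the toy vocabularies are all assumptions made for illustration; the source does not specify them.

```python
def build_fingerprints(vocabs, missing_logprob=-20.0):
    """Map each language to a vector of unigram log probabilities over the
    union vocabulary. `missing_logprob` is an assumed floor for tokens the
    language's vocabulary does not contain."""
    tokens = sorted({tok for vocab in vocabs.values() for tok in vocab})
    return {
        lang: [vocab.get(tok, missing_logprob) for tok in tokens]
        for lang, vocab in vocabs.items()
    }


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain K-Means with deterministic farthest-point initialization."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def cluster_languages(vocabs, k):
    fingerprints = build_fingerprints(vocabs)
    langs = list(fingerprints)
    labels = kmeans([fingerprints[lang] for lang in langs], k)
    return dict(zip(langs, labels))


# Toy per-language vocabularies with made-up log probabilities.
toy_vocabs = {
    "es": {"de": -2.0, "la": -2.3, "casa": -6.0},
    "pt": {"de": -2.1, "a": -2.4, "casa": -6.2},
    "ru": {"в": -2.0, "и": -2.2, "не": -2.9},
    "uk": {"в": -2.1, "і": -2.3, "не": -3.0},
}
clusters = cluster_languages(toy_vocabs, k=2)
print(clusters)
```

Compared with binary presence vectors, log-probability entries also reflect how frequent a shared token is in each language, so two languages that share only rare tokens end up farther apart.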

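Experimental Results

Research Questions

RQ1: Can a larger, language-aware multilingual vocabulary improve performance on cross-lingual transfer and multilingual tasks?
RQ2: Can language-aware vocabulary allocation reduce over-tokenization and improve performance on low-resource languages?
RQ3: What are the trade-offs in training speed and model capacity when using a 1M vocabulary instead of 250K?
RQ4: Is there a Zipf-style ceiling, where growing the vocabulary beyond 1M yields diminishing returns or degraded performance?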
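
To make the capacity side of the 250K versus 1M trade-off concrete, the short calculation below counts input-embedding parameters as vocabulary size times hidden size. The hidden size of 768 is an assumption (a common width for 12-layer Transformers); this digest does not state the model's hidden size, and the function name is hypothetical.

```python
def embedding_parameters(vocab_size: int, hidden_size: int) -> int:
    """Parameters in a token-embedding matrix: one hidden-size vector per token."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 768  # assumption: not stated in the source

for name, vocab_size in [("250K", 250_000), ("1M", 1_000_000)]:
    params = embedding_parameters(vocab_size, HIDDEN_SIZE)
    print(f"{name} vocabulary: {params / 1e6:.0f}M embedding parameters")
```

Under this assumption the embedding table grows fourfold, and the output softmax over the vocabulary grows by the same factor when no approximation is used, which is where the training-speed cost comes from.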

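Key Findings

XLM-V outperforms XLM-R in cross-lingual transfer on every multilingual task tested (XNLI, MLQA, XQuAD, TyDiQA, WikiAnn), with an average gain of about 3.5 points.
XLM-V gains substantially on low-resource languages: XNLI accuracy improves by 4.7% absolute on Swahili and 2.9% on Urdu, and MasakhaNER F1 improves by 11.2% absolute.
On Americas NLI, XLM-V improves zero-shot performance, with large absolute F1 gains on Quechua and Guaraní (18.2% and 17.2%, respectively).
Tokenization with the 1M vocabulary yields shorter outputs and semantically meaningful pieces (for example, segmenting Chinese sentences into meaningful units).
Scaling the vocabulary beyond 1M tokens can degrade downstream performance, suggesting a Zipf ceiling: most content is already covered, and the tail tokens contribute little signal.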

Figure 2: We compare the performance of the same model trained with different sentencepiece vocabularies. The models are all trained for 300K iterations with a batch size of 2,048 on the CC100 corpus.


This review was generated by AI and reviewed by human editors.