QUICK REVIEW

[Paper Review] WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models

Benjamin Minixhofer, Fabian Paischer|arXiv (Cornell University)|Dec 13, 2021

Topic Modeling|References: 47|Cited by: 25

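ONE-SENTENCE SUMMARY

WECHSEL transfers a monolingual language model to a new language by initializing its subword embeddings from multilingual static word embeddings, performing on par with or better than comparable-size models trained from scratch with up to 64x less training effort. It outperforms both random initialization and the earlier transfer method TransInner across several languages, including low-resource ones.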

ABSTRACT

Large pretrained language models (LMs) have become the central building block of many NLP applications. Training these models requires ever more computational resources and most of the existing models are trained on English text only. It is exceedingly expensive to train these models in other languages. To alleviate this problem, we introduce a novel method -- called WECHSEL -- to efficiently and effectively transfer pretrained LMs to new languages. WECHSEL can be applied to any model which uses subword-based tokenization and learns an embedding for each subword. The tokenizer of the source model (in English) is replaced with a tokenizer in the target language and token embeddings are initialized such that they are semantically similar to the English tokens by utilizing multilingual static word embeddings covering English and the target language. We use WECHSEL to transfer the English RoBERTa and GPT-2 models to four languages (French, German, Chinese and Swahili). We also study the benefits of our method on very low-resource languages. WECHSEL improves over proposed methods for cross-lingual parameter transfer and outperforms models of comparable size trained from scratch with up to 64x less training effort. Our method makes training large language models for new languages more accessible and less damaging to the environment. We make our code and models publicly available.

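RESEARCH MOTIVATION AND GOALS

Address the high compute cost and environmental burden of training large language models from scratch in languages other than English.
Make cross-lingual transfer more efficient by initializing subword embeddings from multilingual static word embeddings.
Transfer monolingual models such as RoBERTa and GPT-2 to low- and medium-resource languages effectively with very little training.
Reduce reliance on large multilingual models, whose performance suffers from the "curse of multilinguality".
Make training large language models in new languages more feasible and environmentally sustainable.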

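PROPOSED METHOD

Copy all non-embedding parameters of the English source model into the target-language model.
Replace the English tokenizer with a target-language tokenizer so that text in the new language is split into suitable subwords.
Initialize each target-language subword embedding from the embeddings of semantically similar English subwords.
Find those similar subwords with multilingual static word embeddings (e.g. fastText), which place English and target-language subwords in a shared space.
Continue training the transferred model on the target language for comparatively few steps, which cuts training cost sharply relative to training from scratch.
The method applies to both encoder (RoBERTa) and decoder (GPT-2) architectures and to several languages, including low-resource ones.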
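
The embedding-initialization step listed above can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: it assumes the static embeddings for source and target subwords are already aligned in one shared space, and the function name `init_target_embeddings`, the neighbor count `k`, the softmax `temperature`, and all toy vectors are hypothetical choices made for this example.

```python
import numpy as np

def init_target_embeddings(src_static, tgt_static, src_model_emb, k=2, temperature=0.1):
    """Initialize target subword embeddings from similar source subwords.

    src_static:    (n_src, s) static vectors for source subwords (assumed aligned)
    tgt_static:    (n_tgt, s) static vectors for target subwords (same space)
    src_model_emb: (n_src, d) embedding matrix of the pretrained source model
    k, temperature: illustrative hyperparameters, not taken from this summary
    """
    # Normalize rows so that dot products are cosine similarities.
    src_n = src_static / np.linalg.norm(src_static, axis=1, keepdims=True)
    tgt_n = tgt_static / np.linalg.norm(tgt_static, axis=1, keepdims=True)
    sim = tgt_n @ src_n.T  # shape (n_tgt, n_src)

    # For each target subword, pick the k most similar source subwords.
    top = np.argsort(-sim, axis=1)[:, :k]
    top_sim = np.take_along_axis(sim, top, axis=1)

    # Turn the k similarities into weights with a temperature softmax.
    logits = top_sim / temperature
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Each target embedding is a weighted average of source model embeddings.
    return np.einsum("tk,tkd->td", weights, src_model_emb[top])

# Toy data: 3 source subwords, 2 target subwords, 2-d static space, 4-d model space.
src_static = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tgt_static = np.array([[0.9, 0.1], [0.1, 0.9]])
src_model_emb = np.eye(3, 4)  # one-hot rows make the mixing weights easy to read

tgt_model_emb = init_target_embeddings(src_static, tgt_static, src_model_emb)
print(tgt_model_emb.round(3))
```

Every other parameter of the source model would be copied unchanged; only the embedding matrix is rebuilt this way before training continues on target-language text.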

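EXPERIMENTAL RESULTS

Research Questions

RQ1: Can multilingual static word embeddings be used to initialize subword embeddings effectively, improving cross-lingual transfer of monolingual language models?
RQ2: Compared with random initialization or earlier transfer methods, does WECHSEL significantly reduce the number of training steps needed to reach high performance?
RQ3: How effective is WECHSEL on low-resource languages, where data and compute are limited?
RQ4: Does WECHSEL outperform models of the same size that were trained from scratch with significantly more compute?
RQ5: Unlike methods such as TransInner, does WECHSEL need the non-embedding parameters to be frozen?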

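Key Findings

WECHSEL outperforms randomly initialized models (FullRand) and TransInner across all languages and tasks: named entity recognition (NER) and natural language inference (NLI) for RoBERTa, and perplexity for GPT-2.
For RoBERTa, WECHSEL performs on par with or better than comparable-size models trained from scratch, such as CamemBERT (French) and GBERT-Base (German), on NER and NLI with up to 64x less training effort; results are reported for French, German, Chinese and Swahili.
For GPT-2, WECHSEL reaches lower perplexity than FullRand and TransInner on both medium- and low-resource languages, with consistent gains even when very little data is available.
On low-resource languages (Sundanese, Scottish Gaelic, Uyghur and Malagasy), WECHSEL's gains grow as data becomes scarcer, which suggests greater robustness in low-resource settings.
With WECHSEL, freezing the non-embedding parameters is not necessary, whereas TransInner requires it; this suggests that the semantic initialization stabilizes the model from the start of training.
Overall, the method cuts the training effort needed for an effective monolingual language model in a new language by up to 64x relative to comparable models trained from scratch.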
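
For readers unfamiliar with the GPT-2 metric used above, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The snippet below is a small sketch of that standard definition; the helper name `perplexity` and the toy log-probabilities are illustrative and do not come from the paper.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_log_probs = [math.log(1 / 4)] * 8
ppl = perplexity(uniform_log_probs)
print(ppl)
```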


This review was generated by AI and checked by human editors.