QUICK REVIEW

[Paper Review] Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language

Yuri Kuratov, Mikhail Arkhipov|arXiv (Cornell University)|May 17, 2019

Topic Modeling|References: 16|Cited by: 257

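ONE-SENTENCE SUMMARY

The paper shows that initializing a monolingual Russian BERT model from multilingual BERT improves performance on Russian NLP tasks and shortens training time, using a Russian-specific vocabulary whose embeddings are derived from the multilingual model.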

ABSTRACT

The paper introduces methods of adaptation of multilingual masked language models for a specific language. Pre-trained bidirectional language models show state-of-the-art performance on a wide range of tasks including reading comprehension, natural language inference, and sentiment analysis. At the moment there are two alternative approaches to train such models: monolingual and multilingual. While language specific models show superior performance, multilingual models allow to perform a transfer from one language to another and solve tasks for different languages simultaneously. This work shows that transfer learning from a multilingual model to monolingual model results in significant growth of performance on such tasks as reading comprehension, paraphrase detection, and sentiment analysis. Furthermore, multilingual initialization of monolingual model substantially reduces training time. Pre-trained models for the Russian language are open sourced.

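MOTIVATION AND OBJECTIVES

Show that transferring from multilingual BERT to a monolingual Russian model yields performance gains.
Show that multilingual initialization speeds up convergence and reduces training time for the Russian model.
Develop RuBERT with a Russian-specific vocabulary and evaluate it on Russian NLP tasks.
Release open-source pre-trained Russian models and reproducible code within the DeepPavlov ecosystem.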

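PROPOSED METHOD

Use a 12-layer BERT-base Transformer encoder for Russian, with all parameters except the word embeddings initialized from the multilingual BERT model.
Build a new Russian subword vocabulary with subword-nmt, trained on Russian Wikipedia and news data.
Assemble the new embedding matrix: tokens in the intersection of the multilingual and monolingual vocabularies reuse their multilingual embeddings, and new tokens are initialized with the mean of the embeddings of overlapping multilingual tokens.
Train the monolingual Russian model on the same data used to build the monolingual vocabulary, with batch size 256, learning rate 2e-5, the Adam optimizer, and L2 regularization of 0.01.
Evaluate on three tasks: paraphrase identification (ParaPhraser), sentiment analysis (RuSentiment), and question answering (SDSJ Task B).
Compare multilingual BERT, a monolingual Russian model trained from scratch, and the proposed RuBERT.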
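
The embedding assembly step can be illustrated with a small sketch. It is not the authors' code. It assumes that a token missing from the multilingual vocabulary is split greedily into multilingual WordPiece-style pieces (with the `##` continuation prefix) and that its embedding is the mean of those pieces' embeddings. The names `wordpiece_split` and `assemble_embeddings`, the random fallback, and its 0.02 standard deviation are all hypothetical choices made for illustration.

```python
import numpy as np


def wordpiece_split(token, vocab):
    """Greedy longest-match-first split of `token` into pieces found in `vocab`.

    Continuation pieces carry the '##' prefix. Returns [] when no full split exists.
    """
    continuation = token.startswith("##")
    text = token[2:] if continuation else token
    pieces, start = [], 0
    while start < len(text):
        end = len(text)
        match = None
        while end > start:
            piece = text[start:end]
            if start > 0 or continuation:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return []
        pieces.append(match)
        start = end
    return pieces


def assemble_embeddings(new_vocab, multi_vocab, multi_emb, seed=0):
    """Build an embedding matrix for `new_vocab` from multilingual embeddings.

    new_vocab:   list of tokens in the new monolingual vocabulary
    multi_vocab: dict mapping multilingual token -> row index in `multi_emb`
    multi_emb:   array of shape (len(multi_vocab), dim)
    """
    rng = np.random.default_rng(seed)
    dim = multi_emb.shape[1]
    new_emb = np.empty((len(new_vocab), dim), dtype=multi_emb.dtype)
    for i, token in enumerate(new_vocab):
        if token in multi_vocab:
            # Shared token: copy the multilingual embedding unchanged.
            new_emb[i] = multi_emb[multi_vocab[token]]
            continue
        pieces = wordpiece_split(token, multi_vocab)
        if pieces:
            # New token: average the embeddings of its multilingual pieces.
            rows = [multi_vocab[p] for p in pieces]
            new_emb[i] = multi_emb[rows].mean(axis=0)
        else:
            # Assumption: fall back to small random values when no split exists.
            new_emb[i] = rng.normal(0.0, 0.02, size=dim)
    return new_emb


# Toy example with a 2-dimensional embedding space.
toy_multi_vocab = {"bi": 0, "##rd": 1, "cat": 2}
toy_multi_emb = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
toy_new_emb = assemble_embeddings(["cat", "bird"], toy_multi_vocab, toy_multi_emb)
```

In the toy example, "cat" keeps its multilingual vector, while "bird" receives the average of the vectors for "bi" and "##rd". All remaining encoder weights would be copied from the multilingual checkpoint unchanged.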

EXPERIMENTAL RESULTS

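Research Questions

RQ1: Can a monolingual Russian model benefit from initialization with multilingual BERT weights?
RQ2: Does multilingual initialization speed up convergence and reduce training time for a monolingual Russian model?
RQ3: How does RuBERT perform on Russian NLP tasks compared with multilingual BERT and a model trained from scratch?
RQ4: How does a language-specific Russian vocabulary affect model efficiency and performance?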

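Key Findings

RuBERT outperforms multilingual BERT on all evaluated Russian tasks: paraphrase identification (ParaPhraser), sentiment analysis (RuSentiment), and question answering (SDSJ Task B). Best reported results: ParaPhraser F-1 87.73 and accuracy 84.99; RuSentiment F-1 84.60; SDSJ Task B EM 66.30.
Multilingual initialization converges faster than random initialization: roughly 250k steps reach a loss comparable to that of 800k steps from random initialization, saving about six days of compute on 8x Tesla P100.
RuBERT uses a Russian-specific vocabulary of about 120k subwords, which shortens the average sequence length by a factor of about 1.6 relative to the multilingual vocabulary and so allows larger batches or longer inputs.
The training dynamics show that multilingual initialization improves convergence speed and training efficiency, and that initializing new subword embeddings by averaging helps convergence.
Pre-trained Russian models and open-source code for reproducibility are available through the DeepPavlov library.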
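
The efficiency figures above can be put side by side with a short back-of-the-envelope calculation. This is only arithmetic on the rounded numbers quoted in this review, so the derived values are rough. The 512-subtoken window is an assumption (the standard BERT-base position limit), not a setting confirmed by this summary.

```python
# Figures quoted in the findings above (all approximate).
RANDOM_INIT_STEPS = 800_000   # steps needed from random initialization
MULTI_INIT_STEPS = 250_000    # steps needed from multilingual initialization
DAYS_SAVED = 6                # reported saving on 8x Tesla P100
SEQ_LEN_REDUCTION = 1.6       # average sequence length reduction factor

# Assumption: a 512-subtoken input window, the usual BERT-base limit.
MAX_SUBTOKENS = 512

steps_saved = RANDOM_INIT_STEPS - MULTI_INIT_STEPS
step_speedup = RANDOM_INIT_STEPS / MULTI_INIT_STEPS

# Throughput implied by "about six days saved"; a rough estimate only.
implied_steps_per_day = steps_saved / DAYS_SAVED

# How many multilingual subtokens the same text would need, on average,
# to match one full window of Russian-specific subtokens.
equivalent_multilingual_subtokens = MAX_SUBTOKENS * SEQ_LEN_REDUCTION

print(f"steps saved: {steps_saved}")
print(f"step speedup: {step_speedup:.1f}x")
print(f"implied steps per day: {implied_steps_per_day:.0f}")
print(f"equivalent multilingual subtokens: {equivalent_multilingual_subtokens:.1f}")
```

Read this way, the two effects compound: fewer optimization steps are needed, and each fixed-length input covers more Russian text than it would under the multilingual vocabulary.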


This review was generated by AI and checked by a human editor.