QUICK REVIEW

[Paper Review] ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Zhenzhong Lan, Mingda Chen|arXiv (Cornell University)|Sep 26, 2019

Topic Modeling | References: 49 | Cited by: 4,061

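One-Sentence Summary

ALBERT introduces parameter-reduction techniques (factorized embeddings and cross-layer sharing) and a sentence-order prediction loss to build smaller yet stronger language models, achieving state-of-the-art results on GLUE, SQuAD, and RACE with fewer parameters than BERT-large.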

ABSTRACT

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT.

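Research Motivation and Goals

Ease the memory and training-speed limits of large pretrained language models without a significant loss in performance.
Propose parameter-reduction techniques that substantially cut the parameter count while maintaining or improving accuracy.
Introduce a self-supervised sentence-order prediction loss to strengthen the modeling of inter-sentence coherence.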

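Proposed Method

Factorized embedding parameterization decouples the vocabulary embedding size E from the hidden size H, reducing embedding parameters from O(V×H) to O(V×E + E×H), where V is the vocabulary size.
Cross-layer parameter sharing lets all transformer layers use one set of parameters, so the parameter count no longer grows with depth.
A sentence-order prediction (SOP) loss replaces next-sentence prediction (NSP) and focuses on inter-sentence coherence.
ALBERT is pretrained with the MLM and SOP losses on BookCorpus and English Wikipedia, with a 30k vocabulary and an input length of 512.
Models are fine-tuned and evaluated on GLUE, SQuAD, and RACE, and compared against BERT and other baselines under matched settings.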
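
To make the savings from factorized embeddings and cross-layer sharing concrete, the following Python sketch counts weights under each scheme. The function names and the per-layer formula (12H² + 13H, assuming a standard transformer layer with a 4H feed-forward width, biases, and two layer norms) are illustrative assumptions and are not taken from the ALBERT codebase. H = 1024, E = 128, and 24 layers are example sizes; V = 30,000 follows the 30k vocabulary stated above.

```python
from typing import Optional


def embedding_params(vocab_size: int, hidden_size: int,
                     embedding_size: Optional[int] = None) -> int:
    """Count token-embedding weights.

    Without factorization the table is V x H. With factorization it is a
    V x E table followed by an E x H projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


def transformer_layer_params(hidden_size: int) -> int:
    """Approximate weights in one layer (assumed layout, see lead-in)."""
    h = hidden_size
    attention = 4 * h * h + 4 * h      # Q, K, V, output projections + biases
    feed_forward = 8 * h * h + 5 * h   # H->4H and 4H->H projections + biases
    layer_norms = 4 * h                # two layer norms, scale and shift each
    return attention + feed_forward + layer_norms


def encoder_params(hidden_size: int, num_layers: int,
                   share_across_layers: bool) -> int:
    """Total encoder weights, with or without cross-layer sharing."""
    per_layer = transformer_layer_params(hidden_size)
    return per_layer if share_across_layers else per_layer * num_layers


if __name__ == "__main__":
    V, H, E, LAYERS = 30_000, 1024, 128, 24
    print(embedding_params(V, H))              # 30720000
    print(embedding_params(V, H, E))           # 3971072
    print(encoder_params(H, LAYERS, False))    # 302309376
    print(encoder_params(H, LAYERS, True))     # 12596224
```

Under these assumptions the factorized embedding is roughly 8x smaller than the full V×H table, and sharing makes the encoder weight count independent of depth. Sharing reduces stored parameters only; every layer is still computed at run time.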

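Experimental Results

Research Questions

RQ1: Can ALBERT match or exceed BERT's performance with significantly fewer parameters?
RQ2: Do cross-layer parameter sharing and factorized embeddings significantly affect performance and training efficiency?
RQ3: Is a coherence-oriented self-supervised pretraining objective (SOP) more beneficial for downstream tasks than NSP or other objectives?
RQ4: What are the trade-offs among model size, training speed, and accuracy on popular NLU benchmarks?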

Key Findings

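ALBERT needs far fewer parameters than BERT-large: ALBERT-large has about 18x fewer (18M vs 334M), and even ALBERT-xxlarge (235M) has only about 70% of BERT-large's 334M, while achieving better results on several tasks.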
ALBERT shows substantial downstream gains over BERT-large on development sets: SQuAD v1.1 +1.9, SQuAD v2.0 +3.1, MNLI +1.4, SST-2 +2.2, and RACE +8.4.
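ALBERT-xxlarge reaches higher GLUE and SQuAD scores than BERT-large with fewer parameters and remains competitive under a comparable training-time budget, even though each of its training steps is more expensive; on RACE, for example, it improves on BERT-large by 8.4 points.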
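The SOP loss outperforms both NSP and the setting with no inter-sentence loss, giving consistent gains on multi-sentence encoding tasks (about +1% to +2% on average).
Removing dropout and adding extra training data further improve MLM accuracy and downstream performance for the large ALBERT variants.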
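
The SOP objective compared in the findings above can be illustrated with a minimal sketch of how training pairs might be built. It assumes the construction described in the ALBERT paper: a positive pair is two consecutive segments from one document in their original order, and a negative pair is the same two segments swapped. The function name, the 50/50 sampling, and the label convention (1 = in order, 0 = swapped) are assumptions made for illustration, not the released implementation.

```python
import random
from typing import List, Tuple


def make_sop_examples(segments: List[str],
                      rng: random.Random) -> List[Tuple[str, str, int]]:
    """Build (segment_a, segment_b, label) triples from one document.

    Each pair of consecutive segments yields one example: kept in order
    (label 1) or swapped (label 0), chosen with equal probability.
    """
    examples = []
    for first, second in zip(segments, segments[1:]):
        if rng.random() < 0.5:
            examples.append((first, second, 1))   # original order
        else:
            examples.append((second, first, 0))   # swapped order
    return examples


if __name__ == "__main__":
    doc = ["He opened the door.", "The room was dark.", "He reached for the light."]
    for a, b, label in make_sop_examples(doc, random.Random(0)):
        print(label, "|", a, "|", b)
```

Because both segments always come from the same document, topic cues cannot separate positives from negatives, so the classifier has to rely on discourse order. NSP negatives, by contrast, are drawn from a different document.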

This review was generated by AI and reviewed by human editors.