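[Paper Summary] Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
The paper shows that pre-training compact Transformer models, then distilling from a large teacher model and optionally fine-tuning, performs competitively with, or better than, more elaborate compression methods across a range of model sizes and data conditions.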
Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to down-stream tasks, several model compression techniques on pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). However, surprisingly, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and fine-tuning pre-trained compact models can be competitive to more elaborate methods proposed in concurrent work. Starting with pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting simple, yet effective and general algorithm, Pre-trained Distillation, brings further improvements. Through extensive experiments, we more generally explore the interaction between pre-training and distillation under two variables that have been under-studied: model size and properties of unlabeled task data. One surprising observation is that they have a compound effect even when sequentially applied on the same data. To accelerate future research, we will make our 24 pre-trained miniature BERT models publicly available.
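Motivation and Objectives
- Show that pre-training compact models improves end-task performance under memory and latency constraints.
- Show that combining pre-training with distillation (and optional fine-tuning) is competitive with, or better than, existing compression methods.
- Analyze how model size and the amount of unlabeled data affect the gains from pre-training and distillation.
- Study the interaction between language-model pre-training and task-specific distillation when they are applied sequentially on the same data.
- Release a set of pre-trained miniature BERT models to accelerate future research.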
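Proposed Method
- Apply a three-step training procedure: MLM pre-training on a large unlabeled language-model corpus, distillation from a high-capacity teacher using soft labels on unlabeled transfer data, and optional fine-tuning on labeled data.
- Compare Pre-trained Distillation (PD) against baselines: basic training, standard distillation, and pre-training followed by fine-tuning (PF).
- Vary the student across 24 compact model sizes (4M to 110M parameters) and evaluate under different amounts of unlabeled data and degrees of domain similarity.
- Evaluate on GLUE-style tasks and several datasets (MNLI, RTE, SST-2, Book Reviews) to study robustness to transfer-set size and domain shift.
- Analyze the compound effect of pre-training and distillation, and compare against concurrent model-compression work.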
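The distillation step of the three-step procedure can be sketched as follows. This is a minimal pure-Python illustration, assuming the student is trained to minimize the cross-entropy between the teacher's soft labels and its own predicted distribution on unlabeled transfer examples. The names `softmax` and `distillation_loss` and the toy logits are hypothetical and are not taken from the paper's code.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_probs, student_logits):
    """Cross-entropy between the teacher's soft labels and the student's predictions.

    Assumption: the teacher's full predictive distribution is used as the
    target, so no gold label is needed for the transfer example.
    """
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


if __name__ == "__main__":
    # Toy 3-class example: the teacher is fairly confident in class 0.
    teacher_probs = softmax([2.0, 0.5, -1.0])

    agreeing_student = [2.0, 0.5, -1.0]      # same ranking as the teacher
    disagreeing_student = [-1.0, 0.5, 2.0]   # reversed ranking

    print("loss (agreeing):   ", distillation_loss(teacher_probs, agreeing_student))
    print("loss (disagreeing):", distillation_loss(teacher_probs, disagreeing_student))
```

The loss is smallest when the student reproduces the teacher's distribution, where it equals the entropy of the teacher's soft labels. In PD the student entering this step would already be MLM pre-trained rather than randomly initialized.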
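Experimental Results
Research Questions
- RQ1: Does pre-training the Transformer layers of compact models improve end-task performance relative to the standard distillation or PF baselines?
- RQ2: How do model size and the size and domain of the unlabeled data affect the gains from pre-training and distillation?
- RQ3: Do LM pre-training and distillation yield compound gains when applied sequentially on the same data?
- RQ4: How robust is Pre-trained Distillation to transfer-set size and to domain shift between the labeled and unlabeled data?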
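Key Findings
- Pre-training plus distillation (PD) consistently outperforms the baselines across multiple tasks and model sizes.
- For pre-trained compact models, depth is more valuable than width; pre-training lets students make more effective use of depth.
- PD can match or even exceed the teacher's performance with models far smaller than the teacher and with less transfer data than plain distillation requires.
- PF is competitive when the transfer set is not significantly larger than the labeled set, but PD is better overall and is particularly robust to variations in the transfer data.
- PD is more robust than standard distillation to domain shift between the labeled and transfer data, and chaining pre-training with distillation yields compound gains.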
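To make the depth-versus-width trade-off and the "4M to 110M parameters" range concrete, the sketch below estimates parameter counts for a BERT-style encoder. It rests on several assumptions that this summary does not state: a 30,522-token WordPiece vocabulary, 512 position embeddings, a feed-forward size of 4x the hidden size, a pooler layer, and a grid of 6 depths by 4 widths for the 24 students. The helper names are hypothetical, and the estimates may differ slightly from published figures depending on which output heads are counted.

```python
VOCAB_SIZE = 30522     # assumption: BERT uncased WordPiece vocabulary
MAX_POSITIONS = 512    # assumption: standard BERT position embeddings
TYPE_VOCAB = 2         # assumption: two segment (token type) embeddings

DEPTHS = [2, 4, 6, 8, 10, 12]    # assumed number of Transformer layers
WIDTHS = [128, 256, 512, 768]    # assumed hidden sizes


def bert_param_count(num_layers, hidden_size):
    """Approximate parameter count of a BERT-style encoder (no task head)."""
    h = hidden_size
    ffn = 4 * h            # assumption: feed-forward size is 4x hidden size
    layer_norm = 2 * h     # scale and bias

    embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm          # Q, K, V, output projections
    feed_forward = (h * ffn + ffn) + (ffn * h + h) + layer_norm
    pooler = h * h + h

    return embeddings + num_layers * (attention + feed_forward) + pooler


def size_grid():
    """Parameter estimates for every (depth, width) pair in the assumed grid."""
    return {(l, h): bert_param_count(l, h) for l in DEPTHS for h in WIDTHS}


if __name__ == "__main__":
    for (layers, hidden), count in sorted(size_grid().items()):
        print(f"L={layers:2d} H={hidden:3d}  ~{count / 1e6:6.1f}M parameters")
```

Under these assumptions the smallest configuration (2 layers, hidden size 128) lands at roughly 4.4M parameters and the largest (12 layers, hidden size 768) at roughly 109.5M. Per-layer cost grows linearly with depth but roughly quadratically with width, so a 12-layer model at width 256 is about half the size of a 6-layer model at width 512. That is one way to read the finding that pre-trained students benefit more from added depth.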
This summary was generated by AI and reviewed by a human editor.