
[Paper Walkthrough] MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Wenhui Wang, Furu Wei | arXiv (Cornell University) | Feb 25, 2020
Topic: Topic Modeling | References: 57 | Cited by: 632
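One-Sentence Summary

This paper proposes MiniLM, a task-agnostic deep self-attention distillation method. It compresses large Transformer language models by training the student to mimic the teacher's last-layer self-attention distributions and value relations, which allows flexible student architectures and achieves strong performance with significantly fewer parameters.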

ABSTRACT

Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in varieties of NLP tasks. However, these models usually consist of hundreds of millions of parameters which brings challenges for fine-tuning and online serving in real-life applications due to latency and capacity constraints. In this work, we present a simple and effective approach to compress large Transformer (Vaswani et al., 2017) based pre-trained models, termed as deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module, which plays a vital role in Transformer networks, of the large model (teacher). Specifically, we propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student. Furthermore, we introduce the scaled dot-product between values in the self-attention module as the new deep self-attention knowledge, in addition to the attention distributions (i.e., the scaled dot-product of queries and keys) that have been used in existing works. Moreover, we show that introducing a teacher assistant (Mirzadeh et al., 2019) also helps the distillation of large pre-trained Transformer models. Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines in different parameter size of student models. In particular, it retains more than 99% accuracy on SQuAD 2.0 and several GLUE benchmark tasks using 50% of the Transformer parameters and computations of the teacher model. We also obtain competitive results in applying deep self-attention distillation to multilingual pre-trained models.

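Motivation and Goals

  • Compress large pre-trained Transformer LMs (e.g., BERT) so that fine-tuning and serving become faster.
  • Propose a task-agnostic distillation framework in which the student deeply mimics the self-attention of the teacher's last layer.
  • Introduce value relations (the scaled dot-product between values in the self-attention module) as additional deep knowledge to transfer, at the cost of no extra parameters.
  • Show that a smaller student (e.g., 6 layers, hidden size 768) reaches near-teacher performance with a significant speedup.
  • Show that a teacher assistant further improves performance, especially for very small students.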

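Proposed Method

  • Train a student to deeply mimic the self-attention module of the teacher's last Transformer layer.
  • Transfer two kinds of knowledge: the self-attention distributions (scaled dot-product of queries and keys) and the value relations (scaled dot-product between values).
  • Compute the attention transfer loss as the KL divergence between the teacher's and the student's self-attention distributions.
  • Compute the value-relation transfer loss as the KL divergence between the teacher's and the student's value-relation matrices; this transfer requires no additional parameters.
  • Optionally use a teacher assistant (an intermediate-size model) to bridge the gap between teacher and student and improve performance.
  • Compare against existing task-agnostic distillation methods and show the benefits of last-layer distillation, value-relation transfer, and the teacher assistant (TA).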
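
The two transfer losses described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`split_heads`, `attention_distribution`, `value_relation`, `mean_kl`, `minilm_loss`), the random toy inputs, and the assumption that teacher and student use the same number of attention heads are all illustrative choices. It takes last-layer query, key, and value projections of shape (seq_len, hidden) and sums the two KL terms with equal weight, as the abstract and the list above suggest.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(x, num_heads):
    # (seq_len, hidden) -> (num_heads, seq_len, head_dim)
    seq_len, hidden = x.shape
    head_dim = hidden // num_heads
    return x.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

def attention_distribution(q, k):
    # Scaled dot-product of queries and keys, normalized per query position.
    d_k = q.shape[-1]
    return softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k))

def value_relation(v):
    # Scaled dot-product between values, normalized per position.
    d_k = v.shape[-1]
    return softmax(v @ v.transpose(0, 2, 1) / np.sqrt(d_k))

def mean_kl(p_teacher, p_student, eps=1e-12):
    # KL(teacher || student) per row, averaged over heads and positions.
    kl = np.sum(
        p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)),
        axis=-1,
    )
    return float(kl.mean())

def minilm_loss(teacher_qkv, student_qkv, num_heads):
    # Assumption: both models use the same number of heads, so every
    # compared matrix has shape (num_heads, seq_len, seq_len).
    tq, tk, tv = (split_heads(x, num_heads) for x in teacher_qkv)
    sq, sk, sv = (split_heads(x, num_heads) for x in student_qkv)
    loss_attention = mean_kl(attention_distribution(tq, tk),
                             attention_distribution(sq, sk))
    loss_value_relation = mean_kl(value_relation(tv), value_relation(sv))
    return loss_attention + loss_value_relation

# Toy inputs: the teacher hidden size (64) differs from the student's (32).
rng = np.random.default_rng(0)
seq_len, num_heads = 8, 4
teacher = tuple(rng.normal(size=(seq_len, 64)) for _ in range(3))
student = tuple(rng.normal(size=(seq_len, 32)) for _ in range(3))
loss = minilm_loss(teacher, student, num_heads)
print(loss)
```

Because both the attention distributions and the value relations have shape (num_heads, seq_len, seq_len), the hidden size drops out of the comparison. In this sketch the student can therefore be narrower than the teacher without any projection layer, which is consistent with the "no additional parameters" point above.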

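Experimental Results

Research Questions

  • RQ1: Does task-agnostic distillation remain effective when the student mimics only the self-attention of the teacher's last layer?
  • RQ2: Does transferring value relations in addition to attention distributions yield deeper mimicry and better student performance?
  • RQ3: Does introducing a teacher assistant improve distillation, especially for smaller students?
  • RQ4: Does the method support flexible student architectures (variable number of layers and hidden size) without layer-to-layer mapping?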

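Main Findings

  • A 6-layer MiniLM student with hidden size 768, distilled from BERT-BASE, keeps high performance on SQuAD 2.0 and GLUE tasks while running significantly faster.
  • Transferring both attention distributions and value relations from the teacher's last layer gives measurable gains over using attention distributions alone and over other baselines.
  • Value-relation transfer provides deeper self-attention mimicry without introducing extra parameters, and improves results across multiple tasks and student configurations.
  • A teacher assistant further improves performance for smaller students, helping to close the gap between teacher and student.
  • Applied to multilingual models, MiniLM remains competitive while using significantly fewer Transformer parameters.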

This walkthrough was generated by AI and reviewed by human editors.