QUICK REVIEW

[Paper Review] Reducing Transformer Depth on Demand with Structured Dropout

Angela Fan, Édouard Grave|arXiv (Cornell University)|Sep 25, 2019
Topic Modeling | References: 56 | Cited by: 273
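One-Sentence Summary

LayerDrop trains a single over-parameterized Transformer so that sub-networks of any depth can be extracted at inference time without finetuning, yielding efficient on-demand models that retain strong performance.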

ABSTRACT

Overparameterized transformer networks have obtained state of the art results in various natural language processing tasks, such as machine translation, language modeling, and question answering. These models contain hundreds of millions of parameters, necessitating a large amount of computation and making them prone to overfitting. In this work, we explore LayerDrop, a form of structured dropout, which has a regularization effect during training and allows for efficient pruning at inference time. In particular, we show that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance. We demonstrate the effectiveness of our approach by improving the state of the art on machine translation, language modeling, summarization, question answering, and language understanding benchmarks. Moreover, we show that our approach leads to small BERT-like models of higher quality compared to training from scratch or using distillation.

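Research Motivation and Objectives

  • Motivate the need for memory- and compute-efficient Transformer models in NLP tasks.
  • Introduce a training-time regularizer that makes sub-networks of different depths robust without finetuning.
  • Show that pruning to smaller depths still achieves competitive, and in some cases state-of-the-art, performance across benchmarks.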

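Proposed Method

  • Apply random structured dropout by dropping groups of weights that align with the model's structure (for example, layers).
  • Focus on dropping entire Transformer layers (LayerDrop) to enable on-demand depth at inference time.
  • Describe pruning strategies (every other layer, search on the validation set, data-driven) and prefer the "every other" strategy for its simplicity and effectiveness.
  • State the relation p* = 1 - r/N for the optimal drop rate when a model with N layers is pruned to a target depth of r layers.
  • Train one large Transformer once; at test time, extract shallower sub-networks without finetuning.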
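
The two mechanisms in the list above, dropping whole layers at random during training and keeping every other layer at inference, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `forward_with_layerdrop` and `prune_every_other` are hypothetical, each layer is a plain Python callable standing in for a Transformer layer, and keeping the even-indexed layers is an assumption about which half survives.

```python
import random
from typing import Callable, List, Optional, Sequence

# Stand-in for a Transformer layer (assumption: any callable on a float).
Layer = Callable[[float], float]


def forward_with_layerdrop(
    layers: Sequence[Layer],
    x: float,
    drop_rate: float,
    training: bool,
    rng: Optional[random.Random] = None,
) -> float:
    """Apply layers in order; in training, skip each with probability drop_rate."""
    if rng is None:
        rng = random.Random()
    for layer in layers:
        if training and rng.random() < drop_rate:
            continue  # the whole layer is skipped; x passes through unchanged
        x = layer(x)
    return x


def prune_every_other(layers: Sequence[Layer]) -> List[Layer]:
    """Keep layers at even indices (0, 2, 4, ...); which half is kept is an assumption."""
    return [layer for i, layer in enumerate(layers) if i % 2 == 0]


# Toy model: layer k adds k to its input.
layers = [lambda x, k=k: x + k for k in range(6)]
shallow = prune_every_other(layers)

full_output = forward_with_layerdrop(layers, 0.0, drop_rate=0.5, training=False)
shallow_output = forward_with_layerdrop(shallow, 0.0, drop_rate=0.5, training=False)
print(len(shallow), full_output, shallow_output)
```

At inference (`training=False`) no layer is dropped at random, so the pruned model is deterministic. In a real Transformer, the residual connection around each layer is what makes skipping a layer behave like an identity step.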

Experimental Results

Research Questions

  • RQ1: Can LayerDrop regularize Transformers to be robust to layer-wise pruning at inference time?
  • RQ2: How does on-demand depth via LayerDrop compare to training separate smaller models or distillation across NLP tasks?
  • RQ3: What pruning strategies are effective for selecting which layers to keep when pruning?
  • RQ4: Does LayerDrop enable state-of-the-art results across translation, language modeling, summarization, QA, and NLU benchmarks?

Key Findings

  • LayerDrop regularizes very deep Transformers, stabilizing training and achieving strong results on multiple NLP benchmarks.
  • From one large pre-trained model, small, efficient sub-networks of any depth can be extracted at test time without finetuning.
  • LayerDrop-enabled pruning often outperforms training small models from scratch and standard pruning without LayerDrop, across generation and pre-training tasks.
  • Dropping entire layers is effective, and the "every other layer" strategy is a strong, simple pruning choice across tasks.
  • Pruning RoBERTa-like models trained with LayerDrop yields better results than training BERT/RoBERTa models from scratch or using distillation in several settings, especially with more data.
  • Training with larger LayerDrop improves performance when significant depth reduction is desired, aligning train-time and test-time conditions.
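
The relation p* = 1 - r/N from the method section makes this last finding concrete: the more layers are to be removed at test time, the higher the training drop rate should be. A small worked example follows, assuming N is the total number of layers and r the target depth; the helper name `optimal_drop_rate` is hypothetical.

```python
def optimal_drop_rate(target_depth: int, total_layers: int) -> float:
    """Return p* = 1 - r/N for target depth r out of N total layers."""
    if not 0 < target_depth <= total_layers:
        raise ValueError("target_depth must be in 1..total_layers")
    return 1.0 - target_depth / total_layers


# Halving a 16-layer model calls for a drop rate of 0.5;
# keeping 3 of 12 layers calls for 0.75.
print(optimal_drop_rate(8, 16))  # 0.5
print(optimal_drop_rate(3, 12))  # 0.75
```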


This review was generated by AI and reviewed by human editors.