QUICK REVIEW

[Paper Review] The Implicit and Explicit Regularization Effects of Dropout

Colin Wei, Sham M. Kakade | arXiv (Cornell University) | Feb 28, 2020

Stochastic Gradient Optimization Techniques | References: 68 | Cited by: 27

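One-Sentence Summary

This paper decomposes the regularization effect of dropout into an explicit part and an implicit part: the explicit regularization arises because dropout modifies the expected loss, while the implicit regularization arises from the stochasticity that dropout noise introduces into the gradient updates. Based on derivatives of the model and the loss, the authors derive analytic, interpretable regularizers that match the performance of dropout for LSTM and Transformer models across several language modeling settings.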

ABSTRACT

Dropout is a widely-used regularization technique, often required to obtain state-of-the-art for a number of architectures. This work demonstrates that dropout introduces two distinct but entangled regularization effects: an explicit effect (also studied in prior work) which occurs since dropout modifies the expected training objective, and, perhaps surprisingly, an additional implicit effect from the stochasticity in the dropout training update. This implicit regularization effect is analogous to the effect of stochasticity in small mini-batch stochastic gradient descent. We disentangle these two effects through controlled experiments. We then derive analytic simplifications which characterize each effect in terms of the derivatives of the model and the loss, for deep neural networks. We demonstrate these simplified, analytic regularizers accurately capture the important aspects of dropout, showing they faithfully replace dropout in practice.
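
The two effects named in the abstract can be written as a simple decomposition of the dropout loss. The notation below is a sketch chosen for this review (F for the model, η for the dropout noise, L for the standard loss, R for the explicit regularizer, ξ for the zero-mean noise term) and may not match the paper's symbols exactly.

```latex
\begin{aligned}
L_{\mathrm{drop}}(F; \eta) &= L(F) + R(F) + \xi(F, \eta), \\
R(F) &= \mathbb{E}_{\eta}\!\left[ L_{\mathrm{drop}}(F; \eta) \right] - L(F), \\
\xi(F, \eta) &= L_{\mathrm{drop}}(F; \eta) - \mathbb{E}_{\eta}\!\left[ L_{\mathrm{drop}}(F; \eta) \right],
\qquad \mathbb{E}_{\eta}\!\left[ \xi(F, \eta) \right] = 0 .
\end{aligned}
```

Under this reading, R changes the objective being minimized (the explicit effect), while ξ leaves the expected objective unchanged and only perturbs each update (the implicit effect).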

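Research Motivation and Objectives

Identify and separate the explicit and implicit regularization effects of dropout in deep neural networks.
Characterize both effects theoretically in terms of derivatives of the model and the loss.
Develop simplified, interpretable regularizers that faithfully reproduce the performance of dropout in practice.
Verify empirically that these analytic regularizers can fully replace dropout in state-of-the-art language models without loss of performance.
Offer new insight into why dropout works so well in large-vocabulary settings such as language modeling.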

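Proposed Method

Define the explicit regularizer as the difference between the expected loss under dropout and the standard loss, capturing how dropout modifies the training objective.
Identify the implicit regularization effect as a consequence of the stochastic gradient updates induced by dropout noise, analogous to small mini-batch SGD.
Derive an analytic approximation of the implicit regularizer using second-order derivatives of the loss and of the model outputs.
Propose a combined regularizer that integrates the explicit and implicit effects through a stochastic approximation based on random signs.
Implement these regularizers in the training pipelines of LSTM and Transformer architectures and evaluate them on standard NLP benchmarks.
Isolate and validate each regularization effect independently through controlled experiments.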
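
The first two steps above can be illustrated on a toy problem. The sketch below is not the paper's setup: it assumes a linear model with squared loss and inverted dropout applied to the inputs, chosen only because every dropout mask can be enumerated exactly. All function names (`dropout_masks`, `explicit_regularizer`, `update_noise_stats`) are hypothetical. It computes the explicit regularizer as the expected dropout loss minus the standard loss, and checks that the remaining gradient noise has zero mean but nonzero variance.

```python
from itertools import product

def dropout_masks(d, keep_prob):
    """Yield (scaled_mask, probability) for every binary mask of length d."""
    for bits in product([0, 1], repeat=d):
        prob = 1.0
        for b in bits:
            prob *= keep_prob if b else (1.0 - keep_prob)
        # Inverted dropout: kept coordinates are rescaled by 1 / keep_prob.
        yield [b / keep_prob for b in bits], prob

def loss(w, x, y, mask=None):
    """Squared loss of a linear model, optionally with a dropout mask on x."""
    if mask is None:
        mask = [1.0] * len(x)
    pred = sum(wj * mj * xj for wj, mj, xj in zip(w, mask, x))
    return (pred - y) ** 2

def grad(w, x, y, mask):
    """Gradient of the masked squared loss with respect to w."""
    pred = sum(wj * mj * xj for wj, mj, xj in zip(w, mask, x))
    return [2.0 * (pred - y) * mj * xj for mj, xj in zip(mask, x)]

def explicit_regularizer(w, x, y, keep_prob):
    """Expected dropout loss minus the standard loss, by exact enumeration."""
    expected = sum(p * loss(w, x, y, m)
                   for m, p in dropout_masks(len(x), keep_prob))
    return expected - loss(w, x, y)

def update_noise_stats(w, x, y, keep_prob):
    """Mean and total variance of (sampled gradient - expected gradient)."""
    d = len(x)
    masks = list(dropout_masks(d, keep_prob))
    mean_grad = [sum(p * grad(w, x, y, m)[j] for m, p in masks)
                 for j in range(d)]
    noise_mean = [0.0] * d
    total_var = 0.0
    for m, p in masks:
        noise = [g - mg for g, mg in zip(grad(w, x, y, m), mean_grad)]
        noise_mean = [nm + p * n for nm, n in zip(noise_mean, noise)]
        total_var += p * sum(n * n for n in noise)
    return noise_mean, total_var

# Illustrative values only; they are not taken from the paper.
w, x, y, q = [0.5, -1.0, 2.0], [1.0, 2.0, -0.5], 0.3, 0.8
reg = explicit_regularizer(w, x, y, q)
# Known closed form for this toy case: (1 - q) / q * sum_j (w_j * x_j)^2.
closed_form = (1.0 - q) / q * sum((wj * xj) ** 2 for wj, xj in zip(w, x))
noise_mean, noise_var = update_noise_stats(w, x, y, q)
print(reg, closed_form)       # the two values agree up to floating-point error
print(noise_mean, noise_var)  # zero-mean noise with positive variance
```

In this toy case the explicit term has a closed form, so adding it to the standard loss reproduces the expected dropout objective without any sampling. The noise term cannot be recovered that way, which is why a separate regularizer is needed for the implicit effect.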

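Experimental Results

Research Questions

RQ1: How do the contributions of explicit and implicit regularization differ in dropout training?
RQ2: How can the implicit regularization effect of dropout be characterized analytically through derivatives of the model and the loss?
RQ3: Can simplified, interpretable regularizers be derived that fully replace dropout while preserving generalization performance?
RQ4: Does the implicit regularization effect depend on dataset size or on model architecture?
RQ5: Why is dropout particularly effective in large-vocabulary settings such as language modeling?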

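Key Findings

The paper shows that dropout induces both explicit and implicit regularization effects, with the latter arising from the stochastic gradient noise that dropout introduces during training.
The derived explicit regularizer depends on first- and second-order derivatives of the loss and the model, and it regularizes most strongly on moderately confident predictions (those not close to 0 or 1).
The implicit regularizer is approximated analytically using random sign vectors, capturing the generalization benefit that dropout gains from its noise.
On the Penn Treebank, WikiText-2, and WikiText-103 datasets, the combined regularizer reaches validation perplexity comparable to standard dropout (for example, 72.99 vs. 73.76, respectively, on Penn Treebank).
The implicit regularization effect is absent on the larger WikiText-103 dataset, suggesting that it depends on dataset size rather than on model architecture.
Ablation experiments confirm that the explicit regularizer's focus on intermediate-probability predictions is a key factor in the effectiveness of dropout.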
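
One possible way to see why moderately confident predictions are regularized most strongly is to look at the curvature of the loss. The sketch below assumes a softmax cross-entropy loss, for which the Hessian with respect to the logits is diag(p) - p p^T, so each diagonal entry is p_i (1 - p_i). Linking this curvature to the finding above is an assumption made for illustration, not a statement taken from the paper, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_hessian_diag(logits):
    """Diagonal of the cross-entropy Hessian w.r.t. the logits: p_i * (1 - p_i)."""
    return [p * (1.0 - p) for p in softmax(logits)]

def curvature_for_confidence(p):
    """Binary case: loss curvature when the predicted probability is p."""
    return p * (1.0 - p)

# Curvature is largest at p = 0.5 and vanishes as p approaches 0 or 1.
for p in (0.01, 0.5, 0.99):
    print(p, curvature_for_confidence(p))

uniform_diag = ce_hessian_diag([0.0, 0.0])
peaked_diag = ce_hessian_diag([10.0, 0.0])
print(uniform_diag, peaked_diag)
```

Any regularizer that weights a model-derivative term by this loss curvature would therefore contribute little on predictions that are already near 0 or 1.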


This review was generated by AI and reviewed by a human editor.