QUICK REVIEW

[Paper Review] On the Inductive Bias of Dropout

David P. Helmbold, Philip M. Long|arXiv (Cornell University)|Dec 15, 2014

Stochastic Gradient Optimization Techniques | References: 16 | Cited by: 18

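One-Sentence Summary

This paper gives a theoretical analysis of dropout as a regularizer for linear classification, showing that it induces a non-convex inductive bias that favors sparse models with large weights. Unlike L2 regularization, the dropout penalty can be non-monotonic and non-convex, which yields a stronger preference for rare features and imposes a distinctive constraint on co-adaptation.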

ABSTRACT

Dropout is a simple but effective technique for learning in neural networks and other settings. A sound theoretical understanding of dropout is needed to determine when dropout should be applied and how to use it most effectively. In this paper we continue the exploration of dropout as a regularizer pioneered by Wager et al. We focus on linear classification where a convex proxy to the misclassification loss (i.e. the logistic loss used in logistic regression) is minimized. We show: (a) when the dropout-regularized criterion has a unique minimizer, (b) when the dropout-regularization penalty goes to infinity with the weights, and when it remains bounded, (c) that the dropout regularization can be non-monotonic as individual weights increase from 0, and (d) that the dropout regularization penalty may not be convex. This last point is particularly surprising because the combination of dropout regularization with any convex loss proxy is always a convex function. In order to contrast dropout regularization with $L_2$ regularization, we formalize the notion of when different sources are more compatible with different regularizers. We then exhibit distributions that are provably more compatible with dropout regularization than $L_2$ regularization, and vice versa. These sources provide additional insight into how the inductive biases of dropout and $L_2$ regularization differ. We provide some similar results for $L_1$ regularization.

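Motivation and Objectives

Understand the inductive bias of dropout in linear classification, in particular how it shapes which models are preferred during training.
Formally compare how compatible dropout regularization is with different data distributions, relative to L2 and L1 regularization.
Study whether the dropout regularization penalty is convex, monotonic, or bounded as the weights grow.
Provide theoretical grounds for why dropout may outperform L2 regularization on certain data distributions.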

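Proposed Method

Formalize dropout as a random perturbation of the input features: each feature is set to zero with probability q, and the remaining features are scaled by 1/(1-q).
Define the dropout criterion as the expected logistic loss under the perturbed input distribution, and decompose it into the standard loss plus a regularization term reg_D,q(w).
Analyze the properties of reg_D,q(w), including its convexity, its monotonicity, and its behavior as a single weight increases from zero.
Construct specific data distributions to show that some sources are more compatible with dropout regularization than with L2 regularization, and vice versa.
Use concentration inequalities and Berry-Esseen bounds to analyze how the regularization penalty behaves in high-dimensional settings.
Adopt a bias-variance decomposition framework that abstracts away sampling effects and isolates the algorithm's inductive bias.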
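
The first two steps above can be written out as a short worked definition. This is a sketch in notation chosen for this review, not necessarily the paper's own: the symbols \ell, r, J_{D,q}, and L_D are assumptions introduced here, labels are assumed to lie in {-1, +1}, and the dropout mask entries are assumed to be drawn independently. Only reg_{D,q}(w) is named in the summary above.

```latex
% Logistic loss on the margin z = y (w . x), with y in {-1, +1} (assumed).
\ell(z) = \ln\bigl(1 + e^{-z}\bigr)

% Dropout mask r: each coordinate is dropped with probability q and
% otherwise rescaled, so that E[r_i] = 1 (independence is an assumption).
r_i =
\begin{cases}
  0               & \text{with probability } q, \\
  \dfrac{1}{1-q}  & \text{with probability } 1-q.
\end{cases}

% Dropout criterion: expected logistic loss on the perturbed input r \odot x.
J_{D,q}(w) = \mathbb{E}_{(x,y)\sim D}\, \mathbb{E}_{r}
             \bigl[\ell\bigl(y\, w \cdot (r \odot x)\bigr)\bigr]

% Standard (unperturbed) loss and the induced regularization term.
L_D(w) = \mathbb{E}_{(x,y)\sim D}\bigl[\ell\bigl(y\, w \cdot x\bigr)\bigr],
\qquad
\mathrm{reg}_{D,q}(w) = J_{D,q}(w) - L_D(w)
```

Under these assumptions E[r \odot x] = x, so Jensen's inequality applied to the convex loss \ell gives reg_{D,q}(w) >= 0, with equality at w = 0.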

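Results

Research Questions

RQ1: How do the inductive biases of dropout regularization and of L2 and L1 regularization resemble and differ from one another?
RQ2: Is the dropout regularization penalty convex, monotonic, or bounded as the weights grow?
RQ3: For which data distributions can dropout regularization be rigorously shown to be more compatible than L2 regularization?
RQ4: How does the dropout probability affect the strength and character of the regularization?
RQ5: Why does dropout favor rare features more than L2 regularization does, and why does it limit weight co-adaptation more effectively?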

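Key Findings

Although the overall objective is convex, the dropout regularization penalty reg_D,q(w) is not necessarily convex, which reveals a non-convex inductive bias.
As a single weight increases from zero, the penalty can be non-monotonic: increasing a weight may initially lower the penalty.
Under some conditions the penalty goes to infinity as the weights grow, while for certain data distributions it remains bounded.
Some data distributions are provably more compatible with dropout regularization and others are provably more compatible with L2 regularization, showing that the two have markedly different inductive biases.
The preference induced by dropout is stronger than that of L1 regularization, leaning further toward models that place a very large weight on a single feature.
The analysis indicates that dropout's inductive bias favors sparse models with large weights, most noticeably in high-dimensional settings where features are rare.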
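
The findings about the shape of the penalty can be probed numerically on small examples. The sketch below is not from the paper: it assumes a tiny, hypothetical dataset standing in for the distribution D, labels in {-1, +1}, and independent dropout masks, and it computes the dropout criterion exactly by enumerating all 2^n masks (so it is only practical for a handful of features). The function names and the toy data are illustrative choices.

```python
import itertools

import numpy as np


def logistic_loss(z):
    """Logistic loss ln(1 + exp(-z)), evaluated in a numerically stable way."""
    return np.logaddexp(0.0, -z)


def plain_loss(w, X, y):
    """Average logistic loss on the unperturbed inputs."""
    return float(np.mean(logistic_loss(y * (X @ w))))


def dropout_criterion(w, X, y, q):
    """Exact expected logistic loss under dropout with drop probability q.

    Every feature is zeroed with probability q and otherwise scaled by
    1 / (1 - q). The expectation is taken by enumerating all keep/drop
    patterns, which costs 2**n loss evaluations for n features.
    """
    n = X.shape[1]
    total = 0.0
    for mask in itertools.product([0, 1], repeat=n):
        keep = np.array(mask, dtype=float)
        kept = int(keep.sum())
        prob = (1.0 - q) ** kept * q ** (n - kept)  # probability of this mask
        X_perturbed = X * keep / (1.0 - q)
        total += prob * np.mean(logistic_loss(y * (X_perturbed @ w)))
    return float(total)


def dropout_penalty(w, X, y, q):
    """Regularization term: dropout criterion minus the plain loss."""
    return dropout_criterion(w, X, y, q) - plain_loss(w, X, y)


if __name__ == "__main__":
    # Hypothetical toy data: three examples, two features, labels in {-1, +1}.
    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    y = np.array([1.0, -1.0, 1.0])

    # Scan the first weight upward from zero with the second weight held fixed,
    # and print the penalty so its shape along this line can be inspected.
    for w1 in np.linspace(0.0, 6.0, 7):
        w = np.array([w1, -1.0])
        print(f"w1={w1:.1f}  penalty={dropout_penalty(w, X, y, q=0.5):.6f}")
```

Whether the printed values rise, fall, or curve in a non-convex way depends entirely on the data and on the fixed weights, so the scan is a way to explore the behavior described above rather than a reproduction of the paper's constructions. Comparing the same scan against a quadratic penalty such as `0.5 * lam * w @ w` (with a hypothetical strength `lam`) is one simple way to contrast it with L2 regularization.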

This review was generated by AI and reviewed by human editors.