QUICK REVIEW

[Paper Review] On the Inductive Bias of Dropout

David P. Helmbold, Philip M. Long | arXiv (Cornell University) | Dec 15, 2014

Stochastic Gradient Optimization Techniques | 16 references | 18 citations

TL;DR

This paper provides a theoretical analysis of dropout as a regularizer in linear classification, showing that it induces a non-convex inductive bias that favors models with sparse, high-magnitude weights. Unlike the L2 penalty, the dropout penalty can be non-monotonic and non-convex, which leads to a stronger preference for rare features and to different constraints on weight co-adaptation.

ABSTRACT

Dropout is a simple but effective technique for learning in neural networks and other settings. A sound theoretical understanding of dropout is needed to determine when dropout should be applied and how to use it most effectively. In this paper we continue the exploration of dropout as a regularizer pioneered by Wager et al. We focus on linear classification where a convex proxy to the misclassification loss (i.e. the logistic loss used in logistic regression) is minimized. We show: (a) when the dropout-regularized criterion has a unique minimizer, (b) when the dropout-regularization penalty goes to infinity with the weights, and when it remains bounded, (c) that the dropout regularization can be non-monotonic as individual weights increase from 0, and (d) that the dropout regularization penalty may not be convex. This last point is particularly surprising because the combination of dropout regularization with any convex loss proxy is always a convex function. In order to contrast dropout regularization with $L_2$ regularization, we formalize the notion of when different sources are more compatible with different regularizers. We then exhibit distributions that are provably more compatible with dropout regularization than $L_2$ regularization, and vice versa. These sources provide additional insight into how the inductive biases of dropout and $L_2$ regularization differ. We provide some similar results for $L_1$ regularization.

Motivation & Objective

To understand the inductive bias of dropout in linear classification, particularly how it shapes model preferences during training.
To formally compare dropout regularization with L2 and L1 regularization in terms of their compatibility with different data distributions.
To investigate whether the dropout regularization penalty is convex, monotonic, or bounded as weights grow.
To provide theoretical justification for why dropout may outperform L2 regularization in certain data distributions.

Proposed method

Formalizes dropout as a stochastic perturbation of input features, where each feature is set to zero with probability q and scaled by 1/(1-q) otherwise.
Derives the dropout criterion as the expected logistic loss under the perturbed input distribution, decomposing it into a standard loss and a regularization term reg_D,q(w).
Analyzes the properties of reg_D,q(w), including its convexity, monotonicity, and behavior as individual weights increase from zero.
Constructs specific data distributions to demonstrate provable compatibility advantages of dropout over L2 regularization and vice versa.
Uses concentration inequalities and Berry-Esseen bounds to analyze the behavior of the regularization penalty in high-dimensional settings.
Employs a bias-variance decomposition framework, abstracting away sampling effects to focus on the inductive bias of the algorithm.
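
The first three steps above can be sketched numerically. This is a minimal illustration and not the authors' code: it assumes labels in {-1, +1}, the logistic loss $\log(1 + e^{-y\, w \cdot x})$, and a feature count small enough to enumerate every keep/drop mask exactly. The function names (`dropout_criterion`, `dropout_regularizer`) and the toy data are hypothetical, introduced only for this example.

```python
import itertools
import numpy as np

def logistic_loss(margin):
    # log(1 + exp(-margin)), computed in a numerically stable way
    return np.logaddexp(0.0, -margin)

def dropout_criterion(w, X, y, q):
    """Exact expected logistic loss under dropout on a finite sample.

    Each feature is zeroed independently with probability q and scaled by
    1/(1-q) otherwise. All 2^n masks are enumerated, so keep n small.
    """
    n = len(w)
    total = 0.0
    for mask in itertools.product([0, 1], repeat=n):
        mask = np.array(mask, dtype=float)
        kept = mask.sum()
        prob = (1 - q) ** kept * q ** (n - kept)  # probability of this mask
        X_pert = X * mask / (1 - q)               # perturbed inputs
        total += prob * logistic_loss(y * (X_pert @ w)).mean()
    return total

def dropout_regularizer(w, X, y, q):
    # Penalty = dropout criterion minus the loss on the unperturbed inputs
    return dropout_criterion(w, X, y, q) - logistic_loss(y * (X @ w)).mean()

# Toy data (assumed for illustration only)
X_toy = np.array([[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]])
y_toy = np.array([1.0, -1.0])
w_toy = np.array([0.3, -0.7, 1.1])
print(dropout_regularizer(w_toy, X_toy, y_toy, q=0.5))
```

Because the perturbed input has the same mean as the original input and the logistic loss is convex, Jensen's inequality implies the penalty computed this way is never negative, and it is zero at w = 0 or q = 0.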

Experimental results

Research questions

RQ1: How does dropout regularization compare to L2 and L1 regularization in terms of their inductive biases?
RQ2: Is the dropout regularization penalty convex, monotonic, or bounded as weights grow?
RQ3: Under what data distributions is dropout regularization provably more compatible than L2 regularization?
RQ4: How does the dropout probability affect the strength and nature of regularization?
RQ5: Why does dropout prefer rare features and restrict weight co-adaptation more effectively than L2 regularization?

Key findings

The dropout regularization penalty reg_D,q(w) is not convex, despite the overall objective being convex, revealing a non-convex inductive bias.
The regularization penalty can be non-monotonic as individual weights increase from zero, meaning increasing a weight may initially reduce the penalty.
The penalty can go to infinity with the weights under certain conditions, but may also remain bounded depending on the data distribution.
There exist data distributions that are provably more compatible with dropout regularization than with L2 regularization, and vice versa, demonstrating distinct inductive biases.
Dropout induces a stronger preference for models that assign very large weights to a single feature than L1 regularization does.
Theoretical analysis shows that dropout's inductive bias leads to a preference for sparse, high-magnitude weights, particularly in high-dimensional settings with rare features.
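
The bounded-penalty finding can be checked by hand in a deliberately simple case. The sketch below is not one of the paper's constructions: it assumes a source that always emits the single example x = 1, y = +1, so the dropout penalty has a closed form. With probability q the feature is zeroed (loss log 2), and otherwise it is scaled to 1/(1-q). As w grows, this penalty approaches q log 2, while an L2 penalty keeps growing. The function names and the value of `lam` are hypothetical.

```python
import numpy as np

def single_feature_dropout_penalty(w, q):
    """Dropout penalty for the one-point source x = 1, y = +1 (assumed toy case)."""
    drop_term = q * np.log(2.0)                            # feature zeroed: margin 0
    keep_term = (1 - q) * np.logaddexp(0.0, -w / (1 - q))  # feature scaled by 1/(1-q)
    clean_loss = np.logaddexp(0.0, -w)                     # loss without dropout
    return drop_term + keep_term - clean_loss

def l2_penalty(w, lam):
    # Standard squared-norm penalty, shown for contrast
    return 0.5 * lam * w ** 2

q = 0.5
for w in [0.0, 1.0, 5.0, 50.0]:
    print(w, single_feature_dropout_penalty(w, q), l2_penalty(w, lam=0.1))
```

This only illustrates the bounded case on one toy source; the paper's conditions for when the penalty diverges or stays bounded are more general.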

This review was created by AI and reviewed by human editors.