In this post, I'll summarize one of my favorite papers from ICML 2013: Fast Dropout Training, by Sida Wang and Christopher Manning. This paper derives an analytic approximation to dropout, a randomized regularization method recently proposed for training deep nets that has allowed big improvements in predictive accuracy. Their approximation gives a roughly 10-times speedup under certain conditions. Much more interestingly, the authors also show strong connections to existing regularization methods, shedding light on why dropout works so well.
The idea behind dropout is very simple: each time the network is evaluated during training, half of the inputs and hidden units are randomly 'dropped' (set to zero). This simple technique produced large improvements in predictive accuracy for deep nets, but it was not immediately clear why it works so well. The original paper gives several intuitions; one proposed explanation is that individual hidden units are discouraged from developing complex (and possibly brittle) dependencies on each other. We'll return to that question later.
The key insight in the fast dropout paper is this: the total input to each node in a neural network is a weighted sum of that node's inputs, and if some of those inputs are randomly set to zero, the total input becomes a weighted sum of scaled Bernoulli random variables. By the Central Limit Theorem, this sum is well-approximated by a Gaussian when a neuron has many inputs with comparable variance. The authors derive the mean and variance of these Gaussians, which lets them approximately integrate over all exponentially-many combinations of dropouts in a one-layer network. They also approximate the output of each neuron with a Gaussian, by locally linearizing the nonlinearity in each neuron, and derive deterministic update rules for the multi-layer case. This trick leads to a roughly 10-times speedup in training. More interesting than the speedup, though, is that this approximate dropout objective turns out to be closely related to standard regularization methods.
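To make the Gaussian idea concrete, here is a minimal sketch for a single neuron (my own illustration, not code from the paper): it matches a Gaussian to the exact mean and variance of the dropout-masked weighted sum, estimates the expected sigmoid output with the standard logistic-probit matching trick, and checks the result against Monte Carlo sampling over dropout masks. The names `p_keep` and `n_samples` are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=100)          # inputs to one neuron
w = rng.normal(size=100) * 0.1    # weights
p_keep = 0.5                      # each input kept with probability 0.5

# Exact moments of s = sum_i w_i * x_i * z_i, with z_i ~ Bernoulli(p_keep)
mu = p_keep * np.dot(w, x)
var = p_keep * (1 - p_keep) * np.sum((w * x) ** 2)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Monte Carlo estimate of E[sigmoid(s)] under dropout
n_samples = 100_000
z = rng.random((n_samples, x.size)) < p_keep
s_samples = (z * (w * x)).sum(axis=1)
mc_estimate = sigmoid(s_samples).mean()

# Gaussian approximation: E[sigmoid(S)] for S ~ N(mu, var),
# using the usual sigmoid/probit matching approximation
gauss_estimate = sigmoid(mu / np.sqrt(1.0 + np.pi * var / 8.0))

print(mc_estimate, gauss_estimate)  # the two should agree closely
```

The point is that one deterministic evaluation of the Gaussian approximation replaces many random forward passes over dropout masks, which is where the speedup comes from.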
The first equivalence they show is that, if you normalize your data features to all have the same variance, least-squares linear regression trained with dropout is exactly equivalent to ridge regression! This means that dropout is a regularization technique that's invariant to rescaling of the inputs, which may be one reason why it works so well. In deep nets this may be especially helpful, since the scale of the hidden node activations can change during training.
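As a quick numerical sanity check (my own sketch, not the paper's code), under the common convention where kept inputs are rescaled by 1/p, the expected dropout objective for linear regression is the least-squares loss plus a feature-wise penalty of ((1 - p) / p) * sum_j w_j^2 * sum_i x_ij^2. Once every column has the same scale, that penalty collapses to a single ridge term:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 200, 5, 0.5

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=0)          # normalize columns to equal scale
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

col_scale = (X ** 2).sum(axis=0)        # all equal after normalization
lam = (1 - p) / p                       # ridge strength implied by dropout

# Minimizer of the expected dropout objective (diagonal, feature-wise penalty)
w_dropout = np.linalg.solve(X.T @ X + (1 - p) / p * np.diag(col_scale), X.T @ y)

# Ordinary ridge regression with a single scalar penalty
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.allclose(w_dropout, w_ridge))  # True once columns share a scale
```

Without the normalization step, the dropout penalty weights each coefficient by the scale of its feature, which is exactly why the regularizer is invariant to rescaling the inputs.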
The second equivalence they show is that a one-layer neural network trained with dropout approximately optimizes a lower bound on the model evidence of a logistic regression model. This result suggests that dropout is closely related to expectation-maximization (or, more generally, to variational Bayes). The original dropout paper mentions a similar result, namely that for one-layer networks dropout approximately optimizes the geometric mean of the likelihoods over all possible ways of removing nodes from the network, but the explicit derivations in the fast dropout paper make the link much clearer.
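One way to see the link (a sketch of the standard Jensen's-inequality argument, not a derivation copied from the paper) is to treat the dropout mask z as a latent variable:

$$
\mathbb{E}_{z}\big[\log p(y \mid x, z, w)\big] \;\le\; \log \mathbb{E}_{z}\big[p(y \mid x, z, w)\big].
$$

The left-hand side is the dropout training objective, an average over all possible masks, and exponentiating it gives the geometric mean of the per-mask likelihoods; the right-hand side is the log-likelihood of a model that marginalizes out the mask. So maximizing the dropout objective maximizes a lower bound on that marginal likelihood, which is the same structure exploited by EM and variational Bayes.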
So, this paper gives us two new ways to think about dropout - as a scale-invariant regularizer, and as an approximate variational method. The success of dropout suggests that, even for large datasets, regularization of deep nets is still very important. Perhaps we should take a second look at variational inference in deep Bayesian neural networks, or further investigate scale-invariant regularization methods.
Thanks to Oren Rippel and Roger Grosse for helpful comments.
Update: Kevin Swersky pointed out a follow-up paper with even more connections: Dropout Training as Adaptive Regularization