Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.
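The abstract describes chaining derivatives backwards through the whole training run by exactly reversing the dynamics of SGD with momentum. Below is a minimal sketch of that idea in JAX: a forward pass that keeps only the final weights and velocity, and a reverse pass that reconstructs each intermediate state while accumulating gradients of a validation loss with respect to the step-size schedule, the momentum schedule, and a regularization hyperparameter. The toy losses, the hyperparameter theta, and all shapes are illustrative assumptions rather than the authors' code, and the paper's trick of storing the low-order bits lost to finite precision (so the reversal is exact) is omitted here.

import jax
import jax.numpy as jnp

def train_loss(w, theta):
    # Hypothetical regularized training objective: a quadratic fit plus an
    # L2 penalty whose per-weight strength exp(theta) is the hyperparameter.
    return jnp.sum((w - 1.0) ** 2) + jnp.sum(jnp.exp(theta) * w ** 2)

def valid_loss(w):
    # Hypothetical validation objective whose hypergradient we want.
    return jnp.sum((w - 0.5) ** 2)

grad_w = jax.grad(train_loss, argnums=0)  # gradient of training loss w.r.t. weights

def sgd_momentum(w, v, alphas, gammas, theta):
    # Forward pass: plain SGD with momentum; only the final (w, v) is kept.
    for t in range(len(alphas)):
        g = grad_w(w, theta)
        v = gammas[t] * v - (1.0 - gammas[t]) * g
        w = w + alphas[t] * v
    return w, v

def hypergrads(w, v, alphas, gammas, theta):
    # Reverse pass: reconstruct (w_t, v_t) by exactly undoing each update,
    # accumulating gradients of the validation loss with respect to the
    # step sizes, momentum decays, and regularization hyperparameters.
    dw = jax.grad(valid_loss)(w)
    dv = jnp.zeros_like(v)
    dalphas = jnp.zeros_like(alphas)
    dgammas = jnp.zeros_like(gammas)
    dtheta = jnp.zeros_like(theta)
    for t in reversed(range(len(alphas))):
        dalphas = dalphas.at[t].set(jnp.vdot(dw, v))
        w = w - alphas[t] * v                         # exactly undo the weight update
        g = grad_w(w, theta)
        v = (v + (1.0 - gammas[t]) * g) / gammas[t]   # exactly undo the velocity update
        dv = dv + alphas[t] * dw
        dgammas = dgammas.at[t].set(jnp.vdot(dv, v + g))
        # Hessian-vector products of the training loss gradient against dv,
        # with respect to both the weights and the hyperparameters.
        _, vjp_fn = jax.vjp(grad_w, w, theta)
        hvp_w, hvp_theta = vjp_fn(dv)
        dw = dw - (1.0 - gammas[t]) * hvp_w
        dtheta = dtheta - (1.0 - gammas[t]) * hvp_theta
        dv = gammas[t] * dv
    return dalphas, dgammas, dtheta

# Usage: train on a 3-dimensional toy problem, then differentiate through it.
T, dim = 50, 3
alphas = jnp.full(T, 0.1)
gammas = jnp.full(T, 0.9)
theta = jnp.zeros(dim)
w_final, v_final = sgd_momentum(jnp.zeros(dim), jnp.zeros(dim), alphas, gammas, theta)
dalphas, dgammas, dtheta = hypergrads(w_final, v_final, alphas, gammas, theta)

Because the reverse loop recovers the intermediate weights instead of storing them, memory stays constant in the number of training iterations, which is what makes differentiating through thousands of steps feasible.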
@inproceedings{maclaurin2015reversible,
  author    = {Maclaurin, Dougal and Duvenaud, David and Adams, Ryan P.},
  title     = {Gradient-based Hyperparameter Optimization through Reversible Learning},
  booktitle = {Proceedings of the 32nd International Conference on Machine Learning (ICML)},
  year      = {2015},
  note      = {arXiv:1502.03492 [stat.ML]},
  keywords  = {ICML, deep learning}
}