Understanding and Architecting Deep Neural Networks
Oktay, Deniz; McGreivy, Nick; Aduol, Joshua; Beatson, Alex; Adams, Ryan P.
Randomized Automatic Differentiation Conference
Proceedings of the International Conference on Learning Representations (ICLR), 2021.
@conference{oktay2021randomized,
title = {Randomized Automatic Differentiation},
author = {Deniz Oktay and Nick McGreivy and Joshua Aduol and Alex Beatson and Ryan P. Adams},
year = {2021},
date = {2021-04-01},
booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
abstract = {The successes of deep learning, variational inference, and many other fields have been aided by specialized implementations of reverse-mode automatic differentiation (AD) to compute gradients of mega-dimensional objectives. The AD techniques underlying these tools were designed to compute exact gradients to numerical precision, but modern machine learning models are almost always trained with stochastic gradient descent. Why spend computation and memory on exact (minibatch) gradients only to use them for stochastic optimization? We develop a general framework and approach for randomized automatic differentiation (RAD), which can allow unbiased gradient estimates to be computed with reduced memory in return for variance. We examine limitations of the general approach, and argue that we must leverage problem specific structure to realize benefits. We develop RAD techniques for a variety of simple neural network architectures, and show that for a fixed memory budget, RAD converges in fewer iterations than using a small batch size for feedforward networks, and in a similar number for recurrent networks. We also show that RAD can be applied to scientific computing, and use it to develop a low-memory stochastic gradient method for optimizing the control parameters of a linear reaction-diffusion PDE representing a fission reactor.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
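The core memory/variance trade-off described in this abstract can be illustrated on a single linear layer: the weight gradient only needs the input activations, so storing a randomly sparsified (and inverse-probability-scaled) copy of those activations for the backward pass gives an unbiased, lower-memory gradient estimate. The snippet below is a minimal numpy sketch of that idea under our own simplifying assumptions, not the paper's full RAD framework; all function and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def rad_linear_grad(W, x, t, keep_prob=0.25):
    """Unbiased estimate of dL/dW for L = 0.5*||W x - t||^2, storing only a
    sparsified copy of x for the backward pass (less memory, more variance)."""
    y = W @ x                                      # exact forward pass
    mask = rng.random(x.shape) < keep_prob         # random sparsification pattern
    x_saved = np.where(mask, x / keep_prob, 0.0)   # unbiased: E[x_saved] = x
    return np.outer(y - t, x_saved)                # backward uses only x_saved

D_out, D_in = 3, 8
W = rng.standard_normal((D_out, D_in))
x = rng.standard_normal(D_in)
t = rng.standard_normal(D_out)

exact = np.outer(W @ x - t, x)
estimate = np.mean([rad_linear_grad(W, x, t) for _ in range(20000)], axis=0)
print(np.max(np.abs(estimate - exact)))  # small: the randomized gradient is unbiased
```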
Ash, Jordan T.; Adams, Ryan P.
On warm-starting neural network training Conference
Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.
@conference{ash2020warm,
title = {On warm-starting neural network training},
author = {Jordan T. Ash and Ryan P. Adams},
year = {2020},
date = {2020-12-01},
booktitle = {Advances in Neural Information Processing Systems 33 (NeurIPS)},
abstract = {In many real-world deployments of machine learning systems, data arrive piecemeal. These learning scenarios may be passive, where data arrive incrementally due to structural properties of the problem (e.g., daily financial data) or active, where samples are selected according to a measure of their quality (e.g., experimental design). In both of these cases, we are building a sequence of models that incorporate an increasing amount of data. We would like each of these models in the sequence to be performant and take advantage of all the data that are available to that point. Conventional intuition suggests that when solving a sequence of related optimization problems of this form, it should be possible to initialize using the solution of the previous iterate---to "warm start" the optimization rather than initialize from scratch---and see reductions in wall-clock time. However, in practice this warm-starting seems to yield poorer generalization performance than models that have fresh random initializations, even though the final training losses are similar. While it appears that some hyperparameter settings allow a practitioner to close this generalization gap, they seem to only do so in regimes that damage the wall-clock gains of the warm start. Nevertheless, it is highly desirable to be able to warm-start neural network training, as it would dramatically reduce the resource usage associated with the construction of performant deep learning systems. In this work, we take a closer look at this empirical phenomenon and try to understand when and how it occurs. We also provide a surprisingly simple trick that overcomes this pathology in several important situations, and present experiments that elucidate some of its properties.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
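The "surprisingly simple trick" is not spelled out in the abstract; in the published paper it takes the form of a shrink-and-perturb re-initialization, in which the previously trained weights are scaled down and jittered with a small amount of noise before training resumes on the enlarged dataset. The sketch below shows that kind of update; the shrink factor and noise scale are illustrative placeholders, not the paper's recommended values.

```python
import numpy as np

rng = np.random.default_rng(0)

def shrink_perturb(params, shrink=0.5, noise_scale=0.01):
    """Warm-start re-initialization: scale down the previously trained weights
    and add a small random perturbation before continuing training on new data.
    shrink/noise_scale are illustrative placeholders, not tuned values."""
    return {name: shrink * w + noise_scale * rng.standard_normal(w.shape)
            for name, w in params.items()}

# toy "model": a dict of weight arrays trained on the data seen so far
old_params = {"W1": rng.standard_normal((16, 8)), "b1": np.zeros(16)}
warm_params = shrink_perturb(old_params)   # use as the init for the next round
print({k: v.shape for k, v in warm_params.items()})
```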
Liu, Sulin; Sun, Xingyuan; Ramadge, Peter J.; Adams, Ryan P.
Task-agnostic amortized inference of Gaussian process hyperparameters Conference
Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.
@conference{liu2020task,
title = {Task-agnostic amortized inference of Gaussian process hyperparameters},
author = {Sulin Liu and Xingyuan Sun and Peter J. Ramadge and Ryan P. Adams},
year = {2020},
date = {2020-12-01},
booktitle = {Advances in Neural Information Processing Systems 33 (NeurIPS)},
abstract = {Gaussian processes (GPs) are flexible priors for modeling functions. However, their success depends on the kernel accurately reflecting the properties of the data. One of the appeals of the GP framework is that the marginal likelihood of the kernel hyperparameters is often available in closed form, enabling optimization and sampling procedures to fit these hyperparameters to data. Unfortunately, point-wise evaluation of the marginal likelihood is expensive due to the need to solve a linear system; searching or sampling the space of hyperparameters thus often dominates the practical cost of using GPs. We introduce an approach to the identification of kernel hyperparameters in GP regression and related problems that sidesteps the need for costly marginal likelihoods. Our strategy is to "amortize" inference over hyperparameters by training a single neural network, which consumes a set of regression data and produces an estimate of the kernel function, useful across different tasks. To accommodate the varying dimension and cardinality of different regression problems, we use a hierarchical self-attention-based neural network that produces estimates of the hyperparameters which are invariant to the order of the input data points and data dimensions. We show that a single neural model trained on synthetic data is able to generalize directly to several different unseen real-world GP use cases. Our experiments demonstrate that the estimated hyperparameters are comparable in quality to those from the conventional model selection procedures, while being much faster to obtain, significantly accelerating GP regression and its related applications such as Bayesian optimization and Bayesian quadrature. The code and pre-trained model are available at https://github.com/PrincetonLIPS/AHGP.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
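For context on what is being amortized: the quantity that conventional model selection repeatedly evaluates is the GP log marginal likelihood, which requires solving a linear system in the n-by-n kernel matrix (roughly O(n^3) per hyperparameter setting). A minimal numpy version for an RBF kernel is sketched below; the amortization network in the paper replaces this per-dataset optimization loop with a single forward pass. The kernel choice and data here are ours, for illustration only.

```python
import numpy as np

def rbf_kernel(X, lengthscale, variance):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def gp_log_marginal_likelihood(X, y, lengthscale, variance, noise):
    """log p(y | X, theta) for GP regression; the O(n^3) Cholesky solve is what
    makes per-dataset hyperparameter search expensive."""
    n = len(y)
    K = rbf_kernel(X, lengthscale, variance) + noise * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
print(gp_log_marginal_likelihood(X, y, lengthscale=1.0, variance=1.0, noise=0.01))
```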
Beatson, Alex; Ash, Jordan T.; Roeder, Geoffrey; Xue, Tianju; Adams, Ryan P.
Learning Composable Energy Surrogates for PDE Order Reduction Conference
Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.
@conference{beatson2020composable,
title = {Learning Composable Energy Surrogates for PDE Order Reduction},
author = {Alex Beatson and Jordan T. Ash and Geoffrey Roeder and Tianju Xue and Ryan P. Adams},
url = {https://arxiv.org/abs/2005.06549},
year = {2020},
date = {2020-05-13},
booktitle = {Advances in Neural Information Processing Systems 33 (NeurIPS)},
abstract = {Meta-materials are an important emerging class of engineered materials in which complex macroscopic behaviour--whether electromagnetic, thermal, or mechanical--arises from modular substructure. Simulation and optimization of these materials are computationally challenging, as rich substructures necessitate high-fidelity finite element meshes to solve the governing PDEs. To address this, we leverage parametric modular structure to learn component-level surrogates, enabling cheaper high-fidelity simulation. We use a neural network to model the stored potential energy in a component given boundary conditions. This yields a structured prediction task: macroscopic behavior is determined by the minimizer of the system's total potential energy, which can be approximated by composing these surrogate models. Composable energy surrogates thus permit simulation in the reduced basis of component boundaries. Costly ground-truth simulation of the full structure is avoided, as training data are generated by performing finite element analysis with individual components. Using dataset aggregation to choose training boundary conditions allows us to learn energy surrogates which produce accurate macroscopic behavior when composed, accelerating simulation of parametric meta-materials.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
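The structured prediction step described above, with macroscopic behavior obtained as the minimizer of a sum of per-component surrogate energies, can be sketched in a few lines. Below, two hypothetical quadratic surrogates share a single interface degree of freedom and the composed energy is minimized by gradient descent; this only illustrates the composition idea, not the neural surrogates or finite element data used in the paper.

```python
import numpy as np

# Hypothetical per-component energy surrogates E_i(u), here simple quadratics in
# the shared interface displacement u (a real surrogate would be a neural
# network of the component's boundary conditions).
def energy_a(u):  # component A prefers u near 1.0
    return 2.0 * (u - 1.0) ** 2

def energy_b(u):  # component B prefers u near -0.5
    return 1.0 * (u + 0.5) ** 2

def total_energy(u):
    return energy_a(u) + energy_b(u)

# Minimize the composed energy over the interface dof by gradient descent
# (a finite-difference gradient keeps the sketch self-contained).
u, lr, eps = 0.0, 0.05, 1e-6
for _ in range(500):
    g = (total_energy(u + eps) - total_energy(u - eps)) / (2 * eps)
    u -= lr * g

print(u)  # analytic minimizer of 2(u-1)^2 + (u+0.5)^2 is u = 0.5
```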
Fedorov, Igor; Adams, Ryan P.; Mattina, Matthew; Whatmough, Paul N.
SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers Conference
Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.
@conference{fedorov2019sparse,
title = {SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers},
author = {Igor Fedorov and
Ryan P. Adams and
Matthew Mattina and
Paul N. Whatmough},
url = {https://www.cs.princeton.edu/~rpa/pubs/fedorov2019sparse.pdf},
year = {2019},
date = {2019-12-04},
booktitle = {Advances in Neural Information Processing Systems 32 (NeurIPS)},
abstract = {The vast majority of processors in the world are actually microcontroller units (MCUs), which find widespread use performing simple control tasks in applications ranging from automobiles to medical devices and office equipment. The Internet of Things (IoT) promises to inject machine learning into many of these every-day objects via tiny, cheap MCUs. However, these resource-impoverished hardware platforms severely limit the complexity of machine learning models that can be deployed. For example, although convolutional neural networks (CNNs) achieve state-of-the-art results on many visual recognition tasks, CNN inference on MCUs is challenging due to severe finite memory limitations. To circumvent the memory challenge associated with CNNs, various alternatives have been proposed that do fit within the memory budget of an MCU, albeit at the cost of prediction accuracy. This paper challenges the idea that CNNs are not suitable for deployment on MCUs. We demonstrate that it is possible to automatically design CNNs which generalize well, while also being small enough to fit onto memory-limited MCUs. Our Sparse Architecture Search method combines neural architecture search with pruning in a single, unified approach, which learns superior models on four popular IoT datasets. The CNNs we find are more accurate and up to 4.35× smaller than previous approaches, while meeting the strict MCU working memory constraint.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Seff, Ari; Zhou, Wenda; Damani, Farhan; Doyle, Abigail; Adams, Ryan P.
Discrete Object Generation with Reversible Inductive Construction Conference
Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.
@conference{seff2019discrete,
title = {Discrete Object Generation with Reversible Inductive Construction},
author = {Ari Seff and
Wenda Zhou and
Farhan Damani and
Abigail Doyle and
Ryan P. Adams},
url = {https://www.cs.princeton.edu/~rpa/pubs/seff2019discrete.pdf},
year = {2019},
date = {2019-12-04},
booktitle = {Advances in Neural Information Processing Systems 32 (NeurIPS)},
abstract = {The success of generative modeling in continuous domains has led to a surge of interest in generating discrete data such as molecules, source code, and graphs. However, construction histories for these discrete objects are typically not unique and so generative models must reason about intractably large spaces in order to learn. Additionally, structured discrete domains are often characterized by strict constraints on what constitutes a valid object and generative models must respect these requirements in order to produce useful novel samples. Here, we present a generative model for discrete objects employing a Markov chain where transitions are restricted to a set of local operations that preserve validity. Building off of generative interpretations of denoising autoencoders, the Markov chain alternates between producing 1) a sequence of corrupted objects that are valid but not from the data distribution, and 2) a learned reconstruction distribution that attempts to fix the corruptions while also preserving validity. This approach constrains the generative model to only produce valid objects, requires the learner to only discover local modifications to the objects, and avoids marginalization over an unknown and potentially large space of construction histories. We evaluate the proposed approach on two highly structured discrete domains, molecules and Laman graphs, and find that it compares favorably to alternative methods at capturing distributional statistics for a host of semantically relevant metrics.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Ash, Jordan T.; Adams, Ryan P.
On the Difficulty of Warm-Starting Neural Network Training Technical Report
2019.
@techreport{ash2018warm,
title = {On the Difficulty of Warm-Starting Neural Network Training},
author = {Jordan T. Ash and Ryan P. Adams},
url = {https://arxiv.org/abs/1910.08475},
year = {2019},
date = {2019-10-18},
abstract = {In many real-world deployments of machine learning systems, data arrive piecemeal. These learning scenarios may be passive, where data arrive incrementally due to structural properties of the problem (e.g., daily financial data) or active, where samples are selected according to a measure of their quality (e.g., experimental design). In both of these cases, we are building a sequence of models that incorporate an increasing amount of data. We would like each of these models in the sequence to be performant and take advantage of all the data that are available to that point. Conventional intuition suggests that when solving a sequence of related optimization problems of this form, it should be possible to initialize using the solution of the previous iterate---to "warm start" the optimization rather than initialize from scratch---and see reductions in wall-clock time. However, in practice this warm-starting seems to yield poorer generalization performance than models that have fresh random initializations, even though the final training losses are similar. While it appears that some hyperparameter settings allow a practitioner to close this generalization gap, they seem to only do so in regimes that damage the wall-clock gains of the warm start. Nevertheless, it is highly desirable to be able to warm-start neural network training, as it would dramatically reduce the resource usage associated with the construction of performant deep learning systems. In this work, we take a closer look at this empirical phenomenon and try to understand when and how it occurs. Although the present investigation did not lead to a solution, we hope that a thorough articulation of the problem will spur new research that may lead to improved methods that consume fewer resources during training.},
keywords = {},
pubstate = {published},
tppubtype = {techreport}
}
Beatson, Alex; Adams, Ryan P.
Efficient Optimization of Loops and Limits with Randomized Telescoping Sums Conference
Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
@conference{beatson2019efficient,
title = {Efficient Optimization of Loops and Limits with Randomized Telescoping Sums},
author = {Alex Beatson and
Ryan P. Adams},
url = {https://www.cs.princeton.edu/~rpa/pubs/beatson2019efficient.pdf},
year = {2019},
date = {2019-06-13},
booktitle = {Proceedings of the 36th International Conference on Machine Learning (ICML)},
abstract = {We consider optimization problems in which the objective requires an inner loop with many steps or is the limit of a sequence of increasingly costly approximations. Meta-learning, training recurrent neural networks, and optimization of the solutions to differential equations are all examples of optimization problems with this character. In such problems, it can be expensive to compute the objective function value and its gradient, but truncating the loop or using less accurate approximations can induce biases that damage the overall solution. We propose randomized telescope (RT) gradient estimators, which represent the objective as the sum of a telescoping series and sample linear combinations of terms to provide cheap unbiased gradient estimates. We identify conditions under which RT estimators achieve optimization convergence rates independent of the length of the loop or the required accuracy of the approximation. We also derive a method for tuning RT estimators online to maximize a lower bound on the expected decrease in loss per unit of computation. We evaluate our adaptive RT estimators on a range of applications including meta-optimization of learning rates, variational inference of ODE parameters, and training an LSTM to model long sequences.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
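The randomized-telescope construction can be illustrated on a scalar series: write the limit as f_0 plus a telescoping sum of increments, sample a random truncation level, and reweight each included increment by one over the probability that it is included, so the estimator stays unbiased. The sketch below uses that Russian-roulette form on a toy series converging to e; it is an illustration of the estimator family under our chosen sampling distribution, not the adaptive tuning procedure from the paper.

```python
import numpy as np
from math import factorial, e

rng = np.random.default_rng(0)

def rt_estimate(delta, f0=0.0, cont=0.7):
    """Russian-roulette randomized telescope: sample a truncation level N from a
    geometric distribution and reweight increment n by 1 / P(N >= n), keeping
    the estimate of f0 + sum_n delta(n) unbiased."""
    n, keep_going, est = 1, True, f0
    while keep_going:
        est += delta(n) / cont ** (n - 1)   # 1 / P(N >= n) for this geometric
        keep_going = rng.random() < cont    # continue with probability `cont`
        n += 1
    return est

# Toy target: e = 1 + sum_{n>=1} 1/n!, so f0 = 1 and delta(n) = 1/n!.
samples = [rt_estimate(lambda n: 1.0 / factorial(n), f0=1.0) for _ in range(50000)]
print(np.mean(samples), "vs", e)   # unbiased: the average is close to e
```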
Zhou, Wenda; Veitch, Victor; Austern, Morgane; Adams, Ryan P.; Orbanz, Peter
Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach Conference
Proceedings of the Seventh International Conference on Learning Representations (ICLR), 2019.
@conference{zhou2019nonvacuous,
title = {Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach},
author = {Wenda Zhou and
Victor Veitch and
Morgane Austern and
Ryan P. Adams and
Peter Orbanz},
url = {https://www.cs.princeton.edu/~rpa/pubs/zhou2019nonvacuous.pdf},
year = {2019},
date = {2019-04-18},
booktitle = {Proceedings of the Seventh International Conference on Learning Representations (ICLR)},
abstract = {Modern neural networks are highly overparameterized, with capacity to substantially overfit to training data. Nevertheless, these networks often generalize well in practice. It has also been observed that trained networks can often be "compressed" to much smaller representations. The purpose of this paper is to connect these two empirical observations. Our main technical result is a generalization bound for compressed networks based on the compressed size. Combined with off-the-shelf compression algorithms, the bound leads to state of the art generalization guarantees; in particular, we provide the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem. As additional evidence connecting compression and generalization, we show that compressibility of models that tend to overfit is limited: We establish an absolute limit on expected compressibility as a function of expected generalization error, where the expectations are over the random choice of training examples. The bounds are complemented by empirical results that show an increase in overfitting implies an increase in the number of bits required to describe a trained network.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
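To convey how compressed size turns into a generalization guarantee, the sketch below evaluates a classical Occam-style bound for a model describable in k bits: with probability 1 - delta, the gap between test and training error is at most sqrt((k ln 2 + ln(1/delta)) / (2m)). This is the textbook finite-hypothesis-class bound, shown only to illustrate the "fewer bits means a tighter bound" intuition; the paper's actual result is a PAC-Bayesian bound tailored to its compression scheme and is not reproduced here.

```python
import numpy as np

def occam_bound_gap(bits, m, delta=0.05):
    """Occam/finite-class bound on the generalization gap for a classifier
    describable in `bits` bits, trained on m examples (Hoeffding + union bound):
    gap <= sqrt((bits*ln 2 + ln(1/delta)) / (2m)) with probability 1 - delta."""
    return np.sqrt((bits * np.log(2) + np.log(1.0 / delta)) / (2.0 * m))

m = 1_200_000  # roughly ImageNet-scale training set size
for megabytes in [0.1, 1.0, 10.0, 100.0]:
    bits = megabytes * 8e6
    print(f"{megabytes:6.1f} MB -> gap bound {occam_bound_gap(bits, m):.3f}")
```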
Wei, Jennifer N.; Belanger, David; Adams, Ryan P.; Sculley, D.
Rapid Prediction of Electron–Ionization Mass Spectrometry Using Neural Networks Journal Article
In: ACS Central Science, vol. 5, no. 4, pp. 700-708, 2019.
@article{wei2019rapid,
title = {Rapid Prediction of Electron–Ionization Mass Spectrometry Using Neural Networks},
author = {Jennifer N. Wei and
David Belanger and
Ryan P. Adams and
D. Sculley},
url = {https://www.cs.princeton.edu/~rpa/pubs/wei2019rapid.pdf},
year = {2019},
date = {2019-03-19},
journal = {ACS Central Science},
volume = {5},
number = {4},
pages = {700-708},
abstract = {When confronted with a substance of unknown identity, researchers often perform mass spectrometry on the sample and compare the observed spectrum to a library of previously collected spectra to identify the molecule. While popular, this approach will fail to identify molecules that are not in the existing library. In response, we propose to improve the library’s coverage by augmenting it with synthetic spectra that are predicted from candidate molecules using machine learning. We contribute a lightweight neural network model that quickly predicts mass spectra for small molecules, averaging 5 ms per molecule with a recall-at-10 accuracy of 91.8%. Achieving high-accuracy predictions requires a novel neural network architecture that is designed to capture typical fragmentation patterns from electron ionization. We analyze the effects of our modeling innovations on library matching performance and compare our models to prior machine-learning-based work on spectrum prediction.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
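The recall-at-10 figure quoted above refers to library matching: a query spectrum is compared against a library of predicted spectra, and the match counts as correct if the true molecule appears among the ten most similar entries. The snippet below sketches that metric with cosine similarity on toy spectrum vectors; the similarity function and data are placeholders, not the paper's matching pipeline.

```python
import numpy as np

def recall_at_k(query_spectra, library_spectra, true_indices, k=10):
    """Fraction of queries whose true library entry ranks in the top k by cosine
    similarity (a common library-matching score; placeholder choice here)."""
    q = query_spectra / np.linalg.norm(query_spectra, axis=1, keepdims=True)
    lib = library_spectra / np.linalg.norm(library_spectra, axis=1, keepdims=True)
    sims = q @ lib.T                                  # (n_queries, n_library)
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [true_indices[i] in topk[i] for i in range(len(true_indices))]
    return float(np.mean(hits))

rng = np.random.default_rng(0)
library = rng.random((500, 200))                       # 500 molecules, 200 m/z bins
queries = library[:50] + 0.05 * rng.random((50, 200))  # noisy versions of known entries
print(recall_at_k(queries, library, true_indices=np.arange(50)))
```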
Gilmer, Justin; Adams, Ryan P.; Goodfellow, Ian; Andersen, David; Dahl, George E.
Motivating the Rules of the Game for Adversarial Example Research Technical Report
2018.
@techreport{gilmer2018adversarial,
title = {Motivating the Rules of the Game for Adversarial Example Research},
author = {Justin Gilmer and Ryan P. Adams and Ian Goodfellow and David Andersen and George E. Dahl},
url = {https://arxiv.org/abs/1807.06732},
year = {2018},
date = {2018-07-18},
abstract = {Advances in machine learning have led to broad deployment of systems with impressive performance on important problems. Nonetheless, these systems can be induced to make errors on data that are surprisingly similar to examples the learned system handles correctly. The existence of these errors raises a variety of questions about out-of-sample generalization and whether bad actors might use such examples to abuse deployed systems. As a result of these security concerns, there has been a flurry of recent papers proposing algorithms to defend against such malicious perturbations of correctly handled examples. It is unclear how such misclassifications represent a different kind of security problem than other errors, or even other attacker-produced examples that have no specific relationship to an uncorrupted input. In this paper, we argue that adversarial example defense papers have, to date, mostly considered abstract, toy games that do not relate to any specific security concern. Furthermore, defense papers have not yet precisely described all the abilities and limitations of attackers that would be relevant in practical security. Towards this end, we establish a taxonomy of motivations, constraints, and abilities for more plausible adversaries. Finally, we provide a series of recommendations outlining a path forward for future work to more clearly articulate the threat model and perform more meaningful evaluation.},
keywords = {},
pubstate = {published},
tppubtype = {techreport}
}
Saeedi, Ardavan; Hoffman, Matthew D.; DiVerdi, Stephen J.; Ghandeharioun, Asma; Johnson, Matthew J.; Adams, Ryan P.
Multimodal Prediction and Personalization of Photo Edits with Deep Generative Models Conference
Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018, (arXiv:1704.04997 [stat.ML]).
@conference{saeedi2018multimodal,
title = {Multimodal Prediction and Personalization of Photo Edits with Deep Generative Models},
author = {Ardavan Saeedi and Matthew D. Hoffman and Stephen J. DiVerdi and Asma Ghandeharioun and Matthew J. Johnson and Ryan P. Adams},
url = {http://www.cs.princeton.edu/~rpa/pubs/saeedi2018multimodal.pdf},
year = {2018},
date = {2018-01-01},
booktitle = {Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS)},
abstract = {Professional-grade software applications are powerful but
complicated---expert users can achieve impressive results, but
novices often struggle to complete even basic tasks. Photo
editing is a prime example: after loading a photo, the user is
confronted with an array of cryptic sliders like "clarity",
"temp", and "highlights". An automatically generated
suggestion could help, but there is no single "correct" edit
for a given image---different experts may make very different
aesthetic decisions when faced with the same image, and a
single expert may make different choices depending on the
intended use of the image (or on a whim). We therefore want a
system that can propose multiple diverse, high-quality edits
while also learning from and adapting to a user's aesthetic
preferences. In this work, we develop a statistical model that
meets these objectives. Our model builds on recent advances in
neural network generative modeling and scalable inference, and
uses hierarchical structure to learn editing patterns across
many diverse users. Empirically, we find that our model
outperforms other approaches on this challenging multimodal
prediction task.},
note = {arXiv:1704.04997 [stat.ML]},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Duvenaud, David; Maclaurin, Dougal; Adams, Ryan P.
Early Stopping is Nonparametric Variational Inference Conference
Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2016, (arXiv:1504.01344 [stat.ML]).
@conference{duvenaud2016early,
title = {Early Stopping is Nonparametric Variational Inference},
author = {David Duvenaud and Dougal Maclaurin and Ryan P. Adams},
url = {http://www.cs.princeton.edu/~rpa/pubs/duvenaud2016early.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS)},
abstract = {We show that unconverged stochastic gradient descent can be
interpreted as a procedure that samples from a nonparametric
variational approximate posterior distribution. This
distribution is implicitly defined as the transformation of an
initial distribution by a sequence of optimization updates. By
tracking the change in entropy over this sequence of
transformations during optimization, we form a scalable,
unbiased estimate of the variational lower bound on the log
marginal likelihood. We can use this bound to optimize
hyperparameters instead of using cross-validation. This
Bayesian interpretation of SGD suggests improved,
overfitting-resistant optimization procedures, and gives a
theoretical foundation for popular tricks such as early
stopping and ensembling. We investigate the properties of this
marginal likelihood estimator on neural network models.},
note = {arXiv:1504.01344 [stat.ML]},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
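The entropy bookkeeping in the abstract is easiest to see for a quadratic loss, where one full-batch SGD step is the linear map theta -> (I - eta*H) theta and therefore changes the entropy of any distribution over parameters by log|det(I - eta*H)|. The sketch below accumulates that per-step change for a Gaussian initialization and checks it against the entropy of the exact Gaussian pushforward; it is a hedged illustration of the tracking idea in a linear special case, not the paper's stochastic estimator of the full variational lower bound.

```python
import numpy as np

# Quadratic loss L(theta) = 0.5 * theta^T H theta, so one (full-batch) SGD step
# is the linear map theta -> (I - eta*H) theta.
H = np.diag([0.5, 1.0, 2.0])
eta, T, sigma0 = 0.1, 50, 1.0
D = H.shape[0]

def gaussian_entropy(cov):
    return 0.5 * (D * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

J = np.eye(D) - eta * H                     # per-step Jacobian of the SGD map
step_change = np.linalg.slogdet(J)[1]       # log|det J|: entropy change per step

# Track entropy incrementally vs. the exact entropy of the pushforward Gaussian.
tracked = gaussian_entropy(sigma0 ** 2 * np.eye(D)) + T * step_change
JT = np.linalg.matrix_power(J, T)
cov_T = JT @ (sigma0 ** 2 * np.eye(D)) @ JT.T
print(tracked, gaussian_entropy(cov_T))     # the two entropies agree
```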
Johnson, Matthew J.; Duvenaud, David; Wiltschko, Alexander B.; Datta, Sandeep Robert; Adams, Ryan P.
Composing Graphical Models with Neural Networks for Structured Representations and Fast Inference Conference
Advances in Neural Information Processing Systems (NIPS) 29, 2016, (arXiv:1603.06277 [stat.ML]).
@conference{johnson2016svae,
title = {Composing Graphical Models with Neural Networks for Structured Representations and Fast Inference},
author = {Matthew J. Johnson and David Duvenaud and Alexander B. Wiltschko and Sandeep Robert Datta and Ryan P. Adams},
url = {http://www.cs.princeton.edu/~rpa/pubs/johnson2016svae.pdf},
year = {2016},
date = {2016-01-01},
booktitle = {Advances in Neural Information Processing Systems (NIPS) 29},
abstract = {We propose a general modeling and inference framework that
composes probabilistic graphical models with deep learning
methods and combines their respective strengths. Our model
family augments graphical structure in latent variables with
neural network observation models. For inference, we extend
variational autoencoders to use graphical model approximating
distributions with recognition networks that output conjugate
potentials. All components of these models are learned
simultaneously with a single objective, giving a scalable
algorithm that leverages stochastic variational inference,
natural gradients, graphical model message passing, and the
reparameterization trick. We illustrate this framework with
several example models and an application to mouse behavioral
phenotyping.},
note = {arXiv:1603.06277 [stat.ML]},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Nemati, Shamim; Adams, Ryan P.
Identifying Outcome-Discriminative Dynamics in Multivariate Physiological Cohort Time Series Book Chapter
In: Advanced State Space Methods for Neural and Clinical Data, Cambridge University Press, Cambridge, UK, 2015.
@inbook{nemati2015identifying,
title = {Identifying Outcome-Discriminative Dynamics in Multivariate Physiological Cohort Time Series},
author = {Shamim Nemati and Ryan P. Adams},
url = {http://www.cs.princeton.edu/~rpa/pubs/nemati2015identifying.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {Advanced State Space Methods for Neural and Clinical Data},
publisher = {Cambridge University Press},
address = {Cambridge, UK},
abstract = {In this chapter, we present a learning algorithm specifically
designed to learn dynamical features of time series that are
directly predictive of the associated labels. Rather than
depending on label-free unsupervised learning to discover
relevant features of the time series, we build a system that
expressly learns the dynamics that are most relevant for
classifying time series labels. Our goal is to obtain compact
representations of nonstationary and multivariate time series
(representation learning) (Bengio, Courville & Vincent 2013).
To accomplish this we use a connection between dynamic
Bayesian networks (e.g., the switching VAR model) and
artificial neural networks (ANNs) to perform inference and
learning in state-space models in a manner analogous to
backpropagation in neural networks (Rumelhart, Hinton &
Williams 1988). This connection stems from the observation
that the directed acyclic graph structure of a state-space
model can be unrolled both as a function of time and inference
steps to yield a deterministic neural network with efficient
parameter tying across time (see Fig. 1.2). Thus, the
parameters governing the dynamics and observation model of a
state-space model can be learned in a manner analogous to that
of a neural network. Indeed, the resulting system can be
viewed as a compactly parameterized recurrent neural network
(RNN) (Sutskever 2013). Although the standard use of RNNs has
been for time series prediction (the network output is the
predicted input time series in the future) or sequential
labeling (the output is a label sequence associated with the
input data sequence), with additional processing layers one
may obtain a time series classifier from this class of models
(Graves, Fernández, Gomez & Schmidhuber 2006). Nevertheless,
RNNs have proven hard to train, since the optimization surface
tends to include multiple local minima. Moreover, standard
RNNs are 'black box' algorithms (as opposed to 'model-based')
and therefore do not allow for incorporation of physiological
models of the underlying systems. The framework proposed here
addresses both of these shortcomings. First, knowledge of the
underlying physiology can be directly incorporated into the
state-space models that constitute the basic building blocks
of a dynamic Bayesian network. Second, equipped with a
generative model, we can rely on unsupervised pre-training
(via expectation maximization) to systematically initialize
the parameters of the equivalent RNN, in a manner analogous to
pre-training of very large neural networks (deep learning)
(Erhan, Bengio, Courville, Manzagol, Vincent & Bengio 2010).},
keywords = {},
pubstate = {published},
tppubtype = {inbook}
}
Maclaurin, Dougal; Duvenaud, David; Adams, Ryan P.
Gradient-based Hyperparameter Optimization through Reversible Learning Conference
Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, (arXiv:1502.03492 [stat.ML]).
@conference{maclaurin2015reversible,
title = {Gradient-based Hyperparameter Optimization through Reversible Learning},
author = {Dougal Maclaurin and David Duvenaud and Ryan P. Adams},
url = {http://www.cs.princeton.edu/~rpa/pubs/maclaurin2015reversible.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {Proceedings of the 32nd International Conference on Machine Learning (ICML)},
abstract = {Tuning hyperparameters of learning algorithms is hard because
gradients are usually unavailable. We compute exact gradients
of cross-validation performance with respect to all
hyperparameters by chaining derivatives backwards through the
entire training procedure. These gradients allow us to
optimize thousands of hyperparameters, including step-size and
momentum schedules, weight initialization distributions,
richly parameterized regularization schemes, and neural
network architectures. We compute hyperparameter gradients by
exactly reversing the dynamics of stochastic gradient descent
with momentum.},
note = {arXiv:1502.03492 [stat.ML]},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
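The key mechanism here is that SGD with momentum is, up to finite precision, an invertible dynamical system: from the state after an update one can recover the state before it, which is what lets the reverse-mode pass reconstruct the training trajectory instead of storing it. A minimal sketch of forward and exactly reversed momentum updates on a quadratic loss is given below (ordinary floating-point arithmetic; the paper also handles the finite-precision bookkeeping, which this sketch ignores).

```python
import numpy as np

A = np.diag([1.0, 3.0])

def grad(w):                        # gradient of the quadratic loss 0.5 * w^T A w
    return A @ w

def forward_step(w, v, lr=0.1, momentum=0.9):
    v_new = momentum * v - lr * grad(w)
    return w + v_new, v_new

def reverse_step(w_new, v_new, lr=0.1, momentum=0.9):
    w = w_new - v_new                        # undo the position update
    v = (v_new + lr * grad(w)) / momentum    # undo the velocity update
    return w, v

rng = np.random.default_rng(0)
w0, v0 = rng.standard_normal(2), np.zeros(2)

w, v = w0.copy(), v0.copy()
for _ in range(100):                # run training forward
    w, v = forward_step(w, v)
for _ in range(100):                # run the same dynamics backwards
    w, v = reverse_step(w, v)

print(np.max(np.abs(w - w0)))       # recovers the initial weights (up to float error)
```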
Snoek, Jasper; Rippel, Oren; Swersky, Kevin; Kiros, Ryan; Satish, Nadathur; Sundaram, Narayanan; Patwary, Md. Mostofa Ali; Prabhat; Adams, Ryan P.
Scalable Bayesian Optimization Using Deep Neural Networks Conference
Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, (arXiv:1502.05700 [stat.ML]).
@conference{snoek2015scalable,
title = {Scalable Bayesian Optimization Using Deep Neural Networks},
author = {Jasper Snoek and Oren Rippel and Kevin Swersky and Ryan Kiros and Nadathur Satish and Narayanan Sundaram and Md. Mostofa Ali Patwary and Prabhat and Ryan P. Adams},
url = {http://www.cs.princeton.edu/~rpa/pubs/snoek2015scalable.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {Proceedings of the 32nd International Conference on Machine Learning (ICML)},
abstract = {Bayesian optimization is an effective methodology for the global
optimization of functions with expensive evaluations. It
relies on querying a distribution over functions defined by a
relatively cheap surrogate model. An accurate model for this
distribution over functions is critical to the effectiveness
of the approach, and is typically fit using Gaussian processes
(GPs). However, since GPs scale cubically with the number of
observations, it has been challenging to handle objectives
whose optimization requires many evaluations, and as such,
massively parallelizing the optimization. In this work, we
explore the use of neural networks as an alternative to GPs to
model distributions over functions. We show that performing
adaptive basis function regression with a neural network as
the parametric form performs competitively with
state-of-the-art GP-based approaches, but scales linearly with
the number of data rather than cubically. This allows us to
achieve a previously intractable degree of parallelism, which
we apply to large scale hyperparameter optimization, rapidly
finding competitive models on benchmark object recognition
tasks using convolutional networks, and image caption
generation using neural language models.},
note = {arXiv:1502.05700 [stat.ML]},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
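The "adaptive basis function regression" described above amounts to Bayesian linear regression on features produced by the network's last hidden layer, which yields calibrated predictive uncertainty at cost linear in the number of observations. The sketch below uses fixed random tanh features as a stand-in for a trained network's basis and shows the predictive mean and variance computation; it is not the full model from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, W, b):
    """Stand-in basis: random tanh features instead of a trained network."""
    return np.tanh(x @ W + b)

def bayes_linreg_predict(Phi_train, y, Phi_test, alpha=1.0, noise=0.1):
    """Bayesian linear regression on the basis: O(n d^2) rather than the O(n^3)
    of an exact GP, with closed-form predictive mean and variance."""
    d = Phi_train.shape[1]
    A = alpha * np.eye(d) + Phi_train.T @ Phi_train / noise   # posterior precision
    mean_w = np.linalg.solve(A, Phi_train.T @ y) / noise       # posterior mean weights
    mean = Phi_test @ mean_w
    var = noise + np.einsum("ij,ij->i", Phi_test, np.linalg.solve(A, Phi_test.T).T)
    return mean, var

W, b = rng.standard_normal((1, 50)), rng.standard_normal(50)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
mean, var = bayes_linreg_predict(features(X, W, b), y, features(X_test, W, b))
print(mean.round(2), var.round(3))
```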
Hernández-Lobato, José Miguel; Adams, Ryan P.
Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks Conference
Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, (arXiv:1502.05336 [stat.ML]).
@conference{lobato2015probabilistic,
title = {Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks},
author = {José Miguel Hernández-Lobato and Ryan P. Adams},
url = {http://www.cs.princeton.edu/~rpa/pubs/lobato2015probabilistic.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {Proceedings of the 32nd International Conference on Machine Learning (ICML)},
abstract = {Large multilayer neural networks trained with backpropagation
have recently achieved state-of-the-art results in a wide
range of problems. However, using backprop for neural net
learning still has some disadvantages, e.g., having to tune a
large number of hyperparameters to the data, lack of
calibrated probabilistic predictions, and a tendency to
overfit the training data. In principle, the Bayesian approach
to learning neural networks does not have these
problems. However, existing Bayesian techniques lack
scalability to large dataset and network sizes. In this work
we present a novel scalable method for learning Bayesian
neural networks, called probabilistic backpropagation
(PBP). Similar to classical backpropagation, PBP works by
computing a forward propagation of probabilities through the
network and then doing a backward computation of gradients. A
series of experiments on ten real-world datasets show that PBP
is significantly faster than other techniques, while offering
competitive predictive abilities. Our experiments also show
that PBP provides accurate estimates of the posterior variance
on the network weights.},
note = {arXiv:1502.05336 [stat.ML]},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
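A taste of the "forward propagation of probabilities" in PBP: if a pre-activation is approximated as Gaussian, the mean and variance after a ReLU have closed-form expressions, so distributions rather than point values can be pushed through the layer. The snippet below checks those standard Gaussian-ReLU moment formulas against Monte Carlo; it shows only this assumed-density building block, not the PBP weight-update rules.

```python
import numpy as np
from math import erf, exp, pi, sqrt

def phi(z):   # standard normal pdf
    return exp(-0.5 * z * z) / sqrt(2 * pi)

def Phi(z):   # standard normal cdf
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def relu_moments(mu, sigma):
    """Mean and variance of max(0, x) for x ~ N(mu, sigma^2): the closed-form
    moments that let a Gaussian approximation be pushed through a ReLU."""
    a = mu / sigma
    m1 = mu * Phi(a) + sigma * phi(a)
    m2 = (mu ** 2 + sigma ** 2) * Phi(a) + mu * sigma * phi(a)
    return m1, m2 - m1 ** 2

mu, sigma = 0.3, 1.2
mean, var = relu_moments(mu, sigma)

samples = np.maximum(0.0, mu + sigma * np.random.default_rng(0).standard_normal(1_000_000))
print(mean, samples.mean())   # analytic vs. Monte Carlo mean
print(var, samples.var())     # analytic vs. Monte Carlo variance
```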
Rippel, Oren; Snoek, Jasper; Adams, Ryan P.
Spectral Representations for Convolutional Neural Networks Conference
Advances in Neural Information Processing Systems (NIPS) 28, 2015, (arXiv:1506.03767 [stat.ML]).
@conference{rippel2015spectral,
title = {Spectral Representations for Convolutional Neural Networks},
author = {Oren Rippel and Jasper Snoek and Ryan P. Adams},
url = {http://www.cs.princeton.edu/~rpa/pubs/rippel2015spectral.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {Advances in Neural Information Processing Systems (NIPS) 28},
abstract = {Discrete Fourier transforms provide a significant speedup in the
computation of convolutions in deep learning. In this work, we
demonstrate that, beyond its advantages for efficient
computation, the spectral domain also provides a powerful
representation in which to model and train convolutional
neural networks (CNNs). We employ spectral representations to
introduce a number of innovations to CNN design. First, we
propose spectral pooling, which performs dimensionality
reduction by truncating the representation in the frequency
domain. This approach preserves considerably more information
per parameter than other pooling strategies and enables
flexibility in the choice of pooling output
dimensionality. This representation also enables a new form of
stochastic regularization by randomized modification of
resolution. We show that these methods achieve competitive
results on classification and approximation tasks, without
using any dropout or max-pooling. Finally, we demonstrate the
effectiveness of complex-coefficient spectral parameterization
of convolutional filters. While this leaves the underlying
model unchanged, it results in a representation that greatly
facilitates optimization. We observe on a variety of popular
CNN configurations that this leads to significantly faster
convergence during training.},
note = {arXiv:1506.03767 [stat.ML]},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
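Spectral pooling as described above is straightforward to sketch: transform a feature map to the frequency domain, keep only a centered block of low frequencies, and transform back. The numpy version below operates on a single-channel array and is meant only to illustrate the truncation step; the paper embeds this inside CNN training together with the other spectral techniques.

```python
import numpy as np

def spectral_pool(x, out_h, out_w):
    """Downsample a 2-D array by truncating its Fourier representation to a
    centered out_h x out_w block of low frequencies, then inverting."""
    F = np.fft.fftshift(np.fft.fft2(x))
    h, w = x.shape
    top, left = (h - out_h) // 2, (w - out_w) // 2
    F_crop = F[top:top + out_h, left:left + out_w]
    # rescale so overall intensity is preserved by the smaller inverse FFT
    F_crop = F_crop * (out_h * out_w) / (h * w)
    # take the real part: cropping can break exact conjugate symmetry
    return np.real(np.fft.ifft2(np.fft.ifftshift(F_crop)))

rng = np.random.default_rng(0)
img = rng.random((32, 32))
small = spectral_pool(img, 8, 8)
print(small.shape, img.mean(), small.mean())   # (8, 8); means agree
```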
Duvenaud, David; Maclaurin, Dougal; Aguilera-Iparraguirre, Jorge; Gómez-Bombarelli, Rafael; Hirzel, Timothy D.; Aspuru-Guzik, Alan; Adams, Ryan P.
Convolutional Networks on Graphs for Learning Molecular Fingerprints Conference
Advances in Neural Information Processing Systems (NIPS) 28, 2015, (arXiv:1509.09292 [stat.ML]).
@conference{duvenaud2015fingerprints,
title = {Convolutional Networks on Graphs for Learning Molecular Fingerprints},
author = {David Duvenaud and Dougal Maclaurin and Jorge Aguilera-Iparraguirre and Rafael Gómez-Bombarelli and Timothy D. Hirzel and Alan Aspuru-Guzik and Ryan P. Adams},
url = {http://www.cs.princeton.edu/~rpa/pubs/duvenaud2015fingerprints.pdf},
year = {2015},
date = {2015-01-01},
booktitle = {Advances in Neural Information Processing Systems (NIPS) 28},
abstract = {We introduce a convolutional neural network that operates
directly on graphs. These networks allow end-to-end learning
of prediction pipelines whose inputs are graphs of arbitrary
size and shape. The architecture we present generalizes
standard molecular feature extraction methods based on
circular fingerprints. We show that these data-driven features
are more interpretable, and have better predictive performance
on a variety of tasks.},
note = {arXiv:1509.09292 [stat.ML]},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
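The architecture generalizes circular (Morgan) fingerprints by replacing the hash-and-index operations with differentiable ones: at each layer an atom's features are summed with its neighbors', passed through a smooth nonlinearity, and softly written into a fixed-length fingerprint vector. The sketch below is a bare-bones numpy version of that aggregate-and-write pattern with random weights; it omits bond features and all training details.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def neural_fingerprint(atom_feats, adjacency, depth=3, fp_len=16):
    """Differentiable analogue of a circular fingerprint: repeatedly mix each
    atom's features with its neighbors', then softly index into the fingerprint
    instead of hashing. Random weights; illustration only."""
    d = atom_feats.shape[1]
    W_hidden = [rng.standard_normal((d, d)) * 0.1 for _ in range(depth)]
    W_out = rng.standard_normal((d, fp_len)) * 0.1
    h, fp = atom_feats.copy(), np.zeros(fp_len)
    for layer in range(depth):
        h = np.tanh((h + adjacency @ h) @ W_hidden[layer])  # self + neighbor sum
        for atom_h in h:
            fp += softmax(atom_h @ W_out)                   # soft "hash" write
    return fp

# A toy 4-atom molecule given as an adjacency matrix and per-atom features.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 1],
                      [0, 1, 0, 0],
                      [0, 1, 0, 0]], dtype=float)
atom_feats = rng.random((4, 8))
print(neural_fingerprint(atom_feats, adjacency).round(3))
```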
Duvenaud, David; Rippel, Oren; Adams, Ryan P.; Ghahramani, Zoubin
Avoiding Pathologies in Very Deep Networks Conference
Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 2014, (arXiv:1402.5836 [stat.ML]).
@conference{duvenaud2014pathologies,
title = {Avoiding Pathologies in Very Deep Networks},
author = {David Duvenaud and Oren Rippel and Ryan P. Adams and Zoubin Ghahramani},
url = {http://www.cs.princeton.edu/~rpa/pubs/duvenaud2014pathologies.pdf},
year = {2014},
date = {2014-01-01},
booktitle = {Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS)},
abstract = {Choosing appropriate architectures and regularization
strategies for deep networks is crucial to good predictive
performance. To shed light on this problem, we analyze the
analogous problem of constructing useful priors on
compositions of functions. Specifically, we study the deep
Gaussian process, a type of infinitely-wide, deep neural
network. We show that in standard architectures, the
representational capacity of the network tends to capture
fewer degrees of freedom as the number of layers increases,
retaining only a single degree of freedom in the limit. We
propose an alternate network architecture which does not
suffer from this pathology. We also examine deep covariance
functions, obtained by composing infinitely many feature
transforms. Lastly, we characterize the class of models
obtained by performing dropout on Gaussian processes.},
note = {arXiv:1402.5836 [stat.ML]},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Rippel, Oren; Gelbart, Michael A.; Adams, Ryan P.
Learning Ordered Representations with Nested Dropout Conference
Proceedings of the 31st International Conference on Machine Learning (ICML), 2014, (arXiv:1402.0915 [stat.ML]).
@conference{rippel2014nested,
title = {Learning Ordered Representations with Nested Dropout},
author = {Oren Rippel and Michael A. Gelbart and Ryan P. Adams},
url = {http://www.cs.princeton.edu/~rpa/pubs/rippel2014nested.pdf},
year = {2014},
date = {2014-01-01},
booktitle = {Proceedings of the 31st International Conference on Machine Learning (ICML)},
abstract = {In this paper, we study ordered representations of data in
which different dimensions have different degrees of
importance. To learn these representations we introduce nested
dropout, a procedure for stochastically removing coherent
nested sets of hidden units in a neural network. We first
present a sequence of theoretical results in the simple case
of a semi-linear autoencoder. We rigorously show that the
application of nested dropout enforces identifiability of the
units, which leads to an exact equivalence with PCA. We then
extend the algorithm to deep models and demonstrate the
relevance of ordered representations to a number of
applications. Specifically, we use the ordered property of the
learned codes to construct hash-based data structures that
permit very fast retrieval, achieving retrieval in time
logarithmic in the database size and independent of the
dimensionality of the representation. This allows codes that
are hundreds of times longer than currently feasible for
retrieval. We therefore avoid the diminished quality
associated with short codes, while still performing retrieval
that is competitive in speed with existing methods. We also
show that ordered representations are a promising way to learn
adaptive compression for efficient online data
reconstruction.},
note = {arXiv:1402.0915 [stat.ML]},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
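Nested dropout itself is a one-line sampling rule: draw an index b from a distribution over units and zero out every hidden unit with index greater than b, so unit i is only ever kept together with all units before it. A minimal mask sampler is sketched below; the geometric truncation distribution is chosen for illustration, whereas the paper analyzes the choice of distribution and its consequences.

```python
import numpy as np

rng = np.random.default_rng(0)

def nested_dropout_mask(num_units, p=0.1):
    """Sample b from a (truncated) geometric distribution and keep only the
    first b units, so a later unit is always dropped together with all of its
    successors; this is what induces the ordering of the representation."""
    b = min(rng.geometric(p), num_units)   # index of the last kept unit
    mask = np.zeros(num_units)
    mask[:b] = 1.0
    return mask

hidden = rng.standard_normal(12)           # a hidden-layer activation vector
for _ in range(3):
    print((hidden * nested_dropout_mask(len(hidden))).round(2))
```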
Rippel, Oren; Adams, Ryan P.
High-Dimensional Probability Estimation with Deep Density Models Unpublished
2013, (arXiv:1302.5125 [stat.ML]).
@unpublished{rippel2013density,
title = {High-Dimensional Probability Estimation with Deep Density Models},
author = {Oren Rippel and Ryan P. Adams},
url = {http://www.cs.princeton.edu/~rpa/pubs/rippel2013density.pdf},
year = {2013},
date = {2013-01-01},
abstract = {One of the fundamental problems in machine learning is the
estimation of a probability distribution from data. Many
techniques have been proposed to study the structure of data,
most often building around the assumption that observations
lie on a lower-dimensional manifold of high probability. It
has been more difficult, however, to exploit this insight to
build explicit, tractable density models for high-dimensional
data. In this paper, we introduce the deep density model
(DDM), a new approach to density estimation. We exploit
insights from deep learning to construct a bijective map to a
representation space, under which the transformation of the
distribution of the data is approximately factorized and has
identical and known marginal densities. The simplicity of the
latent distribution under the model allows us to feasibly
explore it, and the invertibility of the map to characterize
contraction of measure across it. This enables us to compute
normalized densities for out-of-sample data. This combination
of tractability and flexibility allows us to tackle a variety
of probabilistic tasks on high-dimensional datasets,
including: rapid computation of normalized densities at
test-time without evaluating a partition function; generation
of samples without MCMC; and characterization of the joint
entropy of the data.},
note = {arXiv:1302.5125 [stat.ML]},
keywords = {},
pubstate = {published},
tppubtype = {unpublished}
}
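The change-of-variables computation at the heart of the deep density model can be illustrated with a single invertible affine map. The sketch below is a deliberately shallow stand-in (the actual DDM learns a deep bijection with non-Gaussian marginals), and the matrix A, offset b, and seed are illustrative assumptions.

# Minimal change-of-variables illustration of the density computation the
# abstract describes (one invertible affine map; the DDM itself uses a learned
# deep bijection and different latent marginals).
import numpy as np

rng = np.random.default_rng(1)
D = 5
A = rng.normal(size=(D, D)) + 3.0 * np.eye(D)   # well-conditioned, invertible
b = rng.normal(size=D)

def log_density(x):
    """log p(x) = log p_z(f(x)) + log |det df/dx| with f(x) = A x + b."""
    z = A @ x + b
    log_pz = -0.5 * np.sum(z ** 2) - 0.5 * D * np.log(2 * np.pi)  # factorized N(0,1)
    _, logdet = np.linalg.slogdet(A)
    return log_pz + logdet

def sample():
    """Draw a sample without MCMC by inverting the map on latent noise."""
    z = rng.normal(size=D)
    return np.linalg.solve(A, z - b)

print(log_density(sample()))

The two routines mirror the two claims in the abstract: normalized densities come directly from the latent log-density plus the log-Jacobian, with no partition function, and samples are drawn by inverting the map on latent noise rather than by MCMC.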
Snoek, Jasper; Adams, Ryan P.; Larochelle, Hugo
On Nonparametric Guidance for Learning Autoencoder Representations Conference
Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012, (arXiv:1102.1492v4 [stat.ML]).
@conference{snoek2012guidance,
title = {On Nonparametric Guidance for Learning Autoencoder Representations},
author = {Jasper Snoek and Ryan P. Adams and Hugo Larochelle},
url = {http://www.cs.princeton.edu/~rpa/pubs/snoek2012guidance.pdf},
year = {2012},
date = {2012-01-01},
booktitle = {Proceedings of the 15th International Conference on Artificial
Intelligence and Statistics (AISTATS)},
abstract = {Unsupervised discovery of latent representations, in addition
to being useful for density modeling, visualisation and
exploratory data analysis, is also increasingly important for
learning features relevant to discriminative
tasks. Autoencoders, in particular, have proven to be an
effective way to learn latent codes that reflect meaningful
variations in data. A continuing challenge, however, is
guiding an autoencoder toward representations that are useful
for particular tasks. A complementary challenge is to find
codes that are invariant to irrelevant transformations of the
data. The most common way of introducing such problem-specific
guidance in autoencoders has been through the incorporation of
a parametric component that ties the latent representation to
the label information. In this work, we argue that a
preferable approach relies instead on a nonparametric guidance
mechanism. Conceptually, it ensures that there exists a
function that can predict the label information, without
explicitly instantiating that function. The superiority of
this guidance mechanism is confirmed on two datasets. In
particular, this approach is able to incorporate invariance
information (lighting, elevation, etc.) from the small NORB
object recognition dataset and yields state-of-the-art
performance for a single layer, non-convolutional network.},
note = {arXiv:1102.1492v4 [stat.ML]},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Dahl, George E.; Adams, Ryan P.; Larochelle, Hugo
Training Restricted Boltzmann Machines on Word Observations Conference
Proceedings of the 29th International Conference on Machine Learning (ICML), 2012, (arXiv:1202.5695 [cs.LG]).
@conference{dahl2012training,
title = {Training Restricted Boltzmann Machines on Word Observations},
author = {George E. Dahl and Ryan P. Adams and Hugo Larochelle},
url = {http://www.cs.princeton.edu/~rpa/pubs/dahl2012training.pdf},
year = {2012},
date = {2012-01-01},
booktitle = {Proceedings of the 29th International Conference on Machine
Learning (ICML)},
abstract = {The restricted Boltzmann machine (RBM) is a flexible model for
complex data. However, using RBMs for high-dimensional
multinomial observations poses significant computational
difficulties. In natural language processing applications, words
are naturally modeled by K-ary discrete distributions, where K
is determined by the vocabulary size and can easily be in the
hundreds of thousands. The conventional approach to training RBMs
on word observations is limited because it requires sampling
the states of K-way softmax visible units during block Gibbs
updates, an operation that takes time linear in K. In this
work, we address this issue with a more general class of
Markov chain Monte Carlo operators on the visible units,
yielding updates with computational complexity independent of
K. We demonstrate the success of our approach by training RBMs
on hundreds of millions of word n-grams using larger
vocabularies than previously feasible with RBMs and by using
the learned features to improve performance on chunking and
sentiment classification tasks, achieving state-of-the-art
results on the latter.},
note = {arXiv:1202.5695 [cs.LG]},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
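The computational point in the abstract above, that a Metropolis-Hastings update of a K-way softmax visible unit costs O(H) rather than O(K), can be seen in the short sketch below. The uniform proposal, variable names, and toy dimensions are assumptions for illustration; the paper studies richer static proposal distributions.

# Sketch of a Metropolis-Hastings update for one K-way softmax visible unit.
# Unlike block Gibbs, only the scores of the current and proposed words are
# needed, so the cost per update is independent of the vocabulary size K.
import numpy as np

rng = np.random.default_rng(2)

def mh_update_word(v, h, W, b):
    """One MH step for a single word slot.

    v : current word index
    h : hidden activations, shape (H,)
    W : weights, shape (H, K); b : visible biases, shape (K,)
    """
    K = b.shape[0]
    v_prop = rng.integers(K)                      # uniform proposal, O(1)
    score_cur = b[v] + h @ W[:, v]                # O(H), independent of K
    score_prop = b[v_prop] + h @ W[:, v_prop]
    if np.log(rng.uniform()) < score_prop - score_cur:
        return v_prop                             # accept
    return v                                      # reject

# Toy usage: a few MH sweeps over one word slot with a large vocabulary.
H, K = 16, 100_000
W, b = rng.normal(size=(H, K)) * 0.01, np.zeros(K)
h = rng.normal(size=H)
v = rng.integers(K)
for _ in range(10):
    v = mh_update_word(v, h, W, b)
print(v)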
Snoek, Jasper; Adams, Ryan P.; Larochelle, Hugo
Nonparametric Guidance of Autoencoder Representations Using Label Information Journal Article
In: Journal of Machine Learning Research, vol. 13, pp. 2567–2588, 2012.
@article{snoek2012autoencoder,
title = {Nonparametric Guidance of Autoencoder Representations Using
Label Information},
author = {Jasper Snoek and Ryan P. Adams and Hugo Larochelle},
url = {http://www.cs.princeton.edu/~rpa/pubs/snoek2012autoencoder.pdf},
year = {2012},
date = {2012-01-01},
journal = {Journal of Machine Learning Research},
volume = {13},
pages = {2567--2588},
abstract = {While unsupervised learning has long been useful for density
modeling, exploratory data analysis and visualization, it has
become increasingly important for discovering features that
will later be used for discriminative tasks. Discriminative
algorithms often work best with highly-informative features;
remarkably, such features can often be learned without the
labels. One particularly effective way to perform such
unsupervised learning has been to use autoencoder neural
networks, which find latent representations that are
constrained but nevertheless informative for
reconstruction. However, pure unsupervised learning with
autoencoders can find representations that may or may not be
useful for the ultimate discriminative task. It is a
continuing challenge to guide the training of an autoencoder
so that it finds features which will be useful for predicting
labels. Similarly, we often have a priori information
regarding what statistical variation will be irrelevant to the
ultimate discriminative task, and we would like to be able to
use this for guidance as well. Although a typical strategy
would be to include a parametric discriminative model as part
of the autoencoder training, here we propose a nonparametric
approach that uses a Gaussian process to guide the
representation. By using a nonparametric model, we can ensure
that a useful discriminative function exists for a given set
of features, without explicitly instantiating it. We
demonstrate the superiority of this guidance mechanism on four
data sets, including a real-world application to
rehabilitation research. We also show how our proposed
approach can learn to explicitly ignore statistically
significant covariate information that is label-irrelevant, by
evaluating on the small NORB image recognition problem in
which pose and lighting labels are available.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
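As a rough illustration of the nonparametric guidance idea in this article (and in its AISTATS 2012 conference version listed above), the sketch below adds the negative Gaussian process marginal log-likelihood of the labels, computed on the latent codes, to an autoencoder's reconstruction loss. The kernel choice, noise level, weighting lam, and linear encoder/decoder are illustrative assumptions, not the paper's settings.

# Hedged sketch of the guidance term: the negative GP marginal log-likelihood
# of the labels under a kernel on the latent codes is added to the usual
# reconstruction loss, so good codes are ones under which *some* smooth
# function of the code predicts the labels, without instantiating it.
import numpy as np

def rbf_kernel(Z, lengthscale=1.0):
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_guidance_loss(Z, y, noise=0.1):
    """Negative GP marginal log-likelihood of labels y given codes Z."""
    N = Z.shape[0]
    K = rbf_kernel(Z) + noise * np.eye(N)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha
            + np.sum(np.log(np.diag(L)))
            + 0.5 * N * np.log(2 * np.pi))

def combined_objective(X, y, encode, decode, lam=1.0):
    Z = encode(X)
    recon = np.mean((X - decode(Z)) ** 2)
    return recon + lam * gp_guidance_loss(Z, y)

# Toy usage with a random tied linear encoder/decoder.
rng = np.random.default_rng(3)
X, y = rng.normal(size=(50, 10)), rng.normal(size=50)
W = rng.normal(size=(10, 3)) * 0.3
print(combined_objective(X, y, encode=lambda X: X @ W, decode=lambda Z: Z @ W.T))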
Adams, Ryan P.; Wallach, Hanna M.; Ghahramani, Zoubin
Learning the Structure of Deep Sparse Graphical Models Conference
Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, (arXiv:1001.0160 [stat.ML]).
@conference{adams2010deep,
title = {Learning the Structure of Deep Sparse Graphical Models},
author = {Ryan P. Adams and Hanna M. Wallach and Zoubin Ghahramani},
url = {http://www.cs.princeton.edu/~rpa/pubs/adams2010deep.pdf},
year = {2010},
date = {2010-01-01},
booktitle = {Proceedings of the 13th International Conference on Artificial
Intelligence and Statistics (AISTATS)},
abstract = {Deep belief networks are a powerful way to model complex
probability distributions. However, learning the structure of
a belief network, particularly one with hidden units, is
difficult. The Indian buffet process has been used as a
nonparametric Bayesian prior on the directed structure of a
belief network with a single infinitely wide hidden layer. In
this paper, we introduce the cascading Indian buffet process
(CIBP), which provides a nonparametric prior on the structure
of a layered, directed belief network that is unbounded in
both depth and width, yet allows tractable inference. We use
the CIBP prior with the nonlinear Gaussian belief network so
each unit can additionally vary its behavior between discrete
and continuous representations. We provide Markov chain Monte
Carlo algorithms for inference in these belief networks and
explore the structures learned on several image data sets.},
note = {arXiv:1001.0160 [stat.ML]},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
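To make the cascading Indian buffet process construction above concrete, the sketch below samples a layered, directed structure in which the units of layer m act as IBP customers and their sampled dishes become the units of layer m+1. Only the binary structure is generated; weights, biases, and the nonlinear Gaussian belief network likelihood are omitted, and the alpha value and max_depth safety cap are assumptions for the example.

# Sketch of sampling a layered, directed structure from a cascading IBP prior:
# depth and width are unbounded a priori, but the cascade terminates with a
# finite network (a layer that instantiates no units ends the recursion).
import numpy as np

rng = np.random.default_rng(4)

def ibp_layer(num_customers, alpha):
    """One IBP draw: returns a binary matrix (customers x dishes)."""
    dish_counts = []                       # popularity of each dish so far
    rows = []
    for i in range(1, num_customers + 1):
        row = [rng.uniform() < m / i for m in dish_counts]    # existing dishes
        new = rng.poisson(alpha / i)                          # new dishes
        row += [True] * new
        dish_counts = [m + r for m, r in zip(dish_counts, row)] + [1] * new
        rows.append(row)
    Z = np.zeros((num_customers, len(dish_counts)), dtype=bool)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

def sample_cibp_structure(num_visible, alpha=1.0, max_depth=20):
    """List of adjacency matrices; layer m+1 units are parents of layer m units."""
    adjacency, width = [], num_visible
    for _ in range(max_depth):             # depth cap is a safeguard for the sketch
        Z = ibp_layer(width, alpha)
        if Z.shape[1] == 0:                # no new units: cascade terminates
            break
        adjacency.append(Z)
        width = Z.shape[1]
    return adjacency

structure = sample_cibp_structure(num_visible=10)
print([Z.shape for Z in structure])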