# Understanding and Architecting Deep Neural Networks

Oktay, Deniz; McGreivy, Nick; Aduol, Joshua; Beatson, Alex; Adams, Ryan P.

Randomized Automatic Differentiation Conference

Proceedings of the International Conference on Learning Representations (ICLR), 2021.

@conference{oktay2021randomized,

title = {Randomized Automatic Differentiation},

author = {Deniz Oktay and Nick McGreivy and Joshua Aduol and Alex Beatson and Ryan P. Adams},

year = {2021},

date = {2021-04-01},

booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},

abstract = {The successes of deep learning, variational inference, and many other fields have been aided by specialized implementations of reverse-mode automatic differentiation (AD) to compute gradients of mega-dimensional objectives. The AD techniques underlying these tools were designed to compute exact gradients to numerical precision, but modern machine learning models are almost always trained with stochastic gradient descent. Why spend computation and memory on exact (minibatch) gradients only to use them for stochastic optimization? We develop a general framework and approach for randomized automatic differentiation (RAD), which can allow unbiased gradient estimates to be computed with reduced memory in return for variance. We examine limitations of the general approach, and argue that we must leverage problem specific structure to realize benefits. We develop RAD techniques for a variety of simple neural network architectures, and show that for a fixed memory budget, RAD converges in fewer iterations than using a small batch size for feedforward networks, and in a similar number for recurrent networks. We also show that RAD can be applied to scientific computing, and use it to develop a low-memory stochastic gradient method for optimizing the control parameters of a linear reaction-diffusion PDE representing a fission reactor.},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Ash, Jordan T.; Adams, Ryan P.

On warm-starting neural network training Conference

Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.

@conference{ash2020warm,

title = {On warm-starting neural network training},

author = {Jordan T. Ash and Ryan P. Adams},

year = {2020},

date = {2020-12-01},

booktitle = {Advances in Neural Information Processing Systems 33 (NeurIPS)},

abstract = {In many real-world deployments of machine learning systems, data arrive piecemeal. These learning scenarios may be passive, where data arrive incrementally due to structural properties of the problem (e.g., daily financial data) or active, where samples are selected according to a measure of their quality (e.g., experimental design). In both of these cases, we are building a sequence of models that incorporate an increasing amount of data. We would like each of these models in the sequence to be performant and take advantage of all the data that are available to that point. Conventional intuition suggests that when solving a sequence of related optimization problems of this form, it should be possible to initialize using the solution of the previous iterate---to "warm start" the optimization rather than initialize from scratch---and see reductions in wall-clock time. However, in practice this warm-starting seems to yield poorer generalization performance than models that have fresh random initializations, even though the final training losses are similar. While it appears that some hyperparameter settings allow a practitioner to close this generalization gap, they seem to only do so in regimes that damage the wall-clock gains of the warm start. Nevertheless, it is highly desirable to be able to warm-start neural network training, as it would dramatically reduce the resource usage associated with the construction of performant deep learning systems. In this work, we take a closer look at this empirical phenomenon and try to understand when and how it occurs. We also provide a surprisingly simple trick that overcomes this pathology in several important situations, and present experiments that elucidate some of its properties.},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Liu, Sulin; Sun, Xingyuan; Ramadge, Peter J.; Adams, Ryan P.

Task-agnostic amortized inference of Gaussian process hyperparameters Conference

Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.

@conference{liu2020task,

title = {Task-agnostic amortized inference of Gaussian process hyperparameters},

author = {Sulin Liu and Xingyuan Sun and Peter J. Ramadge and Ryan P. Adams},

year = {2020},

date = {2020-12-01},

booktitle = {Advances in Neural Information Processing Systems 33 (NeurIPS)},

abstract = {Gaussian processes (GPs) are flexible priors for modeling functions. However, their success depends on the kernel accurately reflecting the properties of the data. One of the appeals of the GP framework is that the marginal likelihood of the kernel hyperparameters is often available in closed form, enabling optimization and sampling procedures to fit these hyperparameters to data. Unfortunately, point-wise evaluation of the marginal likelihood is expensive due to the need to solve a linear system; searching or sampling the space of hyperparameters thus often dominates the practical cost of using GPs. We introduce an approach to the identification of kernel hyperparameters in GP regression and related problems that sidesteps the need for costly marginal likelihoods. Our strategy is to "amortize" inference over hyperparameters by training a single neural network, which consumes a set of regression data and produces an estimate of the kernel function, useful across different tasks. To accommodate the varying dimension and cardinality of different regression problems, we use a hierarchical self-attention-based neural network that produces estimates of the hyperparameters which are invariant to the order of the input data points and data dimensions. We show that a single neural model trained on synthetic data is able to generalize directly to several different unseen real-world GP use cases. Our experiments demonstrate that the estimated hyperparameters are comparable in quality to those from the conventional model selection procedures, while being much faster to obtain, significantly accelerating GP regression and its related applications such as Bayesian optimization and Bayesian quadrature. The code and pre-trained model are available at https://github.com/PrincetonLIPS/AHGP.},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Beatson, Alex; Ash, Jordan T.; Roeder, Geoffrey; Xue, Tianju; Adams, Ryan P.

Learning Composable Energy Surrogates for PDE Order Reduction Conference

Advances in Neural Information Processing Systems 33 (NeurIPS), 2020.

@conference{beatson2020composable,

title = {Learning Composable Energy Surrogates for PDE Order Reduction},

author = {Alex Beatson and Jordan T. Ash and Geoffrey Roeder and Tianju Xue and Ryan P. Adams},

url = {https://arxiv.org/abs/2005.06549},

year = {2020},

date = {2020-05-13},

booktitle = {Advances in Neural Information Processing Systems 33 (NeurIPS)},

abstract = {Meta-materials are an important emerging class of engineered materials in which complex macroscopic behaviour--whether electromagnetic, thermal, or mechanical--arises from modular substructure. Simulation and optimization of these materials are computationally challenging, as rich substructures necessitate high-fidelity finite element meshes to solve the governing PDEs. To address this, we leverage parametric modular structure to learn component-level surrogates, enabling cheaper high-fidelity simulation. We use a neural network to model the stored potential energy in a component given boundary conditions. This yields a structured prediction task: macroscopic behavior is determined by the minimizer of the system's total potential energy, which can be approximated by composing these surrogate models. Composable energy surrogates thus permit simulation in the reduced basis of component boundaries. Costly ground-truth simulation of the full structure is avoided, as training data are generated by performing finite element analysis with individual components. Using dataset aggregation to choose training boundary conditions allows us to learn energy surrogates which produce accurate macroscopic behavior when composed, accelerating simulation of parametric meta-materials.},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Fedorov, Igor; Adams, Ryan P.; Mattina, Matthew; Whatmough, Paul N.

SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers Conference

Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.

@conference{fedorov2019sparse,

title = {SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers},

author = {Igor Fedorov and

Ryan P. Adams and

Matthew Mattina and

Paul N. Whatmough},

url = {https://www.cs.princeton.edu/~rpa/pubs/fedorov2019sparse.pdf},

year = {2019},

date = {2019-12-04},

booktitle = {Advances in Neural Information Processing Systems 32 (NeurIPS)},

abstract = {The vast majority of processors in the world are actually microcontroller units (MCUs), which find widespread use performing simple control tasks in applications ranging from automobiles to medical devices and office equipment. The Internet of Things (IoT) promises to inject machine learning into many of these every-day objects via tiny, cheap MCUs. However, these resource-impoverished hardware platforms severely limit the complexity of machine learning models that can be deployed. For example, although convolutional neural networks (CNNs) achieve state-of-the-art results on many visual recognition tasks, CNN inference on MCUs is challenging due to severe finite memory limitations. To circumvent the memory challenge associated with CNNs, various alternatives have been proposed that do fit within the memory budget of an MCU, albeit at the cost of prediction accuracy. This paper challenges the idea that CNNs are not suitable for deployment on MCUs. We demonstrate that it is possible to automatically design CNNs which generalize well, while also being small enough to fit onto memory-limited MCUs. Our Sparse Architecture Search method combines neural architecture search with pruning in a single, unified approach, which learns superior models on four popular IoT datasets. The CNNs we find are more accurate and up to 4.35× smaller than previous approaches, while meeting the strict MCU working memory constraint.},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Seff, Ari; Zhou, Wenda; Damani, Farhan; Doyle, Abigail; Adams, Ryan P.

Discrete Object Generation with Reversible Inductive Construction Conference

Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.

@conference{seff2019discrete,

title = {Discrete Object Generation with Reversible Inductive Construction},

author = {Ari Seff and

Wenda Zhou and

Farhan Damani and

Abigail Doyle and

Ryan P. Adams},

url = {https://www.cs.princeton.edu/~rpa/pubs/seff2019discrete.pdf},

year = {2019},

date = {2019-12-04},

booktitle = {Advances in Neural Information Processing Systems 32 (NeurIPS)},

abstract = {The success of generative modeling in continuous domains has led to a surge of interest in generating discrete data such as molecules, source code, and graphs. However, construction histories for these discrete objects are typically not unique and so generative models must reason about intractably large spaces in order to learn. Additionally, structured discrete domains are often characterized by strict constraints on what constitutes a valid object and generative models must respect these requirements in order to produce useful novel samples. Here, we present a generative model for discrete objects employing a Markov chain where transitions are restricted to a set of local operations that preserve validity. Building off of generative interpretations of denoising autoencoders, the Markov chain alternates between producing 1) a sequence of corrupted objects that are valid but not from the data distribution, and 2) a learned reconstruction distribution that attempts to fix the corruptions while also preserving validity. This approach constrains the generative model to only produce valid objects, requires the learner to only discover local modifications to the objects, and avoids marginalization over an unknown and potentially large space of construction histories. We evaluate the proposed approach on two highly structured discrete domains, molecules and Laman graphs, and find that it compares favorably to alternative methods at capturing distributional statistics for a host of semantically relevant metrics.},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Ash, Jordan T.; Adams, Ryan P.

On the Difficulty of Warm-Starting Neural Network Training Technical Report

2019.

@techreport{ash2018warm,

title = {On the Difficulty of Warm-Starting Neural Network Training},

author = {Jordan T. Ash and Ryan P. Adams},

url = {https://arxiv.org/abs/1910.08475},

year = {2019},

date = {2019-10-18},

abstract = {In many real-world deployments of machine learning systems, data arrive piecemeal. These learning scenarios may be passive, where data arrive incrementally due to structural properties of the problem (e.g., daily financial data) or active, where samples are selected according to a measure of their quality (e.g., experimental design). In both of these cases, we are building a sequence of models that incorporate an increasing amount of data. We would like each of these models in the sequence to be performant and take advantage of all the data that are available to that point. Conventional intuition suggests that when solving a sequence of related optimization problems of this form, it should be possible to initialize using the solution of the previous iterate---to "warm start" the optimization rather than initialize from scratch---and see reductions in wall-clock time. However, in practice this warm-starting seems to yield poorer generalization performance than models that have fresh random initializations, even though the final training losses are similar. While it appears that some hyperparameter settings allow a practitioner to close this generalization gap, they seem to only do so in regimes that damage the wall-clock gains of the warm start. Nevertheless, it is highly desirable to be able to warm-start neural network training, as it would dramatically reduce the resource usage associated with the construction of performant deep learning systems. In this work, we take a closer look at this empirical phenomenon and try to understand when and how it occurs. Although the present investigation did not lead to a solution, we hope that a thorough articulation of the problem will spur new research that may lead to improved methods that consume fewer resources during training.},

keywords = {},

pubstate = {published},

tppubtype = {techreport}

}

Beatson, Alex; Adams, Ryan P.

Efficient Optimization of Loops and Limits with Randomized Telescoping Sums Conference

Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

@conference{beatson2019efficient,

title = {Efficient Optimization of Loops and Limits with Randomized Telescoping Sums},

author = {Alex Beatson and

Ryan P. Adams},

url = {https://www.cs.princeton.edu/~rpa/pubs/beatson2019efficient.pdf},

year = {2019},

date = {2019-06-13},

booktitle = {Proceedings of the 36th International Conference on Machine Learning (ICML)},

abstract = {We consider optimization problems in which the objective requires an inner loop with many steps or is the limit of a sequence of increasingly costly approximations. Meta-learning, training recurrent neural networks, and optimization of the solutions to differential equations are all examples of optimization problems with this character. In such problems, it can be expensive to compute the objective function value and its gradient, but truncating the loop or using less accurate approximations can induce biases that damage the overall solution. We propose randomized telescope (RT) gradient estimators, which represent the objective as the sum of a telescoping series and sample linear combinations of terms to provide cheap unbiased gradient estimates. We identify conditions under which RT estimators achieve optimization convergence rates independent of the length of the loop or the required accuracy of the approximation. We also derive a method for tuning RT estimators online to maximize a lower bound on the expected decrease in loss per unit of computation. We evaluate our adaptive RT estimators on a range of applications including meta-optimization of learning rates, variational inference of ODE parameters, and training an LSTM to model long sequences.},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Zhou, Wenda; Veitch, Victor; Austern, Morgane; Adams, Ryan P.; Orbanz, Peter

Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach Conference

Proceedings of the Seventh International Conference on Learning Representations (ICLR), 2019.

@conference{zhou2019nonvacuous,

title = {Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach},

author = {Wenda Zhou and

Victor Veitch and

Morgane Austern and

Ryan P. Adams and

Peter Orbanz},

url = {https://www.cs.princeton.edu/~rpa/pubs/zhou2019nonvacuous.pdf},

year = {2019},

date = {2019-04-18},

booktitle = {Proceedings of the Seventh International Conference on Learning Representations (ICLR)},

abstract = {Modern neural networks are highly overparameterized, with capacity to substantially overfit to training data. Nevertheless, these networks often generalize well in practice. It has also been observed that trained networks can often be "compressed" to much smaller representations. The purpose of this paper is to connect these two empirical observations. Our main technical result is a generalization bound for compressed networks based on the compressed size. Combined with off-the-shelf compression algorithms, the bound leads to state of the art generalization guarantees; in particular, we provide the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem. As additional evidence connecting compression and generalization, we show that compressibility of models that tend to overfit is limited: We establish an absolute limit on expected compressibility as a function of expected generalization error, where the expectations are over the random choice of training examples. The bounds are complemented by empirical results that show an increase in overfitting implies an increase in the number of bits required to describe a trained network.},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Wei, Jennifer N.; Belanger, David; Adams, Ryan P.; Sculley, D.

Rapid Prediction of Electron–Ionization Mass Spectrometry Using Neural Networks Journal Article

In: ACS Central Science, vol. 5, no. 4, pp. 700-708, 2019.

@article{wei2019rapid,

title = {Rapid Prediction of Electron–Ionization Mass Spectrometry Using Neural Networks},

author = {Jennifer N. Wei and

David Belanger and

Ryan P. Adams and

D. Sculley},

url = {https://www.cs.princeton.edu/~rpa/pubs/wei2019rapid.pdf},

year = {2019},

date = {2019-03-19},

journal = {ACS Central Science},

volume = {5},

number = {4},

pages = {700-708},

abstract = {When confronted with a substance of unknown identity, researchers often perform mass spectrometry on the sample and compare the observed spectrum to a library of previously collected spectra to identify the molecule. While popular, this approach will fail to identify molecules that are not in the existing library. In response, we propose to improve the library’s coverage by augmenting it with synthetic spectra that are predicted from candidate molecules using machine learning. We contribute a lightweight neural network model that quickly predicts mass spectra for small molecules, averaging 5 ms per molecule with a recall-at-10 accuracy of 91.8%. Achieving high-accuracy predictions requires a novel neural network architecture that is designed to capture typical fragmentation patterns from electron ionization. We analyze the effects of our modeling innovations on library matching performance and compare our models to prior machine-learning-based work on spectrum prediction.},

keywords = {},

pubstate = {published},

tppubtype = {article}

}

Gilmer, Justin; Adams, Ryan P.; Goodfellow, Ian; Andersen, David; Dahl, George E.

Motivating the Rules of the Game for Adversarial Example Research Technical Report

2018.

@techreport{gilmer2018adversarial,

title = {Motivating the Rules of the Game for Adversarial Example Research},

author = {Justin Gilmer and Ryan P. Adams and Ian Goodfellow and David Andersen and George E. Dahl},

url = {https://arxiv.org/abs/1807.06732},

year = {2018},

date = {2018-07-18},

abstract = {Advances in machine learning have led to broad deployment of systems with impressive performance on important problems. Nonetheless, these systems can be induced to make errors on data that are surprisingly similar to examples the learned system handles correctly. The existence of these errors raises a variety of questions about out-of-sample generalization and whether bad actors might use such examples to abuse deployed systems. As a result of these security concerns, there has been a flurry of recent papers proposing algorithms to defend against such malicious perturbations of correctly handled examples. It is unclear how such misclassifications represent a different kind of security problem than other errors, or even other attacker-produced examples that have no specific relationship to an uncorrupted input. In this paper, we argue that adversarial example defense papers have, to date, mostly considered abstract, toy games that do not relate to any specific security concern. Furthermore, defense papers have not yet precisely described all the abilities and limitations of attackers that would be relevant in practical security. Towards this end, we establish a taxonomy of motivations, constraints, and abilities for more plausible adversaries. Finally, we provide a series of recommendations outlining a path forward for future work to more clearly articulate the threat model and perform more meaningful evaluation.},

keywords = {},

pubstate = {published},

tppubtype = {techreport}

}

Saeedi, Ardavan; Hoffman, Matthew D.; DiVerdi, Stephen J.; Ghandeharioun, Asma; Johnson, Matthew J.; Adams, Ryan P.

Multimodal Prediction and Personalization of Photo Edits with Deep Generative Models Conference

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018, (arXiv:1704.04997 [stat.ML]).

@conference{saeedi2018multimodal,

title = {Multimodal Prediction and Personalization of Photo Edits with Deep Generative Models},

author = {Ardavan Saeedi and Matthew D. Hoffman and Stephen J. DiVerdi and Asma Ghandeharioun and Matthew J. Johnson and Ryan P. Adams},

url = {http://www.cs.princeton.edu/~rpa/pubs/saeedi2018multimodal.pdf},

year = {2018},

date = {2018-01-01},

booktitle = {Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS)},

abstract = {Professional-grade software applications are powerful but

complicated---expert users can achieve impressive results, but

novices often struggle to complete even basic tasks. Photo

editing is a prime example: after loading a photo, the user is

confronted with an array of cryptic sliders like "clarity",

"temp", and "highlights". An automatically generated

suggestion could help, but there is no single "correct" edit

for a given image---different experts may make very different

aesthetic decisions when faced with the same image, and a

single expert may make different choices depending on the

intended use of the image (or on a whim). We therefore want a

system that can propose multiple diverse, high-quality edits

while also learning from and adapting to a user's aesthetic

preferences. In this work, we develop a statistical model that

meets these objectives. Our model builds on recent advances in

neural network generative modeling and scalable inference, and

uses hierarchical structure to learn editing patterns across

many diverse users. Empirically, we find that our model

outperforms other approaches on this challenging multimodal

prediction task.},

note = {arXiv:1704.04997 [stat.ML]},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Duvenaud, David; Maclaurin, Dougal; Adams, Ryan P.

Early Stopping is Nonparametric Variational Inference Conference

Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2016, (arXiv:1504.01344 [stat.ML]).

@conference{duvenaud2016early,

title = {Early Stopping is Nonparametric Variational Inference},

author = {David Duvenaud and Dougal Maclaurin and Ryan P. Adams},

url = {http://www.cs.princeton.edu/~rpa/pubs/duvenaud2016early.pdf},

year = {2016},

date = {2016-01-01},

booktitle = {Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS)},

abstract = {We show that unconverged stochastic gradient descent can be

interpreted as a procedure that samples from a nonparametric

variational approximate posterior distribution. This

distribution is implicitly defined as the transformation of an

initial distribution by a sequence of optimization updates. By

tracking the change in entropy over this sequence of

transformations during optimization, we form a scalable,

unbiased estimate of the variational lower bound on the log

marginal likelihood. We can use this bound to optimize

hyperparameters instead of using cross-validation. This

Bayesian interpretation of SGD suggests improved,

overfitting-resistant optimization procedures, and gives a

theoretical foundation for popular tricks such as early

stopping and ensembling. We investigate the properties of this

marginal likelihood estimator on neural network models.},

note = {arXiv:1504.01344 [stat.ML]},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Johnson, Matthew J.; Duvenaud, David; Wiltschko, Alexander B.; Datta, Sandeep Robert; Adams, Ryan P.

Composing Graphical Models with Neural Networks for Structured Representations and Fast Inference Conference

Advances in Neural Information Processing Systems (NIPS) 29, 2016, (arXiv:1603.06277 [stat.ML]).

@conference{johnson2016svae,

title = {Composing Graphical Models with Neural Networks for Structured Representations and Fast Inference},

author = {Matthew J. Johnson and David Duvenaud and Alexander B. Wiltschko and Sandeep Robert Datta and Ryan P. Adams},

url = {http://www.cs.princeton.edu/~rpa/pubs/johnson2016svae.pdf},

year = {2016},

date = {2016-01-01},

booktitle = {Advances in Neural Information Processing Systems (NIPS) 29},

abstract = {We propose a general modeling and inference framework that

composes probabilistic graphical models with deep learning

methods and combines their respective strengths. Our model

family augments graphical structure in latent variables with

neural network observation models. For inference, we extend

variational autoencoders to use graphical model approximating

distributions with recognition networks that output conjugate

potentials. All components of these models are learned

simultaneously with a single objective, giving a scalable

algorithm that leverages stochastic variational inference,

natural gradients, graphical model message passing, and the

reparameterization trick. We illustrate this framework with

several example models and an application to mouse behavioral

phenotyping.},

note = {arXiv:1603.06277 [stat.ML]},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Nemati, Shamim; Adams, Ryan P.

Identifying Outcome-Discriminative Dynamics in Multivariate Physiological Cohort Time Series Book Chapter

In: Advanced State Space Methods for Neural and Clinical Data, Cambridge University Press, Cambridge, UK, 2015.

@inbook{nemati2015identifying,

title = {Identifying Outcome-Discriminative Dynamics in Multivariate Physiological Cohort Time Series},

author = {Shamim Nemati and Ryan P. Adams},

url = {http://www.cs.princeton.edu/~rpa/pubs/nemati2015identifying.pdf},

year = {2015},

date = {2015-01-01},

booktitle = {Advanced State Space Methods for Neural and Clinical Data},

publisher = {Cambridge University Press},

address = {Cambridge, UK},

abstract = {In this chapter, we present a learning algorithm specifically

designed to learn dynamical features of time series that are

directly predictive of the associated labels. Rather than

depending on label-free unsupervised learning to discover

relevant features of the time series, we build a system that

expressly learns the dynamics that are most relevant for

classifying time series labels. Our goal is to obtain compact

representations of nonstationary and multivariate time series

(representation learning) (Bengio, Courville & Vincent

2013). To accomplish this we use a connection between dynamic

Bayesian networks (e.g., the switching VAR model) and

artificial neural networks (ANNs) to perform inference and

learning in state-space models in a manner analogous to

backpropagation in neural networks (Rumelhart, Hinton & Williams

1988). This connection stems from the observation that the

directed acyclic graph structure of a state-space model can be

unrolled both as a function of time and inference steps to

yield a deterministic neural network with efficient

parameter tying across time (see Fig. 1.2). Thus, the

parameters governing the dynamics and observation model of a

state-space model can be learned in a manner analogous to that

of a neural network. Indeed, the resulting system can be

viewed as a compactly-parameterized recurrent neural network

(RNN) (Sutskever 2013). Although the standard use of RNNs has

been for time series prediction (network output is the

predicted input time series in the future) or sequential

labeling (when output is a label sequence associated with the

input data sequence), with additional processing layers one

may obtain a time series classifier from this class of models

(Graves, Fernández, Gomez & Schmidhuber 2006). Nevertheless,

RNNs have proven hard to train, since the optimization surface

tends to include multiple local minima. Moreover, standard

RNNs are ‘black box’ algorithms (as opposed to ‘model-based’)

and therefore do not allow for incorporation of physiological

models of the underlying systems. The framework proposed

here addresses both these shortcomings. First, knowledge of

the underlying physiology can be directly incorporated into

the state-space models that constitute the basic building

blocks of a dynamic Bayesian network. Secondly, equipped with

a generative model, we can rely on unsupervised pre-training

(via expectation maximization) to systematically initialize

the parameters of the equivalent RNN, in a manner analogous to

pre-training of very large neural networks (deep learning)

(Erhan, Bengio, Courville, Manzagol, Vincent & Bengio 2010).},

keywords = {},

pubstate = {published},

tppubtype = {inbook}

}

Maclaurin, Dougal; Duvenaud, David; Adams, Ryan P.

Gradient-based Hyperparameter Optimization through Reversible Learning Conference

Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, (arXiv:1502.03492 [stat.ML]).

@conference{maclaurin2015reversible,

title = {Gradient-based Hyperparameter Optimization through Reversible Learning},

author = {Dougal Maclaurin and David Duvenaud and Ryan P. Adams},

url = {http://www.cs.princeton.edu/~rpa/pubs/maclaurin2015reversible.pdf},

year = {2015},

date = {2015-01-01},

booktitle = {Proceedings of the 32nd International Conference on Machine Learning (ICML)},

abstract = {Tuning hyperparameters of learning algorithms is hard because

gradients are usually unavailable. We compute exact gradients

of cross-validation performance with respect to all

hyperparameters by chaining derivatives backwards through the

entire training procedure. These gradients allow us to

optimize thousands of hyperparameters, including step-size and

momentum schedules, weight initialization distributions,

richly parameterized regularization schemes, and neural

network architectures. We compute hyperparameter gradients by

exactly reversing the dynamics of stochastic gradient descent

with momentum.},

note = {arXiv:1502.03492 [stat.ML]},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Snoek, Jasper; Rippel, Oren; Swersky, Kevin; Kiros, Ryan; Satish, Nadathur; Sundaram, Narayanan; Patwary, Md. Mostofa Ali; Prabhat; Adams, Ryan P.

Scalable Bayesian Optimization Using Deep Neural Networks Conference

Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, (arXiv:1502.05700 [stat.ML]).

@conference{snoek2015scalable,

title = {Scalable Bayesian Optimization Using Deep Neural Networks},

author = {Jasper Snoek and Oren Rippel and Kevin Swersky and Ryan Kiros and Nadathur Satish and Narayanan Sundaram and Md. Mostofa Ali Patwary and Prabhat and Ryan P. Adams},

url = {http://www.cs.princeton.edu/~rpa/pubs/snoek2015scalable.pdf},

year = {2015},

date = {2015-01-01},

booktitle = {Proceedings of the 32nd International Conference on Machine Learning (ICML)},

abstract = {Bayesian optimization is an effective methodology for the global

optimization of functions with expensive evaluations. It

relies on querying a distribution over functions defined by a

relatively cheap surrogate model. An accurate model for this

distribution over functions is critical to the effectiveness

of the approach, and is typically fit using Gaussian processes

(GPs). However, since GPs scale cubically with the number of

observations, it has been challenging to handle objectives

whose optimization requires many evaluations, and as such,

massively parallelizing the optimization. In this work, we

explore the use of neural networks as an alternative to GPs to

model distributions over functions. We show that performing

adaptive basis function regression with a neural network as

the parametric form performs competitively with

state-of-the-art GP-based approaches, but scales linearly with

the number of data rather than cubically. This allows us to

achieve a previously intractable degree of parallelism, which

we apply to large scale hyperparameter optimization, rapidly

finding competitive models on benchmark object recognition

tasks using convolutional networks, and image caption

generation using neural language models.},

note = {arXiv:1502.05700 [stat.ML]},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Hernández-Lobato, José Miguel; Adams, Ryan P.

Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks Conference

Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, (arXiv:1502.05336 [stat.ML]).

@conference{lobato2015probabilistic,

title = {Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks},

author = {José Miguel Hernández-Lobato and Ryan P. Adams},

url = {http://www.cs.princeton.edu/~rpa/pubs/lobato2015probabilistic.pdf},

year = {2015},

date = {2015-01-01},

booktitle = {Proceedings of the 32nd International Conference on Machine Learning (ICML)},

abstract = {Large multilayer neural networks trained with backpropagation

have recently achieved state-of-the-art results in a wide

range of problems. However, using backprop for neural net

learning still has some disadvantages, e.g., having to tune a

large number of hyperparameters to the data, lack of

calibrated probabilistic predictions, and a tendency to

overfit the training data. In principle, the Bayesian approach

to learning neural networks does not have these

problems. However, existing Bayesian techniques lack

scalability to large dataset and network sizes. In this work

we present a novel scalable method for learning Bayesian

neural networks, called probabilistic backpropagation

(PBP). Similar to classical backpropagation, PBP works by

computing a forward propagation of probabilities through the

network and then doing a backward computation of gradients. A

series of experiments on ten real-world datasets show that PBP

is significantly faster than other techniques, while offering

competitive predictive abilities. Our experiments also show

that PBP provides accurate estimates of the posterior variance

on the network weights.},

note = {arXiv:1502.05336 [stat.ML]},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Rippel, Oren; Snoek, Jasper; Adams, Ryan P.

Spectral Representations for Convolutional Neural Networks Conference

Advances in Neural Information Processing Systems (NIPS) 28, 2015, (arXiv:1506.03767 [stat.ML]).

@conference{rippel2015spectral,

title = {Spectral Representations for Convolutional Neural Networks},

author = {Oren Rippel and Jasper Snoek and Ryan P. Adams},

url = {http://www.cs.princeton.edu/~rpa/pubs/rippel2015spectral.pdf},

year = {2015},

date = {2015-01-01},

booktitle = {Advances in Neural Information Processing Systems (NIPS) 28},

abstract = {Discrete Fourier transforms provide a significant speedup in the

computation of convolutions in deep learning. In this work, we

demonstrate that, beyond its advantages for efficient

computation, the spectral domain also provides a powerful

representation in which to model and train convolutional

neural networks (CNNs). We employ spectral representations to

introduce a number of innovations to CNN design. First, we

propose spectral pooling, which performs dimensionality

reduction by truncating the representation in the frequency

domain. This approach preserves considerably more information

per parameter than other pooling strategies and enables

flexibility in the choice of pooling output

dimensionality. This representation also enables a new form of

stochastic regularization by randomized modification of

resolution. We show that these methods achieve competitive

results on classification and approximation tasks, without

using any dropout or max-pooling. Finally, we demonstrate the

effectiveness of complex-coefficient spectral parameterization

of convolutional filters. While this leaves the underlying

model unchanged, it results in a representation that greatly

facilitates optimization. We observe on a variety of popular

CNN configurations that this leads to significantly faster

convergence during training.},

note = {arXiv:1506.03767 [stat.ML]},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Duvenaud, David; Maclaurin, Dougal; Aguilera-Iparraguirre, Jorge; Gómez-Bombarelli, Rafael; Hirzel, Timothy D.; Aspuru-Guzik, Alan; Adams, Ryan P.

Convolutional Networks on Graphs for Learning Molecular Fingerprints Conference

Advances in Neural Information Processing Systems (NIPS) 28, 2015, (arXiv:1509.09292 [stat.ML]).

@conference{duvenaud2015fingerprints,

title = {Convolutional Networks on Graphs for Learning Molecular Fingerprints},

author = {David Duvenaud and Dougal Maclaurin and Jorge Aguilera-Iparraguirre and Rafael Gómez-Bombarelli and Timothy D. Hirzel and Alan Aspuru-Guzik and Ryan P. Adams},

url = {http://www.cs.princeton.edu/~rpa/pubs/duvenaud2015fingerprints.pdf},

year = {2015},

date = {2015-01-01},

booktitle = {Advances in Neural Information Processing Systems (NIPS) 28},

abstract = {We introduce a convolutional neural network that operates

directly on graphs. These networks allow end-to-end learning

of prediction pipelines whose inputs are graphs of arbitrary

size and shape. The architecture we present generalizes

standard molecular feature extraction methods based on

circular fingerprints. We show that these data-driven features

are more interpretable, and have better predictive performance

on a variety of tasks.},

note = {arXiv:1509.09292 [stat.ML]},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Duvenaud, David; Rippel, Oren; Adams, Ryan P.; Ghahramani, Zoubin

Avoiding Pathologies in Very Deep Networks Conference

Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 2014, (arXiv:1402.5836 [stat.ML]).

@conference{duvenaud2014pathologies,

title = {Avoiding Pathologies in Very Deep Networks},

author = {David Duvenaud and Oren Rippel and Ryan P. Adams and Zoubin Ghahramani},

url = {http://www.cs.princeton.edu/~rpa/pubs/duvenaud2014pathologies.pdf},

year = {2014},

date = {2014-01-01},

booktitle = {Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS)},

abstract = {Choosing appropriate architectures and regularization

strategies for deep networks is crucial to good predictive

performance. To shed light on this problem, we analyze the

analogous problem of constructing useful priors on

compositions of functions. Specifically, we study the deep

Gaussian process, a type of infinitely-wide, deep neural

network. We show that in standard architectures, the

representational capacity of the network tends to capture

fewer degrees of freedom as the number of layers increases,

retaining only a single degree of freedom in the limit. We

propose an alternate network architecture which does not

suffer from this pathology. We also examine deep covariance

functions, obtained by composing infinitely many feature

transforms. Lastly, we characterize the class of models

obtained by performing dropout on Gaussian processes.},

note = {arXiv:1402.5836 [stat.ML]},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Rippel, Oren; Gelbart, Michael A.; Adams, Ryan P.

Learning Ordered Representations with Nested Dropout Conference

Proceedings of the 31st International Conference on Machine Learning (ICML), 2014, (arXiv:1402.0915 [stat.ML]).

@conference{rippel2014nested,

title = {Learning Ordered Representations with Nested Dropout},

author = {Oren Rippel and Michael A. Gelbart and Ryan P. Adams},

url = {http://www.cs.princeton.edu/~rpa/pubs/rippel2014nested.pdf},

year = {2014},

date = {2014-01-01},

booktitle = {Proceedings of the 31st International Conference on Machine Learning (ICML)},

abstract = {In this paper, we study ordered representations of data in

which different dimensions have different degrees of

importance. To learn these representations we introduce nested

dropout, a procedure for stochastically removing coherent

nested sets of hidden units in a neural network. We first

present a sequence of theoretical results in the simple case

of a semi-linear autoencoder. We rigorously show that the

application of nested dropout enforces identifiability of the

units, which leads to an exact equivalence with PCA. We then

extend the algorithm to deep models and demonstrate the

relevance of ordered representations to a number of

applications. Specifically, we use the ordered property of the

learned codes to construct hash-based data structures that

permit very fast retrieval, achieving retrieval in time

logarithmic in the database size and independent of the

dimensionality of the representation. This allows codes that

are hundreds of times longer than currently feasible for

retrieval. We therefore avoid the diminished quality

associated with short codes, while still performing retrieval

that is competitive in speed with existing methods. We also

show that ordered representations are a promising way to learn

adaptive compression for efficient online data

reconstruction.},

note = {arXiv:1402.0915 [stat.ML]},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Rippel, Oren; Adams, Ryan P.

High-Dimensional Probability Estimation with Deep Density Models Unpublished

2013, (arXiv:1302.5125 [stat.ML]).

@unpublished{rippel2013density,

title = {High-Dimensional Probability Estimation with Deep Density Models},

author = {Oren Rippel and Ryan P. Adams},

url = {http://www.cs.princeton.edu/~rpa/pubs/rippel2013density.pdf},

year = {2013},

date = {2013-01-01},

abstract = {One of the fundamental problems in machine learning is the

estimation of a probability distribution from data. Many

techniques have been proposed to study the structure of data,

most often building around the assumption that observations

lie on a lower-dimensional manifold of high probability. It

has been more difficult, however, to exploit this insight to

build explicit, tractable density models for high-dimensional

data. In this paper, we introduce the deep density model

(DDM), a new approach to density estimation. We exploit

insights from deep learning to construct a bijective map to a

representation space, under which the transformation of the

distribution of the data is approximately factorized and has

identical and known marginal densities. The simplicity of the

latent distribution under the model allows us to feasibly

explore it, and the invertibility of the map to characterize

contraction of measure across it. This enables us to compute

normalized densities for out-of-sample data. This combination

of tractability and flexibility allows us to tackle a variety

of probabilistic tasks on high-dimensional datasets,

including: rapid computation of normalized densities at

test-time without evaluating a partition function; generation

of samples without MCMC; and characterization of the joint

entropy of the data.},

note = {arXiv:1302.5125 [stat.ML]},

keywords = {},

pubstate = {published},

tppubtype = {unpublished}

}

Snoek, Jasper; Adams, Ryan P.; Larochelle, Hugo

On Nonparametric Guidance for Learning Autoencoder Representations Conference

Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012, (arXiv:1102.1492v4 [stat.ML]).

@conference{snoek2012guidance,

title = {On Nonparametric Guidance for Learning Autoencoder Representations},

author = {Jasper Snoek and Ryan P. Adams and Hugo Larochelle},

url = {http://www.cs.princeton.edu/~rpa/pubs/snoek2012guidance.pdf},

year = {2012},

date = {2012-01-01},

booktitle = {Proceedings of the 15th International Conference on Artificial

Intelligence and Statistics (AISTATS)},

abstract = {Unsupervised discovery of latent representations, in addition

to being useful for density modeling, visualisation and

exploratory data analysis, is also increasingly important for

learning features relevant to discriminative

tasks. Autoencoders, in particular, have proven to be an

effective way to learn latent codes that reflect meaningful

variations in data. A continuing challenge, however, is

guiding an autoencoder toward representations that are useful

for particular tasks. A complementary challenge is to find

codes that are invariant to irrelevant transformations of the

data. The most common way of introducing such problem-specific

guidance in autoencoders has been through the incorporation of

a parametric component that ties the latent representation to

the label information. In this work, we argue that a

preferable approach relies instead on a nonparametric guidance

mechanism. Conceptually, it ensures that there exists a

function that can predict the label information, without

explicitly instantiating that function. The superiority of

this guidance mechanism is confirmed on two datasets. In

particular, this approach is able to incorporate invariance

information (lighting, elevation, etc.) from the small NORB

object recognition dataset and yields state-of-the-art

performance for a single layer, non-convolutional network.},

note = {arXiv:1102.1492v4 [stat.ML]},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Dahl, George E.; Adams, Ryan P.; Larochelle, Hugo

Training Restricted Boltzmann Machines on Word Observations Conference

Proceedings of the 29th International Conference on Machine Learning (ICML), 2012, (arXiv:1202.5695 [cs.LG]).

@conference{dahl2012training,

title = {Training Restricted Boltzmann Machines on Word Observations},

author = {George E. Dahl and Ryan P. Adams and Hugo Larochelle},

url = {http://www.cs.princeton.edu/~rpa/pubs/dahl2012training.pdf},

year = {2012},

date = {2012-01-01},

booktitle = {Proceedings of the 29th International Conference on Machine

Learning (ICML)},

abstract = {The restricted Boltzmann machine (RBM) is a flexible model for

complex data. However, using RBMs for high-dimensional

multinomial observations poses significant computational

difficulties. In natural language processing applications, words

are naturally modeled by K-ary discrete distributions, where K

is determined by the vocabulary size and can easily be in the

hundreds of thousands. The conventional approach to training RBMs

on word observations is limited because it requires sampling

the states of K-way softmax visible units during block Gibbs

updates, an operation that takes time linear in K. In this

work, we address this issue with a more general class of

Markov chain Monte Carlo operators on the visible units,

yielding updates with computational complexity independent of

K. We demonstrate the success of our approach by training RBMs

on hundreds of millions of word n-grams using larger

vocabularies than previously feasible with RBMs and by using

the learned features to improve performance on chunking and

sentiment classification tasks, achieving state-of-the-art

results on the latter.},

note = {arXiv:1202.5695 [cs.LG]},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}

Snoek, Jasper; Adams, Ryan P.; Larochelle, Hugo

Nonparametric Guidance of Autoencoder Representations Using Label Information Journal Article

In: Journal of Machine Learning Research, vol. 13, pp. 2567–2588, 2012.

@article{snoek2012autoencoder,

title = {Nonparametric Guidance of Autoencoder Representations Using

Label Information},

author = {Jasper Snoek and Ryan P. Adams and Hugo Larochelle},

url = {http://www.cs.princeton.edu/~rpa/pubs/snoek2012autoencoder.pdf},

year = {2012},

date = {2012-01-01},

journal = {Journal of Machine Learning Research},

volume = {13},

pages = {2567--2588},

abstract = {While unsupervised learning has long been useful for density

modeling, exploratory data analysis and visualization, it has

become increasingly important for discovering features that

will later be used for discriminative tasks. Discriminative

algorithms often work best with highly-informative features;

remarkably, such features can often be learned without the

labels. One particularly effective way to perform such

unsupervised learning has been to use autoencoder neural

networks, which find latent representations that are

constrained but nevertheless informative for

reconstruction. However, pure unsupervised learning with

autoencoders can find representations that may or may not be

useful for the ultimate discriminative task. It is a

continuing challenge to guide the training of an autoencoder

so that it finds features which will be useful for predicting

labels. Similarly, we often have a priori information

regarding what statistical variation will be irrelevant to the

ultimate discriminative task, and we would like to be able to

use this for guidance as well. Although a typical strategy

would be to include a parametric discriminative model as part

of the autoencoder training, here we propose a nonparametric

approach that uses a Gaussian process to guide the

representation. By using a nonparametric model, we can ensure

that a useful discriminative function exists for a given set

of features, without explicitly instantiating it. We

demonstrate the superiority of this guidance mechanism on four

data sets, including a real-world application to

rehabilitation research. We also show how our proposed

approach can learn to explicitly ignore statistically

significant covariate information that is label-irrelevant, by

evaluating on the small NORB image recognition problem in

which pose and lighting labels are available.},

keywords = {},

pubstate = {published},

tppubtype = {article}

}

Adams, Ryan P.; Wallach, Hanna M.; Ghahramani, Zoubin

Learning the Structure of Deep Sparse Graphical Models Conference

Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, (arXiv:1001.0160 [stat.ML]).

@conference{adams2010deep,

title = {Learning the Structure of Deep Sparse Graphical Models},

author = {Ryan P. Adams and Hanna M. Wallach and Zoubin Ghahramani},

url = {http://www.cs.princeton.edu/~rpa/pubs/adams2010deep.pdf},

year = {2010},

date = {2010-01-01},

booktitle = {Proceedings of the 13th International Conference on Artificial

Intelligence and Statistics (AISTATS)},

abstract = {Deep belief networks are a powerful way to model complex

probability distributions. However, learning the structure of

a belief network, particularly one with hidden units, is

difficult. The Indian buffet process has been used as a

nonparametric Bayesian prior on the directed structure of a

belief network with a single infinitely wide hidden layer. In

this paper, we introduce the cascading Indian buffet process

(CIBP), which provides a nonparametric prior on the structure

of a layered, directed belief network that is unbounded in

both depth and width, yet allows tractable inference. We use

the CIBP prior with the nonlinear Gaussian belief network so

each unit can additionally vary its behavior between discrete

and continuous representations. We provide Markov chain Monte

Carlo algorithms for inference in these belief networks and

explore the structures learned on several image data sets.},

note = {arXiv:1001.0160 [stat.ML]},

keywords = {},

pubstate = {published},

tppubtype = {conference}

}
