The success of generative modeling in continuous domains has led to a surge of interest in generating discrete data such as molecules, source code, and graphs. However, construction histories for these discrete objects are typically not unique and so generative models must reason about intractably large spaces in order to learn. Additionally, structured discrete domains are often characterized by strict constraints on what constitutes a valid object and generative models must respect these requirements in order to produce useful novel samples. Here, we present a generative model for discrete objects employing a Markov chain where transitions are restricted to a set of local operations that preserve validity.
We consider optimization problems in which the objective requires an inner loop with many steps or is the limit of a sequence of increasingly costly approximations. Meta-learning, training recurrent neural networks, and optimization of the solutions to differential equations are all examples of optimization problems with this character. In such problems, it can be expensive to compute the objective function value and its gradient, but truncating the loop or using less accurate approximations can induce biases that damage the overall solution. We propose randomized telescope (RT) gradient estimators, which represent the objective as the sum of a telescoping series and sample linear combinations of terms to provide cheap unbiased gradient estimates.
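To make the telescoping idea concrete, here is a minimal sketch of a single-sample estimator for a scalar quantity, assuming a finite horizon H. The names `single_sample_rt`, `exact_expectation`, and `partial_geometric_sum` are hypothetical and chosen for illustration, and the toy sequence stands in for an expensive inner loop. The same construction applies to gradients, since differentiation is linear.

```python
import random


def single_sample_rt(approx, probs, rng):
    """One draw of a single-sample randomized telescope estimate of approx(H).

    approx(n) is a hypothetical callable returning the n-th approximation
    L_n for n = 1..H, where H = len(probs); L_0 is taken to be 0.
    probs[n - 1] is the probability of sampling term n; the entries must be
    positive and sum to 1.
    """
    horizon = len(probs)
    n = rng.choices(range(1, horizon + 1), weights=probs, k=1)[0]
    previous = approx(n - 1) if n > 1 else 0.0
    delta = approx(n) - previous   # telescoping term: L_n - L_{n-1}
    return delta / probs[n - 1]    # reweighting keeps the estimate unbiased


def exact_expectation(approx, probs):
    """Enumerate every outcome to compute the estimator's expectation."""
    total = 0.0
    for n, q in enumerate(probs, start=1):
        previous = approx(n - 1) if n > 1 else 0.0
        total += q * ((approx(n) - previous) / q)
    return total


def partial_geometric_sum(n):
    """Toy approximation sequence (an assumption): L_n = 1 - 2**(-n)."""
    return 1.0 - 0.5 ** n


HORIZON = 10
uniform = [1.0 / HORIZON] * HORIZON
rng = random.Random(0)
draws = [single_sample_rt(partial_geometric_sum, uniform, rng) for _ in range(5)]
```

Each draw evaluates only `approx(n)` and `approx(n - 1)` for the sampled `n`, yet its expectation equals `approx(HORIZON)`. Choosing sampling probabilities roughly proportional to the size of the terms lowers the variance; in this toy sequence, probabilities proportional to `0.5 ** n` make every draw identical.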
One approach to AI research is to work directly on applications that matter — say, trying to improve production systems for speech recognition or medical imaging. But most research, even in applied fields like computer vision, is done on highly simplified proxies for the real world. Progress on object recognition benchmarks — from toy-ish ones like MNIST, NORB, and Caltech101, to complex and challenging ones like ImageNet and Pascal VOC — isn’t valuable in its own right, but only insofar as it yields insights that help us design better systems for real applications. So it’s natural to ask: which research results will generalize to new situations?
When we talk about priors and regularization, we often motivate them in terms of “incorporating knowledge” or “preventing overfitting.” In a sense, the two are equivalent: any prior or regularizer must favor certain explanations relative to others, so favoring one explanation is equivalent to punishing others. But I’ll argue that these are two very different phenomena, and it’s useful to know which one is going on.
In this post, I’ll summarize one of my favorite papers from ICML 2013: Fast Dropout Training, by Sida Wang and Christopher Manning. This paper derives an analytic approximation to dropout, a randomized regularization method recently proposed for training deep nets that has allowed big improvements in predictive accuracy. Their approximation gives a roughly 10-times speedup under certain conditions. Much more interestingly, the authors also show strong connections to existing regularization methods, shedding light on why dropout works so well.
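As a rough illustration of what an analytic approximation to dropout can look like, the sketch below computes the mean and variance of a unit's dropout-perturbed input sum in closed form, and checks them against brute-force enumeration of all masks. It assumes independent Bernoulli keep-masks on the inputs; the function names are hypothetical and this is not the authors' code. A Gaussian with these moments can then stand in for sampling many masks.

```python
import itertools


def dropout_preactivation_moments(weights, inputs, keep_prob):
    """Closed-form mean and variance of s = sum_i w_i * x_i * z_i,
    where each z_i is an independent Bernoulli(keep_prob) mask (an assumption).
    """
    contributions = [w * x for w, x in zip(weights, inputs)]
    mean = keep_prob * sum(contributions)
    variance = keep_prob * (1.0 - keep_prob) * sum(c * c for c in contributions)
    return mean, variance


def enumerated_moments(weights, inputs, keep_prob):
    """Brute-force check: weight every one of the 2**d masks by its probability."""
    contributions = [w * x for w, x in zip(weights, inputs)]
    mean = 0.0
    second_moment = 0.0
    for mask in itertools.product([0, 1], repeat=len(contributions)):
        kept = sum(mask)
        prob = keep_prob ** kept * (1.0 - keep_prob) ** (len(mask) - kept)
        s = sum(c * z for c, z in zip(contributions, mask))
        mean += prob * s
        second_moment += prob * s * s
    return mean, second_moment - mean * mean


weights = [0.5, -1.0, 2.0, 0.25]
inputs = [1.0, 3.0, -0.5, 4.0]
analytic = dropout_preactivation_moments(weights, inputs, keep_prob=0.5)
brute_force = enumerated_moments(weights, inputs, keep_prob=0.5)
```

The closed form costs one pass over the inputs, while enumeration grows as 2^d and sampling needs many masks, which is where a speedup can come from.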
This post is taken from a tutorial I am writing with David Duvenaud.

Overview

When you write a nontrivial piece of software, how often do you get it completely correct on the first try? When you implement a machine learning algorithm, how thorough are your tests? If your answers are “rarely” and “not very,” stop and think about the implications. There’s a large literature on testing the convergence of optimization algorithms and MCMC samplers, but I want to talk about a more basic problem here: how to test if your code correctly implements the mathematical specification of an algorithm.
I’ve recently come across a fascinating blog post by Cambridge mathematician Tim Gowers. He and computational linguist Mohan Ganesalingam built a sort of automated mathematician which does the kind of “routine” mathematical proofs that mathematicians can do without backtracking. Their system was based on a formal theory of the semantics of mathematical language, together with introspection into how they solved problems. In other words, they worked through lots of simple examples and checked that their AI could solve the problems in a way that was cognitively plausible. The goal wasn’t to build a useful system (standard theorem provers are way more powerful), but to provide insight into our problem solving process. This post reminded me that, while our field has …
I often have a hard time understanding the terminology in machine learning, even after almost three years in the field. For example, what is a Deep Belief Network? I attended a whole summer school on Deep Learning, but I’m still not quite sure. I decided to take a leap of faith and assume this is not just because the Deep Belief Networks in my brain are not functioning properly (although I am sure this is a factor). So, I created a Machine Learning Glossary to try to define some of these terms. The glossary can be found here. I have tried to write in an unpretentious style, defining things systematically and leaving no “exercises to the reader”. I also have …
Suppose we are modeling a spatial process (for instance, the amount of rainfall around the world, the distribution of natural resources, or the population density of an endangered species). We’ve measured the latent function $z$ at some locations $x_1, \ldots, x_n$, and we’d like to predict the function’s value at some new location $x_0$. Kriging is a technique for extrapolating our measurements to arbitrary locations. For an in-depth discussion, see Cressie and Wikle (2011). Here I derive Kriging in a simplified case. I will assume that $z$ is an intrinsically stationary process. In other words, there exists some semivariogram $\gamma$ such that $\mathrm{Var}[z(x + h) - z(x)] = 2\gamma(h)$. Furthermore, I will assume that the process is isotropic (i.e., that $\gamma(h)$ is a function only of $\|h\|$). As Andy described here, the existence …
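The prediction step can be sketched in code. What follows is the standard ordinary kriging linear system for an isotropic semivariogram, which may differ in detail from the simplified derivation in the post. The function names and the linear semivariogram are assumptions made for illustration. The weights are constrained to sum to one, enforced through a Lagrange multiplier appended to the system.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def kriging_weights(locations, new_location, semivariogram):
    """Ordinary kriging weights for predicting at new_location.

    locations are coordinate tuples; semivariogram maps a distance to gamma.
    """
    n = len(locations)
    # Semivariogram between every pair of measured locations, bordered by
    # ones for the sum-to-one constraint.
    system = [
        [semivariogram(math.dist(a, b)) for b in locations] + [1.0]
        for a in locations
    ]
    system.append([1.0] * n + [0.0])
    rhs = [semivariogram(math.dist(a, new_location)) for a in locations] + [1.0]
    solution = solve_linear_system(system, rhs)
    return solution[:n]  # drop the Lagrange multiplier


def kriging_predict(locations, values, new_location, semivariogram):
    """Predicted value: a weighted combination of the measured values."""
    weights = kriging_weights(locations, new_location, semivariogram)
    return sum(w * v for w, v in zip(weights, values))


def linear_semivariogram(distance):
    """Illustrative choice (an assumption): gamma(h) = h."""
    return distance


locations = [(0.0,), (1.0,), (3.0,)]
values = [2.0, 4.0, 1.0]
prediction = kriging_predict(locations, values, (2.0,), linear_semivariogram)
```

Because this semivariogram is zero at distance zero, predicting at a measured location reproduces the measured value there.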
We’re just about to hit conference season, so I thought I would post a public service announcement identifying various upcoming events for folks into machine learning and Bayesian modeling.

International Conference on Artificial Intelligence and Statistics (AISTATS) in Scottsdale, AZ: April 29 – May 1, 2013
New England Machine Learning Day at Microsoft Research New England: May 1, 2013
First International Conference on Learning Representations (ICLR) in Scottsdale, AZ: May 2-4, 2013
Conference on Bayesian Nonparametrics, Amsterdam: June 10-14, 2013
Conference on Learning Theory (COLT), Princeton, NJ: June 12-14, 2013
International Conference on Machine Learning (ICML), Atlanta, GA: June 16-21, 2013
Conference on Uncertainty in Artificial Intelligence (UAI), Bellevue, WA: July 11-15, 2013
Joint Statistical Meetings (JSM), Montreal: August 3-8, …