The log marginal likelihood is a central object for Bayesian inference with latent variable models: where are observations, are latent variables, and are parameters. Variational inference tackles this problem by approximating the posterior over with a simpler density . Often this density has a factored structure, for example. The approximating density is fit by maximizing a lower bound on the log marginal likelihood, or “evidence” (hence ELBO = evidence lower bound): The hope is that this will be a tight enough bound that we can use this as a proxy for the marginal likelihood when reasoning about . The ELBO is typically derived in one of two ways: via Jensen’s inequality or by writing down the … Read More
The success of generative modeling in continuous domains has led to a surge of interest in generating discrete data such as molecules, source code, and graphs. However, construction histories for these discrete objects are typically not unique and so generative models must reason about intractably large spaces in order to learn. Additionally, structured discrete domains are often characterized by strict constraints on what constitutes a valid object and generative models must respect these requirements in order to produce useful novel samples. Here, we present a generative model for discrete objects employing a Markov chain where transitions are restricted to a set of local operations that preserve validity.
We consider optimization problems in which the objective requires an inner loop with many steps or is the limit of a sequence of increasingly costly approximations. Meta-learning, training recurrent neural networks, and optimization of the solutions to differential equations are all examples of optimization problems with this character. In such problems, it can be expensive to compute the objective function value and its gradient, but truncating the loop or using less accurate approximations can induce biases that damage the overall solution. We propose randomized telescope (RT) gradient estimators, which represent the objective as the sum of a telescoping series and sample linear combinations of terms to provide cheap unbiased gradient estimates.
The count-min sketch is a time- and memory-efficient randomized data structure that provides a point estimate of the number of times an item has appeared in a data stream. The count-min sketch and related hash-based data structures are ubiquitous in systems that must track frequencies of data such as URLs, IP addresses, and language n-grams. We present a Bayesian view on the count-min sketch, using the same data structure, but providing a posterior distribution over the frequencies that characterizes the uncertainty arising from the hash-based approximation.
After several wonderful years at Harvard, and some fun times at Twitter and Google, I’ve moved to Princeton. I’ll miss all my amazing colleagues at Harvard and MIT, but I’m excited for the unique opportunities Princeton has to offer. I’ve renamed the group from the “Harvard Intelligent Probabilistic Systems” (HIPS) group to the “Laboratory for Intelligent Probabilistic Systems” (LIPS). (I should’ve listened to the advice of not putting the name of the university in the group name…) I’ve moved all the HIPS blog posts over to this new WordPress site, but I will keep the HIPS Github as that is where some well-known projects live, such as Autograd and Spearmint. For new projects, I’ve created a new repository at https://github.com/PrincetonLIPS. … Read More
One approach to AI research is to work directly on applications that matter — say, trying to improve production systems for speech recognition or medical imaging. But most research, even in applied fields like computer vision, is done on highly simplified proxies for the real world. Progress on object recognition benchmarks — from toy-ish ones like MNIST, NORB, and Caltech101, to complex and challenging ones like ImageNet and Pascal VOC — isn’t valuable in its own right, but only insofar as it yields insights that help us design better systems for real applications. So it’s natural to ask: which research results will generalize to new situations?