Which research results will generalize?

One approach to AI research is to work directly on applications that matter — say, trying to improve production systems for speech recognition or medical imaging. But most research, even in applied fields like computer vision, is done on highly simplified proxies for the real world. Progress on object recognition benchmarks — from toy-ish ones like MNIST, NORB, and Caltech101, to complex and challenging ones like ImageNet and Pascal VOC — isn’t valuable in its own right, but only insofar as it yields insights that help us design better systems for real applications.

So it’s natural to ask: which research results will generalize to new situations?

Continue reading “Which research results will generalize?”

Prior knowledge and overfitting

When we talk about priors and regularization, we often motivate them in terms of “incorporating knowledge” or “preventing overfitting.” In a sense, the two are equivalent: any prior or regularizer must favor certain explanations relative to others, so favoring one explanation is equivalent to punishing others. But I’ll argue that these are two very different phenomena, and it’s useful to know which one is going on. Continue reading “Prior knowledge and overfitting”

Testing MCMC code, part 2: integration tests

This is the second of two posts based on a testing tutorial I’m writing with David Duvenaud.

In my last post, I talked about checking the MCMC updates using unit tests. Most of the mistakes I’ve caught in my own code were ones I caught with unit tests. (Naturally, I have no idea about the ones I haven’t caught.) But no matter how thoroughly we unit test, there are still subtle bugs that slip through the cracks. Integration testing is a more global approach, and tests the overall behavior of the software, which depends on the interaction of multiple components. Continue reading “Testing MCMC code, part 2: integration tests”

Testing MCMC code, part 1: unit tests

This post is taken from a tutorial I am writing with David Duvenaud.


When you write a nontrivial piece of software, how often do you get it completely correct on the first try?  When you implement a machine learning algorithm, how thorough are your tests?  If your answers are “rarely” and “not very,” stop and think about the implications.

There’s a large literature on testing the convergence of optimization algorithms and MCMC samplers, but I want to talk about a more basic problem here: how to test if your code correctly implements the mathematical specification of an algorithm. Continue reading “Testing MCMC code, part 1: unit tests”

Introspection in AI

I’ve recently come across a fascinating blog post by Cambridge mathematician Tim Gowers. He and computational linguist Mohan Ganesalingam built a sort of automated mathematician which does the kind of “routine” mathematical proofs that mathematicians can do without backtracking. Their system was based on a formal theory of the semantics of mathematical language, together with introspection into how they solved problems. In other words, they worked through lots of simple examples and checked that their AI could solve the problems in a way that was cognitively plausible. The goal wasn’t to build a useful system (standard theorem provers are way more powerful), but to provide insight into our problem solving process. This post reminded me that, while our field has long moved away from this style of research, I think there’s still a lot to be gained from it. Continue reading “Introspection in AI”

Fisher information


I first heard about Fisher information in a statistics class, where it was given in terms of the following formulas, which I still find a bit mysterious and hard to reason about:

{\bf F}_\theta &= {\mathbb E}_x[\nabla_\theta \log p(x;\theta) (\nabla_\theta \log p(x;\theta))^T] \\
&= {\rm Cov}_x[ \nabla_\theta \log p(x;\theta) ] \\
&= {\mathbb E}_x[ -\nabla^2_\theta \log p(x; \theta) ].

It was motivated in terms of computing confidence intervals for your maximum likelihood estimates. But this sounds a bit limited, especially in machine learning, where we’re trying to make predictions, not present someone with a set of parameters. It doesn’t really explain why Fisher information seems so ubiquitous in our field: natural gradient, Fisher kernels, Jeffreys priors, and so on.

This is how Fisher information is generally presented in machine learning textbooks. But I would choose a different starting point: Fisher information is the second derivative of KL divergence. Continue reading “Fisher information”

Geometric means of distributions

Annealed importance sampling [1] is a widely used algorithm for inference in probabilistic models, as well as computing partition functions. I’m not going to talk about AIS itself here, but rather one aspect of it: geometric means of probability distributions, and how they (mis-)behave.

Continue reading “Geometric means of distributions”

What is representation learning?

In my last post, I argued that a major distinction in machine learning is between predictive learning and representation learning. Now I’ll take a stab at summarizing what representation learning is about. Or, at least, what I think of as the first principal component of representation learning. Continue reading “What is representation learning?”

Predictive learning vs. representation learning

When you take a machine learning class, there’s a good chance it’s divided into a unit on supervised learning and a unit on unsupervised learning. We certainly care about this distinction for a practical reason: often there’s orders of magnitude more data available if we don’t need to collect ground-truth labels. But we also tend to think it matters for more fundamental reasons. In particular, the following are some common intuitions: Continue reading “Predictive learning vs. representation learning”

Unbiased estimators of partition functions are basically lower bounds

In machine learning, we often want to evaluate how well a model describes a dataset. In an unsupervised setting, we might use one of two criteria:

  1. marginal likelihood, or Bayes factor: the probability of the data, with all parameters and latent variables integrated out
  2. held-out likelihood, or the probability of held-out test data under the parameters learned on the training set

Both of these criteria can require computing difficult high-dimensional sums or integrals, which I’ll refer to here as the partition function. In most cases, it’s infeasible to solve these integrals exactly, so we rely on approximation techniques, such as variational inference or sampling.

Often you’ll hear claims that such-and-such an algorithm is an unbiased estimator of the partition function. What does this mean in practice, and why do we care?  Our intuitions from other sorts of unbiased estimators can be very misleading here. Continue reading “Unbiased estimators of partition functions are basically lower bounds”