# Geometric means of distributions

Roger Grosse · Machine Learning

Annealed importance sampling (AIS) [1] is a widely used algorithm for inference in probabilistic models, as well as for computing partition functions. I’m not going to talk about AIS itself here, but rather one aspect of it: geometric means of probability distributions, and how they (mis-)behave.

When we use AIS, we have some (unnormalized) probability distribution $p(x)$ we’re interested in sampling from, such as the posterior of a probabilistic model given observations. We also have a distribution $q(x)$ which is easy to sample from, such as the prior. We need to choose a sequence of unnormalized distributions $f_t(x)$, indexed by a continuous parameter $t \in [0, 1]$, which smoothly interpolates between $q$ and $p$. (I use $f$ because later on, $p$ and $q$ will correspond to directed graphical models, but the intermediate distributions will not.)

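A common choice is to take a weighted geometric mean of the two distributions, the initial distribution $q$ and the target $p$. That is, $f_t(x) = q(x)^{1-t} \, p(x)^{t}$. This is motivated by examples like the following, where we gradually “anneal” from a uniform distribution to a multimodal one:

[Figure: the intermediate distributions $f_t$ as $t$ goes from 0 to 1, annealing from a uniform distribution to a multimodal one.]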

This is the desired behavior: the intermediate distributions look like something in between the initial distribution $q$ and the target $p$.
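The annealing behavior described above can be reproduced on a grid. The following is a minimal sketch, not the code behind the original figure: the helper names `geometric_path` and `log_bimodal`, the grid on $[-5, 5]$, and the two-bump target with centers at $\pm 2$ are all assumptions chosen for illustration.

```python
import math

def geometric_path(log_q, log_p, ts):
    """For each t, return f_t proportional to q^(1-t) * p^t, normalized to sum to 1 on the grid."""
    path = []
    for t in ts:
        log_f = [(1.0 - t) * lq + t * lp for lq, lp in zip(log_q, log_p)]
        m = max(log_f)  # subtract the max before exponentiating, for numerical stability
        f = [math.exp(v - m) for v in log_f]
        total = sum(f)
        path.append([v / total for v in f])
    return path

def log_bimodal(x, centers=(-2.0, 2.0), width=0.5):
    """Unnormalized log density of an equal mixture of two Gaussian bumps (hypothetical target)."""
    return math.log(sum(math.exp(-0.5 * ((x - c) / width) ** 2) for c in centers))

grid = [-5.0 + 0.01 * i for i in range(1001)]  # x in [-5, 5]
log_q = [0.0 for _ in grid]                    # uniform initial distribution
log_p = [log_bimodal(x) for x in grid]         # multimodal target
ts = [0.0, 0.25, 0.5, 0.75, 1.0]
path = geometric_path(log_q, log_p, ts)

for t, f in zip(ts, path):
    print(f"t={t:.2f}  largest grid mass = {max(f):.5f}")
```

Working in log space keeps the interpolation a simple weighted sum, and the printed peak mass grows with $t$ as the uniform distribution sharpens into the two modes.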

However, most of the probabilistic models we work with include latent variables $u$ in addition to $x$, and the distribution $p(x)$ is defined implicitly in terms of the integral $p(x) = \int p(u, x) \, du$. In these cases, computing the geometric mean as we did above can be intractable, since it requires summing or integrating out the latent variables. Another approach would be to take the geometric mean of the full joint distributions over $u$ and $x$. However, this geometric mean might not even be defined if the dimensionality of $u$ differs between the two models.

Instead, we often use the following “doubling” trick introduced by [2]. We introduce an expanded state space which includes two sets of latent variables, $u_q$ and $u_p$, one for each model. We can view $q$ as a distribution over the expanded state space where $x$ depends only on $u_q$, and hence $u_p$ is irrelevant to the model. Similarly, $p$ defines a distribution on this space where $x$ depends only on $u_p$. Then it makes mathematical sense to take the geometric mean in this expanded space.

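For instance, suppose $q$ and $p$ both define directed graphical models, with latent variables $u_q$ and $u_p$ respectively, i.e. they are defined in terms of a factorization such as $p(u, x) = p(u) \, p(x \mid u)$. For simplicity, assume nothing is observed. Then the intermediate distributions would take the following form:

$$f_t(u_q, u_p, x) = q(u_q) \, p(u_p) \, q(x \mid u_q)^{1-t} \, p(x \mid u_p)^{t}$$

Intuitively, $u_q$ and $u_p$ are both drawn from their corresponding priors in each distribution, and we gradually reduce the coupling between $u_q$ and $x$, and increase the coupling between $u_p$ and $x$.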
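As a rough sketch of what this looks like in code, the log of the intermediate distribution in the expanded space is just a weighted sum of the two models' log terms. The function `log_f_t`, the dictionary layout, and both example models below are hypothetical and only meant to show the bookkeeping; the two models deliberately use different priors and likelihoods.

```python
import math

def log_normal(x, mean, var):
    """Log density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_f_t(u_q, u_p, x, t, model_q, model_p):
    """log f_t = log q(u_q) + log p(u_p) + (1 - t) log q(x | u_q) + t log p(x | u_p)."""
    return (model_q["log_prior"](u_q)
            + model_p["log_prior"](u_p)
            + (1.0 - t) * model_q["log_lik"](x, u_q)
            + t * model_p["log_lik"](x, u_p))

# Two hypothetical directed models, each with its own latent variable.
model_q = {"log_prior": lambda u: log_normal(u, 0.0, 1.0),
           "log_lik": lambda x, u: log_normal(x, u, 0.5)}
model_p = {"log_prior": lambda u: log_normal(u, 0.0, 4.0),
           "log_lik": lambda x, u: log_normal(x, 2.0 * u, 1.0)}

for t in (0.0, 0.5, 1.0):
    print(t, log_f_t(0.3, -1.2, 0.7, t, model_q, model_p))
```

At $t = 0$ only the first model's likelihood term contributes, so $x$ is tied to $u_q$ alone; at $t = 1$ it is tied to $u_p$ alone. Both priors are present at every $t$.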

Consider the following model, where $u$ and $x$ are vectors in $\mathbb{R}^D$:

$$u \sim \mathcal{N}(0, rI), \qquad x \mid u \sim \mathcal{N}(u, (1-r)I)$$

In this model, $x$ is always distributed as $\mathcal{N}(0, I)$, and it is coupled with the latent variable $u$ to an extent that depends on the coupling parameter $r$. The two models $q$ and $p$ happen to be identical. This is a toy example, but the problem it will illustrate is something I’ve run into in practice.
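Here is a quick sampling check of those two properties. It assumes a univariate Gaussian instance of the model, $u \sim \mathcal{N}(0, r)$ and $x \mid u \sim \mathcal{N}(u, 1 - r)$, which is one model with the behavior described above; the function names and sample size are hypothetical.

```python
import math
import random

def sample_toy_model(r, n, seed=0):
    """Draw n (u, x) pairs from u ~ N(0, r), x | u ~ N(u, 1 - r) (assumed model)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        u = rng.gauss(0.0, math.sqrt(r))      # gauss takes a standard deviation
        x = rng.gauss(u, math.sqrt(1.0 - r))
        pairs.append((u, x))
    return pairs

def var_x_and_corr(pairs):
    """Return the sample variance of x and the sample correlation of (u, x)."""
    n = len(pairs)
    mean_u = sum(u for u, _ in pairs) / n
    mean_x = sum(x for _, x in pairs) / n
    var_u = sum((u - mean_u) ** 2 for u, _ in pairs) / n
    var_x = sum((x - mean_x) ** 2 for _, x in pairs) / n
    cov = sum((u - mean_u) * (x - mean_x) for u, x in pairs) / n
    return var_x, cov / math.sqrt(var_u * var_x)

for r in (0.1, 0.5, 0.9):
    var_x, corr = var_x_and_corr(sample_toy_model(r, 20000))
    print(f"r={r}: var(x) ~ {var_x:.3f}, corr(u, x) ~ {corr:.3f}")
```

Under this assumed model the variance of $x$ should stay close to 1 for every $r$, while the correlation between $u$ and $x$ should be close to $\sqrt{r}$, so larger $r$ means tighter coupling.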

We might expect that when we take the geometric mean of these distributions, we get the same distribution, or at least something close. Not so. For instance, the latent variables $u_q$ and $u_p$ are independent under either $q$ or $p$, since both are directed models with no evidence. In the univariate case ($D = 1$), here is the joint distribution of $u_q$ and $u_p$:

[Figure: joint distribution of $u_q$ and $u_p$ under either model.]

Counterintuitively, when we take the geometric mean ($t = 1/2$), the two variables become coupled:

[Figure: joint distribution of $u_q$ and $u_p$ under the geometric mean.]

What’s going on? The following figure shows what happens when we take the geometric mean (shown in red) of two conditional distributions over $x$, $q(x \mid u_q)$ and $p(x \mid u_p)$, whose latent values $u_q$ and $u_p$ differ:

[Figure: two conditional distributions over $x$ and their geometric mean, shown in red.]

Essentially, when two distributions disagree with each other, their geometric mean is small everywhere. This means less probability mass is allocated to regions where $u_q$ and $u_p$ are very different. Therefore, the geometric mean causes $u_q$ and $u_p$ to become positively correlated.
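The induced correlation can be computed exactly when everything is Gaussian. The sketch below assumes the univariate toy model $u \sim \mathcal{N}(0, r)$, $x \mid u \sim \mathcal{N}(u, 1 - r)$ for both $q$ and $p$, so that $f_t$ is a zero-mean Gaussian over $(u_q, u_p, x)$ whose precision matrix can be read off the exponent. The function name `latent_correlation` is hypothetical.

```python
import math

def latent_correlation(r, t):
    """Correlation of u_q and u_p under f_t for the assumed univariate Gaussian toy model."""
    s = 1.0 - r
    # Precision (inverse covariance) entries of f_t over (u_q, u_p, x).
    l_qq = 1.0 / r + (1.0 - t) / s
    l_pp = 1.0 / r + t / s
    l_xx = 1.0 / s
    l_qx = -(1.0 - t) / s
    l_px = -t / s
    # Marginalize out x with a Schur complement: 2x2 precision of (u_q, u_p).
    a = l_qq - l_qx * l_qx / l_xx
    d = l_pp - l_px * l_px / l_xx
    b = -l_qx * l_px / l_xx
    # Invert the 2x2 precision to get the covariance of (u_q, u_p).
    det = a * d - b * b
    var_q, var_p, cov = d / det, a / det, -b / det
    return cov / math.sqrt(var_q * var_p)

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  corr(u_q, u_p) = {latent_correlation(0.9, t):.4f}")
```

Under these assumptions the correlation is zero at the endpoints $t = 0$ and $t = 1$, where $f_t$ reduces to one of the original directed models, and positive in between. The off-diagonal precision term $-t(1-t)/(1-r)$ is what couples the two latent variables once $x$ is integrated out.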

Now let’s make the model a bit more complicated by adding the coupling parameter $r$ as a random variable. Let’s give it a uniform prior. Under $q$ and $p$, the distribution over $r$ is uniform, since they are directed models with no evidence.

However, under the geometric mean distribution $f_{1/2}$, the marginal distribution over $r$ is very much non-uniform:

[Figure: marginal distribution over $r$ under the geometric mean.]

(In this figure, I used […].) When the coupling parameter $r$ is large, $u_q$ and $u_p$ are required to agree with each other, whereas when $r$ is small, it doesn’t matter. Therefore, the distribution strongly “prefers” explanations which involve the latent variables as little as possible.
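To see how non-uniform this marginal can get, here is a sketch that integrates out $u_q$, $u_p$, and $x$ analytically at each value of $r$. It assumes the Gaussian toy model $u \sim \mathcal{N}(0, rI)$, $x \mid u \sim \mathcal{N}(u, (1-r)I)$ for both $q$ and $p$, a shared $r$ with a uniform prior, and a hypothetical dimensionality of $D = 10$ (not necessarily the setting used for the figure). The function names are also hypothetical.

```python
import math

def log_mass(r, t, D):
    """Log of the integral of f_t over (u_q, u_p, x) at fixed r, for the assumed toy model."""
    s = 1.0 - r
    c = t * (1.0 - t) / s
    a = 1.0 / r + c
    # Per-dimension determinant of the 3x3 precision: l_xx times det of the Schur complement.
    log_det = math.log(1.0 / s) + math.log(a * a - c * c)
    # Gaussian normalizers contribute -log r - 0.5 log s; the integral adds -0.5 log det.
    # The powers of 2*pi cancel between the two.
    per_dim = -math.log(r) - 0.5 * math.log(s) - 0.5 * log_det
    return D * per_dim

def marginal_over_r(t, D, n_grid=99):
    """Normalized marginal over r on an interior grid, with a uniform prior on r."""
    rs = [(i + 1) / (n_grid + 1) for i in range(n_grid)]
    logs = [log_mass(r, t, D) for r in rs]
    m = max(logs)
    weights = [math.exp(v - m) for v in logs]
    total = sum(weights)
    return rs, [w / total for w in weights]

rs, probs = marginal_over_r(t=0.5, D=10)
for r, p in list(zip(rs, probs))[::14]:
    print(f"r={r:.2f}  mass={p:.5f}")
```

Under these assumptions the per-dimension mass works out to $\sqrt{2(1-r)/(2-r)}$ at $t = 1/2$, which decreases from 1 toward 0 as $r$ grows, and raising it to the power $D$ makes the preference for small $r$ sharper in higher dimensions. At $t = 0$ or $t = 1$ the mass is exactly 1 for every $r$, recovering the uniform marginal.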

In general, geometric means of two complex distributions can have properties very different from either distribution individually. This can lead to strange behaviors that make you suspect there’s a bug in your sampler. I don’t know of any good alternative to the approach given above, but don’t be surprised if geometric means don’t always do what you expect.

[1] Radford Neal. Annealed Importance Sampling. Statistics and Computing, 2001.

[2] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. ICML 2008.