Geometric means of distributions
Annealed importance sampling [1] is a widely used algorithm for inference in probabilistic models, as well as computing partition functions. I’m not going to talk about AIS itself here, but rather one aspect of it: geometric means of probability distributions, and how they (mis-)behave.
When we use AIS, we have some (unnormalized) probability distribution $p_b(x)$ we’re interested in sampling from, such as the posterior of a probabilistic model given observations. We also have a distribution $p_a(x)$ which is easy to sample from, such as the prior. We need to choose a sequence of unnormalized distributions $f_\beta(x)$, indexed by a continuous parameter $\beta \in [0, 1]$, which smoothly interpolate between $p_a$ (at $\beta = 0$) and $p_b$ (at $\beta = 1$). (I use the letter $p$ only for the endpoints because later on, $p_a$ and $p_b$ will correspond to directed graphical models, but the intermediate distributions will not.)
A common choice is to take a weighted geometric mean of the two distributions. That is, $f_\beta(x) = p_a(x)^{1-\beta}\, p_b(x)^{\beta}$. This is motivated by examples where we gradually “anneal” from a uniform distribution to a multimodal one, and it gives the desired behavior: the intermediate distributions look like something in between $p_a$ and $p_b$.
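Here is a small numerical sketch of this construction (the uniform and bimodal densities below are arbitrary choices of mine, evaluated on a grid): as $\beta$ increases, the intermediate distributions gradually shift their mass toward the two modes.

```python
import numpy as np

# Grid on which to evaluate the densities.
xs = np.linspace(-6.0, 6.0, 1201)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# p_a: uniform over the grid; p_b: a bimodal mixture of Gaussians.
p_a = np.full_like(xs, 1.0 / (xs[-1] - xs[0]))
p_b = 0.5 * gauss(xs, -2.0, 0.5) + 0.5 * gauss(xs, 2.0, 0.5)

# Weighted geometric mean f_beta(x) = p_a(x)^(1 - beta) * p_b(x)^beta,
# renormalized on the grid so the shapes are comparable across beta.
near_modes = np.abs(np.abs(xs) - 2.0) < 1.0
for beta in [0.0, 0.25, 0.5, 0.75, 1.0]:
    f = p_a ** (1.0 - beta) * p_b ** beta
    f /= f.sum() * dx
    print(f"beta={beta:.2f}: mass near the two modes = {f[near_modes].sum() * dx:.3f}")
```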
However, most of the probabilistic models we work with include latent variables $z$ in addition to $x$, and the distribution $p(x)$ is defined implicitly in terms of the integral $p(x) = \int p(x, z)\, dz$. In these cases, computing the geometric mean as we did above can be intractable, since it requires summing/integrating out the latent variables. Another approach would be to take the geometric mean of the full joint distribution over $x$ and $z$. However, this geometric mean might not even be defined, if the dimensionality of $z$ in the two models is different.
Instead, we often use the following “doubling” trick introduced by [2]. We introduce an expanded state space which includes two sets of latent variables, $z_a$ and $z_b$. We can view $p_a$ as a distribution $p_a(x, z_a, z_b)$ over the expanded state space where $x$ depends only on $z_a$, and hence $z_b$ is irrelevant to the model. Similarly, $p_b$ defines a distribution on this space where $x$ depends only on $z_b$. Then it makes mathematical sense to take the geometric mean in this expanded space.
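Written out for models of the kind considered next (a prior over the model’s own latent variables times a conditional for $x$), one way to define the two extended distributions is

$$p_a(x, z_a, z_b) = p_a(z_a)\, p_a(x \mid z_a)\, p_b(z_b), \qquad p_b(x, z_a, z_b) = p_a(z_a)\, p_b(z_b)\, p_b(x \mid z_b).$$

Each model simply borrows the other model’s prior for the latent variables it doesn’t use, which leaves its marginal distribution over $x$ (and its normalizing constant) unchanged.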
For instance, suppose $p_a$ and $p_b$ both define directed graphical models, i.e. they are defined in terms of a factorization $p(z)\, p(x \mid z)$. For simplicity, assume nothing is observed. Then the intermediate distributions would take the following form:

$$f_\beta(x, z_a, z_b) = p_a(x, z_a, z_b)^{1-\beta}\, p_b(x, z_a, z_b)^{\beta} = p_a(z_a)\, p_b(z_b)\, p_a(x \mid z_a)^{1-\beta}\, p_b(x \mid z_b)^{\beta}.$$
Intuitively, $z_a$ and $z_b$ are both drawn from their corresponding priors in each distribution, and as $\beta$ goes from 0 to 1 we gradually reduce the coupling between $x$ and $z_a$ and increase the coupling between $x$ and $z_b$.
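In code, the unnormalized log-density of such an intermediate distribution is just a weighted combination of the two models’ terms. A minimal sketch, assuming each model exposes log-prior and log-conditional functions (the function names here are hypothetical):

```python
def intermediate_log_density(x, z_a, z_b, beta,
                             log_prior_a, log_lik_a,
                             log_prior_b, log_lik_b):
    """Unnormalized log f_beta(x, z_a, z_b) on the doubled state space.

    log_prior_a(z_a) and log_lik_a(x, z_a) define model a; similarly for model b.
    """
    return (log_prior_a(z_a) + log_prior_b(z_b)
            + (1.0 - beta) * log_lik_a(x, z_a)
            + beta * log_lik_b(x, z_b))
```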
Consider the following model, where $x$ and $z$ are vectors in $\mathbb{R}^D$:

$$z \sim \mathcal{N}(0, I), \qquad x \mid z \sim \mathcal{N}\!\left(\rho z,\ (1 - \rho^2) I\right), \qquad \rho \in [0, 1].$$

In this model, $x$ is always distributed as $\mathcal{N}(0, I)$, and it is coupled with the latent variable $z$ to an extent that depends on the coupling parameter $\rho$. The two models $p_a$ and $p_b$ happen to be identical. This is a toy example, but the problem it will illustrate is something I’ve run into in practice.
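Under this parameterization, a quick simulation (with values of $\rho$ chosen arbitrarily for illustration) confirms that the marginal over $x$ doesn’t depend on $\rho$, while the strength of the coupling between $x$ and $z$ does:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 5, 200_000

for rho in [0.1, 0.5, 0.9]:
    z = rng.standard_normal((n, D))
    x = rho * z + np.sqrt(1.0 - rho ** 2) * rng.standard_normal((n, D))
    # Marginally, x should look like N(0, I) regardless of rho,
    # while E[x * z] per coordinate should be close to rho.
    print(f"rho={rho}: mean(x) = {x.mean():+.3f}, var(x) = {x.var():.3f}, "
          f"E[x*z] = {np.mean(x * z):.3f}")
```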
We might expect that when we take the geometric mean of these distributions, we get the same distribution, or at least something close. Not so. For instance, the latent variables $z_a$ and $z_b$ are independent under either $p_a$ or $p_b$, since both are directed models with no evidence. Counterintuitively, when we take the geometric mean ($\beta = 1/2$), the two variables become coupled: in the univariate case ($D = 1$), the joint distribution of $z_a$ and $z_b$ picks up a clear positive correlation.
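Since everything in the toy model is Gaussian, this coupling can be computed exactly. The sketch below builds the precision matrix of $f_{1/2}(x, z_a, z_b)$ in the univariate case (with $\rho = 0.9$, an arbitrary value for illustration) and reads off the correlation between $z_a$ and $z_b$:

```python
import numpy as np

rho = 0.9                     # coupling strength (arbitrary, for illustration)
tau = 0.5 / (1.0 - rho ** 2)  # precision contributed by each half-weighted conditional at beta = 1/2

# Precision matrix of the zero-mean Gaussian f_{1/2}(x, z_a, z_b), variable order (x, z_a, z_b).
# The priors on z_a and z_b contribute 1 on the diagonal; each term
# 0.5 * log N(x; rho * z, 1 - rho^2) adds tau to the x entry, rho^2 * tau to the z entry,
# and a cross term -rho * tau.
Lam = np.array([
    [2.0 * tau,   -rho * tau,              -rho * tau],
    [-rho * tau,  1.0 + rho ** 2 * tau,     0.0],
    [-rho * tau,  0.0,                      1.0 + rho ** 2 * tau],
])

Sigma = np.linalg.inv(Lam)    # covariance of (x, z_a, z_b) under f_{1/2}
corr = Sigma[1, 2] / np.sqrt(Sigma[1, 1] * Sigma[2, 2])
print(f"corr(z_a, z_b) under the geometric mean: {corr:.3f}")
```

For this value of $\rho$ the correlation comes out around 0.5, even though it is exactly zero under $p_a$ and $p_b$.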
What’s going on? Consider what happens when we take the geometric mean of $p_a(x \mid z_a)$ and $p_b(x \mid z_b)$, with $z_a$ and $z_b$ fixed to two values that disagree.
Essentially, when two distributions disagree with each other, their geometric mean is small everywhere. This means less probability mass is allocated to regions where $z_a$ and $z_b$ are very different. Therefore, the geometric mean causes $z_a$ and $z_b$ to become positively correlated.
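To make this concrete, take two univariate Gaussians with the same variance $\sigma^2$ but different means. Their equal-weight geometric mean works out to

$$\mathcal{N}(x; \mu_a, \sigma^2)^{1/2}\, \mathcal{N}(x; \mu_b, \sigma^2)^{1/2} = \exp\!\left(-\frac{(\mu_a - \mu_b)^2}{8\sigma^2}\right) \mathcal{N}\!\left(x; \frac{\mu_a + \mu_b}{2}, \sigma^2\right),$$

a Gaussian sitting halfway between the two, but with total mass that shrinks exponentially as the means move apart. In the toy model the two means are $\rho z_a$ and $\rho z_b$, so configurations where $z_a$ and $z_b$ disagree pay exactly this penalty.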
Now let’s make the model a bit more complicated by adding the coupling parameter $\rho$ as a random variable. Let’s give it a uniform prior over $[0, 1]$. Under $p_a$ and $p_b$, the distribution over $\rho$ is uniform, since they are directed models with no evidence.
However, under the geometric mean distribution $f_{1/2}$, the marginal distribution over $\rho$ is very much non-uniform. When the coupling parameter is large, $z_a$ and $z_b$ are required to agree with each other (through $x$), whereas when $\rho$ is small, it doesn’t matter. Therefore, the distribution strongly “prefers” explanations which involve the latent variables as little as possible.
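Here is a rough numerical check under the toy model above, with the simplification that a single coupling parameter $\rho$ is shared by both halves of the doubled model rather than itself being doubled. It computes the unnormalized marginal over $\rho$ under $f_{1/2}$ by integrating out $x$, $z_a$, and $z_b$ in closed form:

```python
import numpy as np

def log_marginal_over_rho(rho, D=1, beta=0.5):
    """Unnormalized log of the integral of f_beta(x, z_a, z_b | rho) over x, z_a, z_b
    for the toy model, assuming a single rho shared by both halves."""
    s2 = 1.0 - rho ** 2
    tau_a, tau_b = (1.0 - beta) / s2, beta / s2   # precisions of the weighted conditionals
    # Precision matrix of the Gaussian over (x, z_a, z_b), per dimension.
    Lam = np.array([
        [tau_a + tau_b, -rho * tau_a,           -rho * tau_b],
        [-rho * tau_a,   1.0 + rho ** 2 * tau_a, 0.0],
        [-rho * tau_b,   0.0,                    1.0 + rho ** 2 * tau_b],
    ])
    # Gaussian integral plus the rho-dependent part of the conditionals' normalizers.
    log_int = 1.5 * np.log(2.0 * np.pi) - 0.5 * np.linalg.slogdet(Lam)[1]
    log_norm = -0.5 * np.log(2.0 * np.pi * s2)
    return D * (log_norm + log_int)

# Relative marginal density of rho under f_{1/2}, compared to rho = 0, for D = 10.
for rho in [0.0, 0.3, 0.6, 0.9, 0.99]:
    rel = np.exp(log_marginal_over_rho(rho, D=10) - log_marginal_over_rho(0.0, D=10))
    print(f"rho={rho:.2f}: relative marginal density {rel:.2e}")
```

Already at $D = 10$ the preference for small $\rho$ is dramatic, and it only becomes more extreme as the dimensionality grows.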
In general, geometric means of two complex distributions can have properties very different from either distribution individually. This can lead to strange behaviors that make you suspect there’s a bug in your sampler. I don’t know of any good alternative to the approach given above, but don’t be surprised if geometric means don’t always do what you expect.
[1] Radford Neal. Annealed Importance Sampling. Statistics and Computing, 2001.
[2] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. ICML 2008.