In machine learning, we often want to evaluate how well a model describes a dataset. In an unsupervised setting, we might use one of two criteria:

- marginal likelihood: the probability of the data, with all parameters and latent variables integrated out (the ratio of two models’ marginal likelihoods gives the Bayes factor)
- held-out likelihood, or the probability of held-out test data under the parameters learned on the training set

Both of these criteria can require computing difficult high-dimensional sums or integrals, which I’ll refer to here as the partition function. In most cases, it’s infeasible to solve these integrals exactly, so we rely on approximation techniques, such as variational inference or sampling.

Often you’ll hear claims that such-and-such an algorithm is an unbiased estimator of the partition function. What does this mean in practice, and why do we care? Our intuitions from other sorts of unbiased estimators can be very misleading here.

Ordinarily, we like unbiased Monte Carlo estimators because we can improve the accuracy by computing more samples. If we have a stochastic estimator \(\hat{\theta}\) of some quantity \(\theta\) which is unbiased and has finite variance, we can average \(N\) samples together and the mean squared error will decrease like \(1/N\). Unfortunately, unless we’re very careful, Monte Carlo estimators of partition functions generally have infinite variance, or at least such ridiculously large variance that it might as well be infinite. In these cases, it doesn’t matter how many samples we compute — we’re not going to get an accurate answer.
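For instance, here’s a minimal sketch of that \(1/N\) behavior for a well-behaved estimator (estimating the mean of an exponential distribution, a toy example of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 1.0  # true mean of an Exp(1) random variable

def mse_of_mean(N, trials=20_000):
    """Empirical MSE of the average of N i.i.d. unbiased samples."""
    samples = rng.exponential(theta, size=(trials, N))
    estimates = samples.mean(axis=1)
    return np.mean((estimates - theta) ** 2)

# Quadrupling N cuts the MSE by roughly 4x (MSE ~ variance / N).
for N in [10, 40, 160]:
    print(N, mse_of_mean(N))
```

The catch, as we’ll see, is that this picture relies on the estimator having finite variance.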

For concreteness, let’s consider estimating a partition function using importance sampling. Suppose we have an undirected graphical model over variables \(x\) with unnormalized distribution \(f(x)\), and we want to estimate the partition function \(Z = \sum_x f(x)\). The importance sampling estimator draws a sample from some tractable distribution \(q(x)\), and then returns the estimate \(f(x) / q(x)\). (We could average over multiple samples, but let’s just use one for simplicity.) This is an unbiased estimator, because

\({\mathbb E}_{q(x)} \left[ \frac{f(x)}{q(x)} \right] = \sum_x q(x) \frac{f(x)}{q(x)} = \sum_x f(x) = Z.\)

This naive importance sampling estimator is hardly ever used in practice because it usually has extremely large, or infinite, variance. To get a sense for how common a problem this is, consider the following pair of probability distributions:
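As a sanity check, here’s a minimal sketch of this estimator on a toy discrete model (the five-state table `f` below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unnormalized distribution f over five discrete states
# (values chosen arbitrarily for illustration).
f = np.array([1.0, 3.0, 0.5, 2.0, 1.5])
Z_true = f.sum()  # 8.0 -- tractable here, but pretend it isn't

# Tractable proposal q: uniform over the five states.
q = np.full(5, 0.2)

# One-sample importance sampling estimator: draw x ~ q, return f(x)/q(x).
x = rng.choice(5, size=100_000, p=q)
estimates = f[x] / q[x]

# Unbiasedness: averaging many one-sample estimates recovers Z.
print(estimates.mean())  # close to 8.0
```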

\(p(x) = 0.999\, N(x; 0, 1) + 0.001\, N(x; 0, 2)\)

\(q(x) = N(x; 0, 1)\)

Clearly, \(p\) and \(q\) represent nearly the same distribution. They should be a piece of cake for any reasonable algorithm. However, if we perform importance sampling with \(p\) as the target distribution and \(q\) as the proposal distribution, the importance weights will have infinite variance. The problem is that even though the tails of both distributions decay very quickly, the importance weights grow faster than the tails decay. In order to guarantee finite variance, you need to worry even about parts of the distribution which are unlikely and should have little impact on the quantity of interest.
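To see where the infinite variance comes from: the variance of the weight \(w = p/q\) involves \(E_q[w^2] = \int p(x)^2/q(x)\, dx\). A quick numerical sketch (reading \(N(x; 0, \sigma)\) as a Gaussian with standard deviation \(\sigma\)) shows the integrand blowing up in the tails, so the integral diverges:

```python
import numpy as np

def normal_pdf(x, sigma):
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def p(x):  # target: the mixture from the text
    return 0.999 * normal_pdf(x, 1.0) + 0.001 * normal_pdf(x, 2.0)

def q(x):  # proposal: standard Gaussian
    return normal_pdf(x, 1.0)

# Var_q[w] = E_q[w^2] - 1, where E_q[w^2] is the integral of p^2/q.
# The integrand grows without bound, so the variance is infinite.
for x in [5.0, 10.0, 15.0, 20.0]:
    print(x, p(x)**2 / q(x))
```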

Does this mean it’s pointless to talk about unbiased estimators of partition functions? I’d say no — instead, we should think of them as biased estimators of the log partition function. Partition functions vary over a lot of orders of magnitude, so it’s more natural to think of them on a log scale. And when we take the log, the problem of occasionally getting extremely large values disappears.

I’ll give two simple arguments why we should view unbiased estimators of partition functions as stochastic lower bounds. First, Markov’s inequality says that for any nonnegative random variable \(X\), \(P(X > a E[X]) \leq 1/a\). In the case of partition function estimators \(\hat{Z}\) (which are usually nonnegative), this implies that \(P(\log \hat{Z} > \log Z + b) \leq e^{-b}\). In other words, overestimates of the true value are exponentially unlikely.
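Here’s a quick simulation of that tail bound on a tractable toy problem of my own construction: target \(f(x) = e^{-x^2/2}\), so \(Z = \sqrt{2\pi}\), with a wider Gaussian proposal so the weights are well behaved:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: unnormalized f(x) = exp(-x^2/2), so Z = sqrt(2*pi).
Z = np.sqrt(2 * np.pi)

# Proposal: zero-mean Gaussian with std 1.5 (wider than the target,
# so the importance weights have finite variance).
sigma = 1.5
x = rng.normal(0.0, sigma, size=200_000)
q = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
Z_hat = np.exp(-x**2 / 2) / q   # one-sample estimates of Z

# Markov's inequality: P(log Z_hat > log Z + b) <= exp(-b).
for b in [0.1, 0.2, 0.3]:
    frac = np.mean(np.log(Z_hat) > np.log(Z) + b)
    print(b, frac, np.exp(-b))
```

In this example the empirical overshoot probabilities sit well below the Markov bound, which is typical: the bound is loose, but it holds for any unbiased nonnegative estimator.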

The second reason to view unbiased estimators as lower bounds comes from applying Jensen’s inequality:

\(E[\log \hat{Z}] \leq \log E[\hat{Z}] = \log Z.\)

So the expected value of \(\log \hat{Z}\) is no larger than the true log partition function. In the case of importance sampling, by expanding out the definitions and rearranging terms, we can go even further and determine the bias of the estimator. If \(q\) is the proposal distribution and \(p\) is the target distribution, the bias (i.e. \(\log Z - E[\log \hat{Z}]\)) is given by the KL divergence \(D(q \| p)\). Interestingly, this is the same bias we get in variational Bayes, using \(q\) as the approximating distribution. I find this a pretty surprising connection, and I haven’t seen it explicitly mentioned anywhere.
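This bias is easy to verify numerically in a toy Gaussian example of my own choosing (again reading \(N(0, \sigma)\) as a std-\(\sigma\) Gaussian): take \(f(x) = e^{-x^2/2}\), so the target is \(p = N(0, 1)\) with \(Z = \sqrt{2\pi}\), and a slightly mismatched proposal \(q = N(0, 1.25)\):

```python
import numpy as np

rng = np.random.default_rng(0)

Z = np.sqrt(2 * np.pi)   # true partition function of f(x) = exp(-x^2/2)

sigma = 1.25             # proposal q = N(0, sigma); target p = N(0, 1)
x = rng.normal(0.0, sigma, size=1_000_000)
q = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
log_Z_hat = -x**2 / 2 - np.log(q)   # log of the estimate f(x)/q(x)

# Closed-form KL divergence D(q || p) between the two Gaussians.
kl = 0.5 * (sigma**2 - 1 - np.log(sigma**2))

bias = np.log(Z) - log_Z_hat.mean()
print(bias, kl)   # the two should agree closely
```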

So if you don’t know anything about a partition function estimator other than that it’s unbiased, you already know how to think about it: it’s like a lower bound on the true value. This is pretty convenient if you’re trying to compute the marginal likelihood of your model. You know your model is almost certainly at least as good as your estimator says.

On the other hand, suppose you’re trying to estimate held-out likelihood for an undirected graphical model. In that case, we care about the partition function because it appears in the denominator of the likelihood. A lower bound on the partition function, therefore, translates into an upper bound on the likelihood. If your estimator is wrong, it can be arbitrarily optimistic. Basically, you have no way of telling if your awesome results are due to your model being good, or your partition function estimator being bad. For this reason, experiments which measure the likelihood of undirected models are held to a very high standard. It’s necessary to do convergence diagnostics, and even then we don’t really have any way of being sure the answer is reasonable.

This, in a nutshell, is what you should think of when you hear a partition function estimator is unbiased.