Variational Inference (part 1)
I will dedicate the next few posts to variational inference methods as a way to organize my own understanding – this first one will be pretty basic.
The goal of variational inference is to approximate an intractable probability distribution, $p$, with a tractable one, $q$, in a way that makes them as ‘close’ as possible. Let’s unpack that statement a bit.
- Intractable: a motivating example is the posterior distribution of a Bayesian model, i.e. given some observations $x$ and some model $p(x \mid \theta)$ parameterized by $\theta = (\theta_1, \dots, \theta_d)$, we often want to evaluate the distribution over parameters
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta')\, p(\theta')\, d\theta'}.$$
For a lot of interesting models this distribution is intractable to deal with because of the integral in the denominator. We can evaluate the posterior up to a constant, but we can’t compute the normalization constant. Applying variational inference to posterior distributions is sometimes called variational Bayes.
- A tractable posterior distribution is one for which we can evaluate the integral (and therefore take expectations with it). One way to achieve this is by making each component $\theta_i$ of $\theta$ independent:
$$q(\theta; \lambda) = \prod_i q_i(\theta_i; \lambda_i)$$
for some marginal distributions $q_i$ parameterized by $\lambda_i$. This is called the ‘mean field’ approximation.
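- We can make two distributions ‘close’ in the Kullback-Leibler sense:
$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_q\left[\log \frac{q(\theta)}{p(\theta \mid x)}\right] = \int q(\theta) \log \frac{q(\theta)}{p(\theta \mid x)}\, d\theta.$$
Notice that we chose $\mathrm{KL}(q \,\|\, p)$ and not $\mathrm{KL}(p \,\|\, q)$ because of the intractability: we’re assuming you cannot evaluate expectations under the posterior, $\mathbb{E}_p[\cdot]$.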
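To make the asymmetry of the KL divergence concrete, here is a minimal sketch for discrete distributions. The vectors `p` and `q` are made-up illustrative numbers (not tied to any model in this post), and `kl_divergence` is a hypothetical helper written for this example.

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) = sum_i q_i * log(q_i / p_i) for discrete distributions."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    support = q > 0  # terms with q_i = 0 contribute nothing by convention
    return float(np.sum(q[support] * np.log(q[support] / p[support])))

# Illustrative numbers only: a bimodal "target" p and a candidate q
# that concentrates most of its mass on one of the modes.
p = np.array([0.45, 0.10, 0.45])
q = np.array([0.80, 0.15, 0.05])

kl_qp = kl_divergence(q, p)  # expectation taken under q (the variational choice)
kl_pq = kl_divergence(p, q)  # expectation taken under p (needs access to p)

print("KL(q || p) =", kl_qp)
print("KL(p || q) =", kl_pq)
```

The two orderings give different numbers, and only the first one is an expectation under the distribution we get to choose.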
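In order to minimize the KL divergence $\mathrm{KL}(q \,\|\, p)$, note the following decomposition:
$$\begin{aligned}
\mathrm{KL}(q \,\|\, p) &= \mathbb{E}_q[\log q(\theta) - \log p(\theta \mid x)] \\
&= \mathbb{E}_q[\log q(\theta) - \log p(\theta, x)] + \log p(x),
\end{aligned}$$
or, rearranged,
$$\log p(x) = \mathrm{KL}(q \,\|\, p) + \mathbb{E}_q[\log p(\theta, x) - \log q(\theta)],$$
where $\log p(x)$ falls out of the expectation with respect to $q(\theta)$. This is convenient because $\log p(x)$ (the log evidence) is going to be constant with respect to the variational distribution $q$ and the model parameters $\theta$.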
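And because KL divergence is nonnegative, the second term, $\mathcal{L}(q) = \mathbb{E}_q[\log p(\theta, x) - \log q(\theta)]$, is a lower bound for $\log p(x)$, also known as the evidence lower bound (ELBO). Since $\log p(x)$ does not depend on $q$, in order to minimize the first term, $\mathrm{KL}(q \,\|\, p)$, it suffices to maximize the second.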
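The decomposition can be checked numerically on a toy problem where everything is computable. The sketch below assumes a parameter $\theta$ that takes only three values, with a made-up prior and likelihood (illustrative numbers, not from the post), so the evidence is a plain sum.

```python
import numpy as np

# Toy setup (illustrative numbers): theta takes three values.
prior = np.array([0.5, 0.3, 0.2])          # p(theta)
likelihood = np.array([0.10, 0.40, 0.70])  # p(x | theta) at the observed x

joint = prior * likelihood                 # p(theta, x)
log_evidence = float(np.log(joint.sum()))  # log p(x), tractable only because theta is tiny
posterior = joint / joint.sum()            # p(theta | x)

def elbo(q):
    """E_q[log p(theta, x) - log q(theta)]; needs only the joint."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

def kl(q, p):
    """KL(q || p) for strictly positive discrete distributions."""
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.2, 0.5, 0.3])  # an arbitrary variational distribution

print("log p(x)        =", log_evidence)
print("KL(q||p) + ELBO =", kl(q, posterior) + elbo(q))
print("ELBO at q = posterior:", elbo(posterior))
```

Note that `elbo` never touches the normalization constant, which is the whole point: it can be evaluated when the posterior cannot.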
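One way to optimize over the choice of $q(\theta) = \prod_i q_i(\theta_i)$ is to consider the ELBO with respect to some $q_j(\theta_j)$, separating it from the expectation with respect to all other variables, $\mathbb{E}_{-j}[\cdot]$ (note that $\mathbb{E}_{-j}$ is the integral over all $\theta_i$ with $i \neq j$, weighted by $\prod_{i \neq j} q_i(\theta_i)$):
$$\begin{aligned}
\mathcal{L}(q_j) &= \int q_j(\theta_j)\, \mathbb{E}_{-j}[\log p(\theta, x)]\, d\theta_j - \int q_j(\theta_j) \log q_j(\theta_j)\, d\theta_j + \text{const} \\
&= -\mathrm{KL}(q_j \,\|\, \tilde{p}_j) + \text{const},
\end{aligned}$$
where
$$\log \tilde{p}_j(\theta_j) = \mathbb{E}_{-j}[\log p(\theta, x)] + \text{const}.$$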
This motivates a particular updating scheme: iterate over the marginals $q_j$, maximizing the ELBO at each step with $q_j(\theta_j) \propto \exp\left(\mathbb{E}_{-j}[\log p(\theta, x)]\right)$, and repeat.
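Here is a minimal sketch of that coordinate update on the smallest possible example: two discrete variables whose joint is a small table. The table values and the helper names (`normalize_log`, `elbo`, `cavi`) are assumptions made for illustration; the update itself is just the exponentiated expected log joint, renormalized.

```python
import numpy as np

def normalize_log(log_w):
    """Exponentiate and normalize log-weights in a numerically stable way."""
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def elbo(log_joint, q1, q2):
    """E_q[log p(theta, x)] plus the entropy of q, for q = q1 * q2."""
    expected_log_joint = q1 @ log_joint @ q2
    entropy = -np.sum(q1 * np.log(q1)) - np.sum(q2 * np.log(q2))
    return float(expected_log_joint + entropy)

def cavi(log_joint, n_iters=20):
    """Coordinate ascent over two mean-field factors of a discrete joint."""
    n1, n2 = log_joint.shape
    q1 = np.full(n1, 1.0 / n1)  # start from uniform factors
    q2 = np.full(n2, 1.0 / n2)
    history = [elbo(log_joint, q1, q2)]
    for _ in range(n_iters):
        q1 = normalize_log(log_joint @ q2)    # E_{-1}[log p] as a function of theta_1
        q2 = normalize_log(log_joint.T @ q1)  # E_{-2}[log p] as a function of theta_2
        history.append(elbo(log_joint, q1, q2))
    return q1, q2, history

# Illustrative table of p(theta_1, theta_2, x) at a fixed x (made-up numbers).
joint = np.array([[0.30, 0.05],
                  [0.10, 0.55]])
q1, q2, history = cavi(np.log(joint))

print("q1 =", q1)
print("q2 =", q2)
print("ELBO trace (first 5):", history[:5])
print("log evidence        :", np.log(joint.sum()))
```

Each pass can only increase the ELBO (or leave it unchanged), and the trace stays below the log evidence.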
These statements remain pretty general. We haven’t specified the functional form of $q_j$, but it will fall out of the $\mathbb{E}_{-j}[\log p(\theta, x)]$ for specific models (a good, simple example is the normal-gamma model from Murphy’s MLAPP or Bishop’s PRML). This form will then define the variational parameters $\lambda_j$, and the iterative algorithm will provide a way to compute them.
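As a sketch of how this plays out for the normal-gamma example, assume the standard textbook setup: $x_n \sim \mathcal{N}(\mu, \tau^{-1})$, $\mu \mid \tau \sim \mathcal{N}(\mu_0, (\lambda_0 \tau)^{-1})$, and $\tau \sim \mathrm{Gamma}(a_0, b_0)$ with $b_0$ a rate. Under the factorization $q(\mu, \tau) = q(\mu)\, q(\tau)$, the expected log joint turns out to be Gaussian in $\mu$ and gamma in $\tau$, which gives the updates below. The function name, default hyperparameters, and simulated data are all assumptions made for this illustration.

```python
import numpy as np

def cavi_normal_gamma(x, mu0=0.0, lambda0=1.0, a0=1.0, b0=1.0, n_iters=50):
    """Mean-field updates for q(mu) = N(mu_n, 1/lambda_n), q(tau) = Gamma(a_n, b_n)."""
    x = np.asarray(x, dtype=float)
    n = x.size

    # These two variational parameters do not depend on the other factor.
    mu_n = (lambda0 * mu0 + n * x.mean()) / (lambda0 + n)
    a_n = a0 + (n + 1) / 2.0

    e_tau = a0 / b0  # initial guess for E_q[tau]
    for _ in range(n_iters):
        # Update q(mu): its precision depends on E_q[tau].
        lambda_n = (lambda0 + n) * e_tau
        var_mu = 1.0 / lambda_n

        # Update q(tau): its rate depends on expectations under q(mu).
        e_sq_data = np.sum((x - mu_n) ** 2) + n * var_mu
        e_sq_prior = lambda0 * ((mu_n - mu0) ** 2 + var_mu)
        b_n = b0 + 0.5 * (e_sq_data + e_sq_prior)
        e_tau = a_n / b_n

    lambda_n = (lambda0 + n) * e_tau  # refresh with the final E_q[tau]
    return mu_n, lambda_n, a_n, b_n

# Simulated data (assumed for illustration): mean 2.0, standard deviation 0.5.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=1000)

mu_n, lambda_n, a_n, b_n = cavi_normal_gamma(x)
print("E_q[mu]  =", mu_n)
print("E_q[tau] =", a_n / b_n, "(sample precision:", 1.0 / np.var(x), ")")
```

Here the variational parameters are $(\mu_N, \lambda_N)$ for the mean and $(a_N, b_N)$ for the precision, and only $\lambda_N$ and $b_N$ actually need iterating.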
Some fun properties of the mean field approximation:
- The optimization problem is not guaranteed to be convex (Wainwright and Jordan)
- The optimization procedure is pretty much out of the box for exponential family models (maybe a future blog post will be dedicated to exploring this)
- This variational approximation tends to underestimate uncertainty (a consequence that pops out as a result of the KL divergence ordering, $\mathrm{KL}(q \,\|\, p)$ as opposed to $\mathrm{KL}(p \,\|\, q)$).
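The uncertainty point is easy to see when the target is a correlated bivariate Gaussian. A standard result (see e.g. Bishop's PRML) is that the optimal mean-field factor for coordinate $j$ under $\mathrm{KL}(q \,\|\, p)$ is a Gaussian with variance $1/\Lambda_{jj}$, where $\Lambda$ is the precision matrix of the target. The sketch below just compares that with the true marginal variance; the correlation value is an arbitrary choice for illustration.

```python
import numpy as np

rho = 0.9  # assumed correlation of the target, chosen for illustration
cov = np.array([[1.0, rho],
                [rho, 1.0]])
precision = np.linalg.inv(cov)

true_var = np.diag(cov)                    # marginal variances of the target
mean_field_var = 1.0 / np.diag(precision)  # variances of the optimal mean-field factors

print("true marginal variances:", true_var)
print("mean-field variances   :", mean_field_var)
```

For unit marginal variances the mean-field variance works out to $1 - \rho^2$, so the stronger the correlation, the more the approximation shrinks.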
This raises the question: can you do better with richer distributions? Yes! In future posts, I will take a look at some more complicated variational distributions and their performance.
references:
- Kevin Murphy. Machine Learning: A Probabilistic Perspective
- Martin Wainwright and Michael Jordan. Graphical Models, Exponential Families, and Variational Inference