[latexpage]

The log marginal likelihood is a central object for Bayesian inference with latent variable models:

\begin{align*}

\ln p(x\,|\,\theta) &= \ln\int p(x,z\,|\,\theta)\,dz

\end{align*}

where $x$ are observations, $z$ are latent variables, and $\theta$ are parameters. The integral over $z$ is typically intractable. Variational inference tackles this problem by approximating the posterior over $z$ with a simpler density $q(z)$. This density often has, for example, a factored structure. The approximating density is fit by maximizing a lower bound on the log marginal likelihood, or "evidence" (hence ELBO = evidence lower bound):

\begin{align*}

\text{ELBO}(q) &= \int q(z)\ln\frac{p(x,z\,|\,\theta)}{q(z)}\,dz \leq \ln p(x\,|\,\theta)

\end{align*}

The hope is that this bound will be tight enough that we can use it as a proxy for the marginal likelihood when reasoning about $\theta$. The ELBO is typically derived in one of two ways: via Jensen's inequality or by writing down the KL divergence. Let's quickly review these two derivations, and then I'll show you a cute third derivation that doesn't use Jensen's inequality or KL divergence.
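To make the bound concrete before deriving it, here is a minimal sketch on a toy model where $z$ takes only three values, so the integral over $z$ becomes a sum. The joint values in `joint` are made up for illustration, and the helper names `log_evidence` and `elbo` are my own choices, not anything standard.

```python
import math

# Hypothetical toy model: z takes three values, and these are made-up values
# of the joint p(x, z | theta) at one fixed observation x.
joint = [0.10, 0.25, 0.05]


def log_evidence(joint):
    """ln p(x | theta): the 'integral' over z is a finite sum here."""
    return math.log(sum(joint))


def elbo(q, joint):
    """Sum over z of q(z) * ln(p(x, z | theta) / q(z)), skipping q(z) = 0."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)


q_uniform = [1 / 3, 1 / 3, 1 / 3]
posterior = [pz / sum(joint) for pz in joint]  # exact p(z | x, theta)

print("log evidence     :", log_evidence(joint))
print("ELBO, uniform q  :", elbo(q_uniform, joint))
print("ELBO, q=posterior:", elbo(posterior, joint))
```

The uniform $q$ should give a value strictly below the log evidence, while setting $q$ to the exact posterior should close the gap up to floating-point error.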

### ELBO via Jensen's Inequality

The Jensen approach observes that the expectation of a concave function is always less than or equal to that function evaluated at the expectation of its argument. That is, if $f(\cdot)$ is concave, then

\begin{align*}

\mathbb{E}[f(X)] \leq f(\mathbb{E}[X])\,.

\end{align*}

The Jensen approach to the ELBO has two steps. First, multiply and divide inside the integral by $q(z)$. Then apply Jensen's inequality, observing that the natural log is concave and that we now have an expectation under $q(z)$:

\begin{align*}

\ln p(x\,|\,\theta) &= \ln\int p(x,z\,|\,\theta)\,dz\\

&= \ln \int q(z) \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\qquad\text{(multiply and divide by $q(z)$)}\\

&\geq \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\qquad\text{(Jensen's inequality)}\\

&= \text{ELBO}(q)\,.

\end{align*}
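The two sides of the Jensen step can be compared numerically. The sketch below assumes a toy conjugate model that is not part of the derivation above, $z\sim\mathcal{N}(0,1)$ and $x\,|\,z\sim\mathcal{N}(z,1)$, so that $p(x)=\mathcal{N}(x;0,2)$ is available exactly. It draws samples from an arbitrarily chosen Gaussian $q$ and compares the average of $\ln w$ with the log of the average of $w$, where $w = p(x,z)/q(z)$. The function names and the particular $q$ are illustrative assumptions.

```python
import math
import random

# Assumed toy model (not from the text above): z ~ N(0, 1), x | z ~ N(z, 1),
# so the exact evidence is p(x) = N(x; 0, 2).


def log_normal(value, mean, var):
    """Log density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (value - mean) ** 2 / var


def log_weights(x, q_mean, q_var, num_samples, rng):
    """Draw z_i ~ q and return ln[p(x, z_i) / q(z_i)] for each sample."""
    out = []
    for _ in range(num_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
        out.append(log_joint - log_normal(z, q_mean, q_var))
    return out


rng = random.Random(0)
x = 1.0
lw = log_weights(x, q_mean=0.3, q_var=1.0, num_samples=20000, rng=rng)

# Monte Carlo estimate of E_q[ln w], i.e. the ELBO.
elbo_estimate = sum(lw) / len(lw)
# Monte Carlo estimate of ln E_q[w], i.e. the log marginal likelihood.
log_mean_weight = math.log(sum(math.exp(v) for v in lw) / len(lw))
exact = log_normal(x, 0.0, 2.0)

print("E_q[ln w] (ELBO estimate):", elbo_estimate)
print("ln E_q[w]                :", log_mean_weight)
print("exact ln p(x)            :", exact)
```

The average of the logs can never exceed the log of the average, even on a finite sample, which is Jensen's inequality applied to the empirical distribution of the weights.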

### ELBO via Kullback-Leibler Divergence

Alternatively, we could directly write down the KL divergence between $q(z)$ and the posterior over latent variables $p(z\,|\,x,\theta)$,

\begin{align*}

KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)}\,dz\,.

\end{align*}

Now let's both add and subtract the log marginal likelihood $\ln p(x\,|\,\theta)$ from this:

\begin{align*}

KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)}\,dz

- \ln p(x\,|\,\theta) + \ln p(x\,|\,\theta)\,.

\end{align*}

This log marginal likelihood doesn't actually depend on $z$ so we can wrap it in an expectation under $q(z)$ if we want. Let's do that with the first one:

\begin{align*}

KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)}\,dz

- \int q(z)\ln p(x\,|\,\theta)\,dz + \ln p(x\,|\,\theta)\,.

\end{align*}

Now turn the first two terms into a single expectation under $q(z)$ and combine the logarithms using $\ln a - \ln b = \ln(a/b)$:

\begin{align*}

KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)p(x\,|\,\theta)}\,dz

+ \ln p(x\,|\,\theta)\,.

\end{align*}

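Rearrange things slightly, using $p(z\,|\,x,\theta)\,p(x\,|\,\theta) = p(x,z\,|\,\theta)$ and flipping the fraction to absorb the minus sign: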

\begin{align*}

\ln p(x\,|\,\theta) &= KL[ q(z)\,||\,p(z\,|\,x,\theta)] - \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)p(x\,|\,\theta)}\,dz\\

&= KL[ q(z)\,||\,p(z\,|\,x,\theta)] + \int q(z)\ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\\

&= KL[ q(z)\,||\,p(z\,|\,x,\theta)] + \text{ELBO}(q)\,.

\end{align*}

We know that KL divergences have to be non-negative, so this shows that the ELBO is a lower bound on the log marginal likelihood.
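The decomposition of the log marginal likelihood into a KL term plus the ELBO can be checked in closed form. The sketch below again assumes the toy model $z\sim\mathcal{N}(0,1)$, $x\,|\,z\sim\mathcal{N}(z,1)$, which is my own choice for illustration. With a Gaussian $q(z)=\mathcal{N}(m,s^2)$, the ELBO, the KL divergence to the posterior $\mathcal{N}(x/2,1/2)$, and the evidence are all available analytically. The function names are hypothetical.

```python
import math

# Assumed toy model (not from the text above): z ~ N(0, 1), x | z ~ N(z, 1).
# Then p(x) = N(x; 0, 2) and p(z | x) = N(z; x / 2, 1 / 2), all in closed form.


def log_evidence(x):
    """ln N(x; 0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x ** 2 / 4.0


def elbo(x, m, s2):
    """ELBO for q(z) = N(m, s2): E_q[ln p(x|z)] + E_q[ln p(z)] + entropy of q."""
    exp_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    exp_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m ** 2 + s2)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s2)
    return exp_log_lik + exp_log_prior + entropy


def kl_to_posterior(x, m, s2):
    """KL[N(m, s2) || N(x/2, 1/2)] between two univariate Gaussians."""
    post_mean, post_var = x / 2.0, 0.5
    return (0.5 * math.log(post_var / s2)
            + (s2 + (m - post_mean) ** 2) / (2.0 * post_var) - 0.5)


x = 1.0
for m, s2 in [(0.0, 1.0), (0.5, 0.5), (2.0, 0.1)]:
    total = elbo(x, m, s2) + kl_to_posterior(x, m, s2)
    print(f"m={m}, s2={s2}: KL + ELBO = {total:.6f}, ln p(x) = {log_evidence(x):.6f}")
```

The middle setting, $m = x/2$ and $s^2 = 1/2$, is the exact posterior, so its KL term vanishes and the ELBO equals the log evidence.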

### Alternative Derivation

Start with Bayes' rule:

\begin{align*}

p(z\,|\, x, \theta) &= \frac{p(x\,|\,z,\theta)\,p(z\,|\,\theta)}{p(x\,|\,\theta)}

\end{align*}

and observe that we can rearrange it to give us an expression for the marginal likelihood:

\begin{align*}

p(x\,|\,\theta) &= \frac{p(x\,|\,z,\theta)\,p(z\,|\,\theta)}{p(z\,|\,x,\theta)}

\end{align*}

This is true for **any** choice of $z$ (with nonzero posterior density). Siddhartha Chib has made some cool estimators using this trick, which I initially learned about from Iain Murray. Now let's take the log of both sides and stick the two terms in the numerator back together into the joint $p(x,z\,|\,\theta)$:

\begin{align*}

\ln p(x\,|\,\theta) &= \ln p(x,z\,|\,\theta) - \ln p(z\,|\,x,\theta)\,.

\end{align*}
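It is worth seeing numerically that the right-hand side really does not depend on $z$. The sketch below assumes the toy model $z\sim\mathcal{N}(0,1)$, $x\,|\,z\sim\mathcal{N}(z,1)$ (my choice for illustration, not part of the derivation), for which the posterior is $\mathcal{N}(x/2,1/2)$ and the evidence is $\mathcal{N}(x;0,2)$. The helper `log_evidence_via_z` is a hypothetical name.

```python
import math

# Assumed toy model (not from the text above): z ~ N(0, 1), x | z ~ N(z, 1),
# which gives p(x) = N(x; 0, 2) and p(z | x) = N(z; x / 2, 1 / 2).


def log_normal(value, mean, var):
    """Log density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (value - mean) ** 2 / var


def log_evidence_via_z(x, z):
    """ln p(x, z) - ln p(z | x), evaluated at a single arbitrary z."""
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    log_posterior = log_normal(z, x / 2.0, 0.5)
    return log_joint - log_posterior


x = 1.0
for z in [-3.0, 0.0, 0.5, 10.0]:
    print(z, log_evidence_via_z(x, z))  # the same value for every z
print("exact ln p(x):", log_normal(x, 0.0, 2.0))
```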

Remember, this is true for all $z$. Let's add and subtract the log of an arbitrary density $q(z)$:

\begin{align*}

\ln p(x\,|\,\theta) &= \ln p(x,z\,|\,\theta) - \ln p(z\,|\,x,\theta) - \ln q(z) + \ln q(z)\\

&= \ln p(x,z\,|\,\theta) - \ln q(z) - \ln\frac{p(z\,|\, x,\theta)}{q(z)}\,.

\end{align*}

Now, recall that $\ln u \leq u - 1$ for all $u > 0$, and so $-\ln u \geq 1 - u$. Therefore

\begin{align*}

\ln p(x\,|\,\theta) &=\ln p(x,z\,|\,\theta) - \ln q(z) - \ln\frac{p(z\,|\, x,\theta)}{q(z)} \geq \ln p(x,z\,|\,\theta) - \ln q(z) + 1 - \frac{p(z\,|\, x,\theta)}{q(z)}\,.

\end{align*}

Since this is true for all $z$, it's also true in expectation under any distribution we want. A natural choice is $q(z)$:

\begin{align*}

\ln p(x\,|\,\theta) &\geq \int q(z)\left(\ln p(x,z\,|\,\theta) - \ln q(z) + 1 - \frac{p(z\,|\, x,\theta)}{q(z)}\right)\, dz\\

&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz + 1 - \int q(z)\frac{p(z\,|\, x,\theta)}{q(z)}\,dz\\

&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz + 1 - \int p(z\,|\, x,\theta)\,dz\\

&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz + 1 - 1\\

&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\\

&= \text{ELBO}(q)

\end{align*}

No reference here to Jensen's inequality or KL divergence. One caveat, however, is that the log inequality I used here is one way to prove non-negativity of KL divergence. You could do these steps in a different order and it would look like directly taking advantage of the non-negativity of KL in the lower bound.
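As a sanity check on the third derivation, here is a minimal sketch on a discrete toy model where $z$ takes three values. The numbers in `joint` and `q` are made up for illustration. It verifies that the pointwise expression $\ln p(x,z\,|\,\theta) - \ln q(z) + 1 - p(z\,|\,x,\theta)/q(z)$ sits below $\ln p(x\,|\,\theta)$ at every $z$, and that averaging it under $q$ gives back the ELBO.

```python
import math

# Hypothetical toy model: three values of z, with made-up joint values
# p(x, z | theta) at one fixed x and an arbitrary choice of q(z).
joint = [0.10, 0.25, 0.05]
q = [0.5, 0.3, 0.2]

evidence = sum(joint)
posterior = [pz / evidence for pz in joint]

# Pointwise lower bound: ln p(x,z) - ln q(z) + 1 - p(z|x)/q(z), one per z.
pointwise = [
    math.log(pz) - math.log(qz) + 1.0 - post / qz
    for pz, qz, post in zip(joint, q, posterior)
]

# Averaging the pointwise bound under q recovers the ELBO,
# because the "+ 1 - p(z|x)/q(z)" part averages to zero.
averaged = sum(qz * b for qz, b in zip(q, pointwise))
elbo = sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint))

print("ln p(x)         :", math.log(evidence))
print("pointwise bounds:", pointwise)
print("average under q :", averaged)
print("ELBO            :", elbo)
```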