# The ELBO without Jensen, Kullback, or Leibler

Ryan Adams · June 17, 2020

The log marginal likelihood is a central object for Bayesian inference with latent variable models:
\begin{align*}
\ln p(x\,|\,\theta) &= \ln\int p(x,z\,|\,\theta)\,dz
\end{align*}
where $x$ are observations, $z$ are latent variables, and $\theta$ are parameters. Variational inference tackles this problem by approximating the posterior over $z$ with a simpler density $q(z)$. Often this density has a factored structure, for example a fully factorized (mean-field) product over the components of $z$. The approximating density is fit by maximizing a lower bound on the log marginal likelihood, or "evidence" (hence ELBO = evidence lower bound):
\begin{align*}
\text{ELBO}(q) &= \int q(z)\ln\frac{p(x,z\,|\,\theta)}{q(z)}\,dz \leq \ln p(x\,|\,\theta)
\end{align*}
The hope is that the bound is tight enough that we can use it as a proxy for the marginal likelihood when reasoning about $\theta$. The ELBO is typically derived in one of two ways: via Jensen's inequality or via the KL divergence. Let's quickly review these two derivations, and then I'll show you a cute third derivation that uses neither Jensen's inequality nor the KL divergence.
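Before deriving the bound, it can help to see it numerically. Here is a minimal sketch on a hypothetical two-state latent variable model (all probabilities are made-up illustrative numbers): any $q(z)$ gives a lower bound on $\ln p(x\,|\,\theta)$, and the bound is tight when $q$ equals the exact posterior.

```python
import math

# Hypothetical two-state latent model; all numbers are made up for illustration.
prior = {0: 0.3, 1: 0.7}            # p(z | theta)
lik   = {0: 0.2, 1: 0.6}            # p(x | z, theta) for the single observed x
joint = {z: lik[z] * prior[z] for z in prior}   # p(x, z | theta)

marginal = sum(joint.values())                  # p(x | theta)
log_marginal = math.log(marginal)               # ln p(x | theta)

def elbo(q):
    """ELBO(q) = sum_z q(z) ln[ p(x, z | theta) / q(z) ]."""
    return sum(q[z] * math.log(joint[z] / q[z]) for z in q)

# Any q gives a lower bound on the log marginal likelihood ...
assert elbo({0: 0.5, 1: 0.5}) <= log_marginal

# ... and q equal to the exact posterior p(z | x, theta) makes it tight.
posterior = {z: joint[z] / marginal for z in joint}
assert abs(elbo(posterior) - log_marginal) < 1e-12
```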

### ELBO via Jensen's Inequality

The Jensen's inequality approach observes that the expectation of a concave function is always less than or equal to that function evaluated at the expectation of its argument. That is, if $f(\cdot)$ is concave, then
\begin{align*}
\mathbb{E}[f(X)] \leq f(\mathbb{E}[X])\,.
\end{align*}
There are two steps in this approach to the ELBO. First, multiply and divide inside the integral by $q(z)$; then apply Jensen's inequality, observing that the natural log is concave and that we now have an expectation under $q(z)$:
\begin{align*}
\ln p(x\,|\,\theta) &= \ln\int p(x,z\,|\,\theta)\,dz\\
&= \ln \int q(z) \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\qquad\text{(multiply and divide by $q(z)$)}\\
&\geq \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\qquad\text{(Jensen's inequality)}\\
&= \text{ELBO}(q)\,.
\end{align*}
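The inequality in the second step is easy to spot-check empirically: for the concave function $\ln$, the sample average of the logs never exceeds the log of the sample average.

```python
import math
import random

random.seed(0)
# Jensen for concave f: E[f(X)] <= f(E[X]). Check with f = ln on random positives.
xs = [random.uniform(0.1, 5.0) for _ in range(10_000)]
mean_of_log = sum(math.log(x) for x in xs) / len(xs)
log_of_mean = math.log(sum(xs) / len(xs))
assert mean_of_log <= log_of_mean
```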

### ELBO via Kullback-Leibler Divergence

Alternatively, we could directly write down the KL divergence between $q(z)$ and the posterior over latent variables $p(z\,|\,x,\theta)$,
\begin{align*}
KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)}\,dz\,.
\end{align*}
Now let's both add and subtract the log marginal likelihood $\ln p(x\,|\,\theta)$ from this:
\begin{align*}
KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)}\,dz
- \ln p(x\,|\,\theta) + \ln p(x\,|\,\theta)\,.
\end{align*}
This log marginal likelihood doesn't actually depend on $z$, so we can wrap it in an expectation under $q(z)$ if we want. Let's do that with the subtracted copy:
\begin{align*}
KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)}\,dz
- \int q(z)\ln p(x\,|\,\theta)\,dz + \ln p(x\,|\,\theta)\,.
\end{align*}
Now turn the first two terms into a single expectation under $q(z)$ and merge the logarithms:
\begin{align*}
KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)p(x\,|\,\theta)}\,dz
+ \ln p(x\,|\,\theta)\,.
\end{align*}
Rearrange things slightly, using $p(z\,|\,x,\theta)\,p(x\,|\,\theta) = p(x,z\,|\,\theta)$ to simplify the second term:
\begin{align*}
\ln p(x\,|\,\theta) &= KL[ q(z)\,||\,p(z\,|\,x,\theta)] - \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)p(x\,|\,\theta)}\,dz\\
&= KL[ q(z)\,||\,p(z\,|\,x,\theta)] + \int q(z)\ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\\
&= KL[ q(z)\,||\,p(z\,|\,x,\theta)] + \text{ELBO}(q)\,.
\end{align*}
KL divergences are always non-negative, so this shows that the ELBO is a lower bound on the log marginal likelihood.
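The decomposition is exact, not just a bound, which is easy to check numerically. A minimal sketch on a hypothetical two-state model (made-up numbers): $\ln p(x\,|\,\theta)$ equals $\text{KL} + \text{ELBO}$ to machine precision for an arbitrary $q$.

```python
import math

# Hypothetical two-state latent model; all numbers are made up for illustration.
prior = {0: 0.3, 1: 0.7}
lik   = {0: 0.2, 1: 0.6}
joint = {z: lik[z] * prior[z] for z in prior}   # p(x, z | theta)
marginal = sum(joint.values())                  # p(x | theta)
posterior = {z: joint[z] / marginal for z in joint}  # p(z | x, theta)

q = {0: 0.5, 1: 0.5}                 # an arbitrary variational distribution
kl   = sum(q[z] * math.log(q[z] / posterior[z]) for z in q)
elbo = sum(q[z] * math.log(joint[z] / q[z]) for z in q)

# ln p(x | theta) = KL[q || p(z | x, theta)] + ELBO(q), exactly.
assert abs(math.log(marginal) - (kl + elbo)) < 1e-12
assert kl >= 0.0
```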

### Alternative Derivation

Start with Bayes' rule:
\begin{align*}
p(z\,|\, x, \theta) &= \frac{p(x\,|\,z,\theta)\,p(z\,|\,\theta)}{p(x\,|\,\theta)}
\end{align*}
and observe that we can rearrange it to give us an expression for the marginal likelihood:
\begin{align*}
p(x\,|\,\theta) &= \frac{p(x\,|\,z,\theta)\,p(z\,|\,\theta)}{p(z\,|\,x,\theta)}
\end{align*}
This is true for any choice of $z$. Siddhartha Chib has built some cool estimators using this trick, which I first learned about from Iain Murray. Now let's take the log of both sides, combining the likelihood and prior terms into the joint $p(x,z\,|\,\theta)$:
\begin{align*}
\ln p(x\,|\,\theta) &= \ln p(x,z\,|\,\theta) - \ln p(z\,|\,x,\theta)\,.
\end{align*}
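It's worth pausing on how strong this identity is: it holds pointwise, for every single value of $z$. A quick check on a hypothetical two-state model (made-up numbers):

```python
import math

# Hypothetical two-state latent model; all numbers are made up for illustration.
prior = {0: 0.3, 1: 0.7}
lik   = {0: 0.2, 1: 0.6}
joint = {z: lik[z] * prior[z] for z in prior}   # p(x, z | theta)
marginal = sum(joint.values())                  # p(x | theta)
posterior = {z: joint[z] / marginal for z in joint}  # p(z | x, theta)

# Any single z recovers the log marginal likelihood exactly:
# ln p(x | theta) = ln p(x, z | theta) - ln p(z | x, theta).
for z in joint:
    assert abs(math.log(marginal)
               - (math.log(joint[z]) - math.log(posterior[z]))) < 1e-12
```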
Remember, this is true for all $z$. Let's add and subtract the log of an arbitrary density $q(z)$:
\begin{align*}
\ln p(x\,|\,\theta) &= \ln p(x,z\,|\,\theta) - \ln p(z\,|\,x,\theta) - \ln q(z) + \ln q(z)\\
&= \ln p(x,z\,|\,\theta) - \ln q(z) - \ln\frac{p(z\,|\, x,\theta)}{q(z)}\,.
\end{align*}
Now, recall that $\ln u \leq u - 1$ and so $-\ln u \geq 1 - u$. Therefore
\begin{align*}
\ln p(x\,|\,\theta) &=\ln p(x,z\,|\,\theta) - \ln q(z) - \ln\frac{p(z\,|\, x,\theta)}{q(z)} \geq \ln p(x,z\,|\,\theta) - \ln q(z) + 1 - \frac{p(z\,|\, x,\theta)}{q(z)}\,.
\end{align*}
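This elementary inequality is easy to verify on a grid of positive values:

```python
import math

# Spot-check ln u <= u - 1 on a grid of positive u (equality only at u = 1).
for i in range(1, 1000):
    u = i / 100.0
    assert math.log(u) <= u - 1.0 + 1e-12
```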
Since this is true for all $z$, it's also true in expectation under any distribution we want, and $q(z)$ is a natural choice:
\begin{align*}
\ln p(x\,|\,\theta) &\geq \int q(z)\left(\ln p(x,z\,|\,\theta) - \ln q(z) + 1 - \frac{p(z\,|\, x,\theta)}{q(z)}\right)\, dz\\
&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz + 1 - \int q(z)\frac{p(z\,|\, x,\theta)}{q(z)}\,dz\\
&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz + 1 - \int p(z\,|\, x,\theta)\,dz\\
&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz + 1 - 1\\
&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\\
&= \text{ELBO}(q)
\end{align*}
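The cancellation in the last few lines is exact: the $+1$ and the $-\mathbb{E}_q[p(z\,|\,x,\theta)/q(z)]$ terms annihilate each other, leaving precisely the ELBO. A numeric sketch on a hypothetical two-state model (made-up numbers):

```python
import math

# Hypothetical two-state latent model; all numbers are made up for illustration.
prior = {0: 0.3, 1: 0.7}
lik   = {0: 0.2, 1: 0.6}
joint = {z: lik[z] * prior[z] for z in prior}   # p(x, z | theta)
marginal = sum(joint.values())                  # p(x | theta)
posterior = {z: joint[z] / marginal for z in joint}  # p(z | x, theta)
q = {0: 0.5, 1: 0.5}                 # an arbitrary variational distribution

# Expectation under q of the pointwise lower bound ...
bound = sum(q[z] * (math.log(joint[z]) - math.log(q[z])
                    + 1.0 - posterior[z] / q[z]) for z in q)
elbo  = sum(q[z] * math.log(joint[z] / q[z]) for z in q)

# ... equals the ELBO exactly (the +1 and the ratio term cancel) ...
assert abs(bound - elbo) < 1e-12
# ... and lower-bounds the log marginal likelihood.
assert bound <= math.log(marginal)
```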
No reference here to Jensen's inequality or KL divergence. One caveat, however, is that the log inequality used here is itself one way to prove the non-negativity of the KL divergence. Done in a different order, the derivation would look like directly exploiting the non-negativity of the KL term in the lower bound.
