# The ELBO without Jensen, Kullback, or Leibler

The log marginal likelihood is a central object for Bayesian inference with latent variable models:
\begin{align*}
\ln p(x\,|\,\theta) &= \ln\int p(x,z\,|\,\theta)\,dz
\end{align*}
where $x$ are observations, $z$ are latent variables, and $\theta$ are parameters. Variational inference tackles this problem by approximating the posterior over $z$ with a simpler density $q(z)$. Often this density has a factored structure, for example a product of independent densities over the latent variables (the "mean-field" assumption). The approximating density is fit by maximizing a lower bound on the log marginal likelihood, or "evidence" (hence ELBO = evidence lower bound):
\begin{align*}
\text{ELBO}(q) &= \int q(z)\ln\frac{p(x,z\,|\,\theta)}{q(z)}\,dz \leq \ln p(x\,|\,\theta)
\end{align*}
The hope is that the bound is tight enough to serve as a proxy for the marginal likelihood when reasoning about $\theta$. The ELBO is typically derived in one of two ways: via Jensen's inequality or by writing down a KL divergence. Let's quickly review these two derivations, and then I'll show you a cute third derivation that uses neither Jensen's inequality nor KL divergence.

### ELBO via Jensen's Inequality

The Jensen's inequality approach uses the fact that the expectation of a concave function is always less than or equal to that function evaluated at the expectation of its argument. That is, if $f(\cdot)$ is concave, then
\begin{align*}
\mathbb{E}[f(X)] \leq f(\mathbb{E}[X])\,.
\end{align*}
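As a quick empirical sanity check (a toy sketch with arbitrary numbers, not part of the derivation), we can see the gap for the concave function $f = \ln$:

```python
import math
import random

random.seed(0)
# Draw a positive random variable X and compare E[ln X] with ln E[X].
samples = [random.expovariate(1.0) + 0.1 for _ in range(100_000)]
lhs = sum(math.log(s) for s in samples) / len(samples)  # estimate of E[ln X]
rhs = math.log(sum(samples) / len(samples))             # ln of estimate of E[X]
# Jensen's inequality says lhs <= rhs, since ln is concave.
print(lhs <= rhs)
```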
There are two steps in this approach to the ELBO. First, multiply and divide inside the integral by $q(z)$; then apply Jensen's inequality, observing that the natural log is concave and that we now have an expectation under $q(z)$:
\begin{align*}
\ln p(x\,|\,\theta) &= \ln\int p(x,z\,|\,\theta)\,dz\\
&= \ln \int q(z) \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\qquad\text{(multiply and divide by $q(z)$)}\\
&\geq \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\qquad\text{(Jensen's inequality)}\\
&= \text{ELBO}(q)\,.
\end{align*}
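To make the bound concrete, here's a minimal numerical sketch using a hypothetical two-component model (binary latent $z$, unit-variance Gaussian likelihood, made-up parameters); any choice of $q(z)$ gives a value at or below the true log marginal likelihood:

```python
import math

# Hypothetical toy model: binary latent z with Gaussian likelihood.
prior = [0.3, 0.7]    # p(z | theta)
means = [-1.0, 2.0]   # likelihood mean for each value of z

def normal_pdf(x, mu):
    # Unit-variance Gaussian density.
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

x = 0.5
joint = [prior[z] * normal_pdf(x, means[z]) for z in range(2)]  # p(x, z | theta)
log_marginal = math.log(sum(joint))                             # ln p(x | theta)

q = [0.5, 0.5]  # an arbitrary variational distribution
elbo = sum(q[z] * math.log(joint[z] / q[z]) for z in range(2))
assert elbo <= log_marginal  # the Jensen bound holds
```

Setting $q(z)$ equal to the exact posterior makes the bound tight.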

### ELBO via Kullback-Leibler Divergence

Alternatively, we could directly write down the KL divergence between $q(z)$ and the posterior over latent variables $p(z\,|\,x,\theta)$,
\begin{align*}
KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)}\,dz\,.
\end{align*}
Now let's both add and subtract the log marginal likelihood $\ln p(x\,|\,\theta)$ from this:
\begin{align*}
KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)}\,dz
- \ln p(x\,|\,\theta) + \ln p(x\,|\,\theta)\,.
\end{align*}
This log marginal likelihood doesn't actually depend on $z$ so we can wrap it in an expectation under $q(z)$ if we want. Let's do that with the first one:
\begin{align*}
KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)}\,dz
- \int q(z)\ln p(x\,|\,\theta)\,dz + \ln p(x\,|\,\theta)\,.
\end{align*}
Now turn the first two terms into a single expectation under $q(z)$ and combine the logarithms:
\begin{align*}
KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)p(x\,|\,\theta)}\,dz
+ \ln p(x\,|\,\theta)\,.
\end{align*}
Rearrange things slightly:
\begin{align*}
\ln p(x\,|\,\theta) &= KL[ q(z)\,||\,p(z\,|\,x,\theta)] - \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)p(x\,|\,\theta)}\,dz\\
&= KL[ q(z)\,||\,p(z\,|\,x,\theta)] + \int q(z)\ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\\
&= KL[ q(z)\,||\,p(z\,|\,x,\theta)] + \text{ELBO}(q)\,.
\end{align*}
KL divergences are always non-negative, so this shows that the ELBO is a lower bound on the log marginal likelihood.
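We can check this decomposition numerically with a hypothetical toy model (two-state latent variable, made-up parameters); the KL term and the ELBO sum exactly to the log marginal likelihood for any $q(z)$:

```python
import math

# Hypothetical toy model: binary latent z with unit-variance Gaussian likelihood.
prior = [0.3, 0.7]
means = [-1.0, 2.0]

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

x = 0.5
joint = [prior[z] * normal_pdf(x, means[z]) for z in range(2)]
marginal = sum(joint)
posterior = [j / marginal for j in joint]   # p(z | x, theta)

q = [0.9, 0.1]                              # any q(z) works here
kl = sum(q[z] * math.log(q[z] / posterior[z]) for z in range(2))
elbo = sum(q[z] * math.log(joint[z] / q[z]) for z in range(2))
# The identity ln p(x) = KL + ELBO holds exactly.
assert abs(math.log(marginal) - (kl + elbo)) < 1e-12
```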

### Alternative Derivation

We start with Bayes' rule for the posterior over the latent variables:
\begin{align*}
p(z\,|\, x, \theta) &= \frac{p(x\,|\,z,\theta)\,p(z\,|\,\theta)}{p(x\,|\,\theta)}
\end{align*}
and observe that we can rearrange it to give us an expression for the marginal likelihood:
\begin{align*}
p(x\,|\,\theta) &= \frac{p(x\,|\,z,\theta)\,p(z\,|\,\theta)}{p(z\,|\,x,\theta)}
\end{align*}
This is true for any choice of $z$. Siddhartha Chib has built some cool estimators using this trick, which I first learned about from Iain Murray. Now let's take the log of both sides and stick the first two terms back together into the log joint:
\begin{align*}
\ln p(x\,|\,\theta) &= \ln p(x,z\,|\,\theta) - \ln p(z\,|\,x,\theta)\,.
\end{align*}
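That this identity holds pointwise, for every single $z$, is easy to verify in a hypothetical toy model (binary latent, made-up parameters):

```python
import math

# Hypothetical toy model: binary latent z with unit-variance Gaussian likelihood.
prior = [0.3, 0.7]
means = [-1.0, 2.0]

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

x = 0.5
joint = [prior[z] * normal_pdf(x, means[z]) for z in range(2)]
marginal = sum(joint)
posterior = [j / marginal for j in joint]

# p(x | theta) = p(x, z | theta) / p(z | x, theta) for EVERY choice of z.
for z in range(2):
    assert abs(joint[z] / posterior[z] - marginal) < 1e-12
```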
Remember, this is true for all $z$. Let's add and subtract the log of an arbitrary density $q(z)$:
\begin{align*}
\ln p(x\,|\,\theta) &= \ln p(x,z\,|\,\theta) - \ln p(z\,|\,x,\theta) - \ln q(z) + \ln q(z)\\
&= \ln p(x,z\,|\,\theta) - \ln q(z) - \ln\frac{p(z\,|\, x,\theta)}{q(z)}\,.
\end{align*}
Now, recall that $\ln u \leq u - 1$ and so $-\ln u \geq 1 - u$. Therefore
\begin{align*}
\ln p(x\,|\,\theta) &=\ln p(x,z\,|\,\theta) - \ln q(z) - \ln\frac{p(z\,|\, x,\theta)}{q(z)} \geq \ln p(x,z\,|\,\theta) - \ln q(z) + 1 - \frac{p(z\,|\, x,\theta)}{q(z)}\,.
\end{align*}
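The inequality $\ln u \leq u - 1$ is tight only at $u = 1$; a quick check on an arbitrary grid of positive values confirms it:

```python
import math

# Check ln(u) <= u - 1 on a grid of positive values (equality at u = 1).
for i in range(1, 201):
    u = i / 20.0
    assert math.log(u) <= u - 1.0 + 1e-12
```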
Since this is true for all $z$, it's also true in expectation under any distribution we want. $q(z)$ is a natural choice:
\begin{align*}
\ln p(x\,|\,\theta) &\geq \int q(z)\left(\ln p(x,z\,|\,\theta) - \ln q(z) + 1 - \frac{p(z\,|\, x,\theta)}{q(z)}\right)\, dz\\
&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz + 1 - \int q(z)\frac{p(z\,|\, x,\theta)}{q(z)}\,dz\\
&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz + 1 - \int p(z\,|\, x,\theta)\,dz\\
&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz + 1 - 1\\
&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\\
&= \text{ELBO}(q)
\end{align*}
No reference here to Jensen's inequality or K-L divergence. One caveat, however, is that the log inequality I used here is one way to prove non-negativity of K-L divergence. You could do this in a different order and it would look like directly taking advantage of the non-negativity of KL in the lower bound.
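As a final sanity check of the cancellation at the end (the $+1$ and the integral of the posterior), here's a hypothetical toy model (binary latent, made-up numbers) confirming that the expectation of the pointwise bound collapses exactly to the ELBO:

```python
import math

# Hypothetical toy model: binary latent z with unit-variance Gaussian likelihood.
prior = [0.3, 0.7]
means = [-1.0, 2.0]

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

x = 0.5
joint = [prior[z] * normal_pdf(x, means[z]) for z in range(2)]
marginal = sum(joint)
posterior = [j / marginal for j in joint]
q = [0.6, 0.4]

# Expectation under q of the pointwise lower bound...
pointwise = sum(q[z] * (math.log(joint[z]) - math.log(q[z])
                        + 1.0 - posterior[z] / q[z]) for z in range(2))
# ...equals the ELBO exactly: the +1 and the expectation of p(z|x)/q(z) cancel.
elbo = sum(q[z] * math.log(joint[z] / q[z]) for z in range(2))
assert abs(pointwise - elbo) < 1e-12
assert elbo <= math.log(marginal)  # and the bound itself holds
```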

A few years ago, I co-founded the Talking Machines podcast with Katy Gorman and we co-hosted it together for two seasons. After those two seasons, I decided it was time to move on and let someone else host it, but it was a highly rewarding experience. Since then, I've contemplated starting a YouTube channel, motivated in part by that outreach experience and also by the amazing people on the platform whom I've enjoyed watching and learning from. I've often felt there was a hole in YouTube's coverage of machine learning: there are zillions of deep learning "explainer" videos and of course many online lectures, but there's not a lot that 1) explains the broader technical context of machine learning, and 2) provides insight into the research process as a human endeavor.

Things get in the way, though, and it's hard to make that first video and feel good about what you've made. So despite having some nice equipment and footage in the can, I had never posted anything.

Then came COVID-19, and I had to take my 100+ person course online in a hurry. As soon as the undergrads got kicked off campus, I polled the class to understand what time zones they were in and how I could create a good course experience for as many of the students as possible. There were 11 time zones represented, so I realized that live lectures on Zoom were not going to work. Instead, I would need to pre-record videos. This was the forcing function I needed to finally start making and publishing videos.

I took the plunge and learned how to use manim, the system that Grant Sanderson created for 3Blue1Brown. It's a kind of crazy blend of Python, LaTeX, and various mathematical abstractions. I developed a workflow in which I'd create a sequence of scenes for a topic and then stitch them together with a voiceover in Final Cut Pro. Each video was a large amount of work, even though they're only a few minutes long. The response from the students was positive; the videos ended up being very dense, but the format allowed for good visualizations, and students could pause and rewind easily.

The course itself --- COS 302: Mathematics for Numerical Computing and Machine Learning --- aims to provide background on linear algebra and probability, along with some basic vector calculus and optimization, to prepare students for higher-level courses in machine learning, computer vision, graphics, etc. Essentially, the course tries to close a gap: computer science majors tend not to take much continuous mathematics, but several important subareas (such as ML) depend on exactly that kind of math. We covered most of the linear algebra in the first half of the course, before the COVID-19 pandemic started, so there aren't videos about that.

Here is the first one -- explaining why we care about differentiation in machine learning: