# The ELBO without Jensen, Kullback, or Leibler

The log marginal likelihood is a central object for Bayesian inference with latent variable models:
\begin{align*}
\ln p(x\,|\,\theta) &= \ln\int p(x,z\,|\,\theta)\,dz
\end{align*}
where $x$ are observations, $z$ are latent variables, and $\theta$ are parameters. Variational inference tackles this problem by approximating the posterior over $z$ with a simpler density $q(z)$. Often this density has a factored structure, for example a product of independent densities over the latent variables (the "mean-field" assumption). The approximating density is fit by maximizing a lower bound on the log marginal likelihood, or "evidence" (hence ELBO = evidence lower bound):
\begin{align*}
\text{ELBO}(q) &= \int q(z)\ln\frac{p(x,z\,|\,\theta)}{q(z)}\,dz \leq \ln p(x\,|\,\theta)
\end{align*}
The hope is that the bound is tight enough to serve as a proxy for the marginal likelihood when reasoning about $\theta$. The ELBO is typically derived in one of two ways: via Jensen's inequality or by writing down a KL divergence. Let's quickly review these two derivations, and then I'll show you a cute third derivation that uses neither Jensen's inequality nor KL divergence.

### ELBO via Jensen's Inequality

The Jensen's inequality approach uses the fact that the expectation of a concave function is always less than or equal to that function evaluated at the expectation of its argument. That is, if $f(\cdot)$ is concave, then
\begin{align*}
\mathbb{E}[f(X)] \leq f(\mathbb{E}[X])\,.
\end{align*}
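As a quick empirical sanity check (a toy sketch with arbitrary numbers, not part of the derivation), we can see the gap for the concave function $f = \ln$:

```python
import math
import random

random.seed(0)
# Draw a positive random variable X and compare E[ln X] with ln E[X].
samples = [random.expovariate(1.0) + 0.1 for _ in range(100_000)]
lhs = sum(math.log(s) for s in samples) / len(samples)  # estimate of E[ln X]
rhs = math.log(sum(samples) / len(samples))             # ln of estimate of E[X]
# Jensen's inequality says lhs <= rhs, since ln is concave.
print(lhs <= rhs)
```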
There are two steps in this approach to the ELBO. First, multiply and divide inside the integral by $q(z)$; then apply Jensen's inequality, observing that the natural log is concave and that we now have an expectation under $q(z)$:
\begin{align*}
\ln p(x\,|\,\theta) &= \ln\int p(x,z\,|\,\theta)\,dz\\
&= \ln \int q(z) \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\qquad\text{(multiply and divide by $q(z)$)}\\
&\geq \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\qquad\text{(Jensen's inequality)}\\
&= \text{ELBO}(q)\,.
\end{align*}
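To make the bound concrete, here's a minimal numerical sketch using a hypothetical two-component model (binary latent $z$, unit-variance Gaussian likelihood, made-up parameters); any choice of $q(z)$ gives a value at or below the true log marginal likelihood:

```python
import math

# Hypothetical toy model: binary latent z with Gaussian likelihood.
prior = [0.3, 0.7]    # p(z | theta)
means = [-1.0, 2.0]   # likelihood mean for each value of z

def normal_pdf(x, mu):
    # Unit-variance Gaussian density.
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

x = 0.5
joint = [prior[z] * normal_pdf(x, means[z]) for z in range(2)]  # p(x, z | theta)
log_marginal = math.log(sum(joint))                             # ln p(x | theta)

q = [0.5, 0.5]  # an arbitrary variational distribution
elbo = sum(q[z] * math.log(joint[z] / q[z]) for z in range(2))
assert elbo <= log_marginal  # the Jensen bound holds
```

Setting $q(z)$ equal to the exact posterior makes the bound tight.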

### ELBO via Kullback-Leibler Divergence

Alternatively, we could directly write down the KL divergence between $q(z)$ and the posterior over latent variables $p(z\,|\,x,\theta)$,
\begin{align*}
KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)}\,dz\,.
\end{align*}
Now let's both add and subtract the log marginal likelihood $\ln p(x\,|\,\theta)$ from this:
\begin{align*}
KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)}\,dz
- \ln p(x\,|\,\theta) + \ln p(x\,|\,\theta)\,.
\end{align*}
This log marginal likelihood doesn't actually depend on $z$ so we can wrap it in an expectation under $q(z)$ if we want. Let's do that with the first one:
\begin{align*}
KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)}\,dz
- \int q(z)\ln p(x\,|\,\theta)\,dz + \ln p(x\,|\,\theta)\,.
\end{align*}
Now turn the first two terms into a single expectation under $q(z)$ and combine the logarithms:
\begin{align*}
KL[ q(z)\,||\,p(z\,|\,x,\theta)] &= \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)p(x\,|\,\theta)}\,dz
+ \ln p(x\,|\,\theta)\,.
\end{align*}
Rearrange things slightly:
\begin{align*}
\ln p(x\,|\,\theta) &= KL[ q(z)\,||\,p(z\,|\,x,\theta)] - \int q(z)\ln \frac{q(z)}{p(z\,|\,x,\theta)p(x\,|\,\theta)}\,dz\\
&= KL[ q(z)\,||\,p(z\,|\,x,\theta)] + \int q(z)\ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\\
&= KL[ q(z)\,||\,p(z\,|\,x,\theta)] + \text{ELBO}(q)\,.
\end{align*}
KL divergences are always non-negative, so this shows that the ELBO is a lower bound on the log marginal likelihood.
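We can check this decomposition numerically with a hypothetical toy model (two-state latent variable, made-up parameters); the KL term and the ELBO sum exactly to the log marginal likelihood for any $q(z)$:

```python
import math

# Hypothetical toy model: binary latent z with unit-variance Gaussian likelihood.
prior = [0.3, 0.7]
means = [-1.0, 2.0]

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

x = 0.5
joint = [prior[z] * normal_pdf(x, means[z]) for z in range(2)]
marginal = sum(joint)
posterior = [j / marginal for j in joint]   # p(z | x, theta)

q = [0.9, 0.1]                              # any q(z) works here
kl = sum(q[z] * math.log(q[z] / posterior[z]) for z in range(2))
elbo = sum(q[z] * math.log(joint[z] / q[z]) for z in range(2))
# The identity ln p(x) = KL + ELBO holds exactly.
assert abs(math.log(marginal) - (kl + elbo)) < 1e-12
```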

### Alternative Derivation

We start with Bayes' rule for the posterior over the latent variables:
\begin{align*}
p(z\,|\, x, \theta) &= \frac{p(x\,|\,z,\theta)\,p(z\,|\,\theta)}{p(x\,|\,\theta)}
\end{align*}
and observe that we can rearrange it to give us an expression for the marginal likelihood:
\begin{align*}
p(x\,|\,\theta) &= \frac{p(x\,|\,z,\theta)\,p(z\,|\,\theta)}{p(z\,|\,x,\theta)}
\end{align*}
This is true for any choice of $z$. Siddhartha Chib has built some cool estimators using this trick, which I first learned about from Iain Murray. Now let's take the log of both sides and stick the first two terms back together into the log joint:
\begin{align*}
\ln p(x\,|\,\theta) &= \ln p(x,z\,|\,\theta) - \ln p(z\,|\,x,\theta)\,.
\end{align*}
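That this identity holds pointwise, for every single $z$, is easy to verify in a hypothetical toy model (binary latent, made-up parameters):

```python
import math

# Hypothetical toy model: binary latent z with unit-variance Gaussian likelihood.
prior = [0.3, 0.7]
means = [-1.0, 2.0]

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

x = 0.5
joint = [prior[z] * normal_pdf(x, means[z]) for z in range(2)]
marginal = sum(joint)
posterior = [j / marginal for j in joint]

# p(x | theta) = p(x, z | theta) / p(z | x, theta) for EVERY choice of z.
for z in range(2):
    assert abs(joint[z] / posterior[z] - marginal) < 1e-12
```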
Remember, this is true for all $z$. Let's add and subtract the log of an arbitrary density $q(z)$:
\begin{align*}
\ln p(x\,|\,\theta) &= \ln p(x,z\,|\,\theta) - \ln p(z\,|\,x,\theta) - \ln q(z) + \ln q(z)\\
&= \ln p(x,z\,|\,\theta) - \ln q(z) - \ln\frac{p(z\,|\, x,\theta)}{q(z)}\,.
\end{align*}
Now, recall that $\ln u \leq u - 1$ and so $-\ln u \geq 1 - u$. Therefore
\begin{align*}
\ln p(x\,|\,\theta) &=\ln p(x,z\,|\,\theta) - \ln q(z) - \ln\frac{p(z\,|\, x,\theta)}{q(z)} \geq \ln p(x,z\,|\,\theta) - \ln q(z) + 1 - \frac{p(z\,|\, x,\theta)}{q(z)}\,.
\end{align*}
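The inequality $\ln u \leq u - 1$ is tight only at $u = 1$; a quick check on an arbitrary grid of positive values confirms it:

```python
import math

# Check ln(u) <= u - 1 on a grid of positive values (equality at u = 1).
for i in range(1, 201):
    u = i / 20.0
    assert math.log(u) <= u - 1.0 + 1e-12
```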
Since this is true for all $z$, it's also true in expectation under any distribution we want. $q(z)$ is a natural choice:
\begin{align*}
\ln p(x\,|\,\theta) &\geq \int q(z)\left(\ln p(x,z\,|\,\theta) - \ln q(z) + 1 - \frac{p(z\,|\, x,\theta)}{q(z)}\right)\, dz\\
&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz + 1 - \int q(z)\frac{p(z\,|\, x,\theta)}{q(z)}\,dz\\
&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz + 1 - \int p(z\,|\, x,\theta)\,dz\\
&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz + 1 - 1\\
&= \int q(z) \ln \frac{p(x,z\,|\,\theta)}{q(z)}\,dz\\
&= \text{ELBO}(q)
\end{align*}
No reference here to Jensen's inequality or K-L divergence. One caveat, however, is that the log inequality I used here is one way to prove non-negativity of K-L divergence. You could do this in a different order and it would look like directly taking advantage of the non-negativity of KL in the lower bound.
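As a final sanity check of the cancellation at the end (the $+1$ and the integral of the posterior), here's a hypothetical toy model (binary latent, made-up numbers) confirming that the expectation of the pointwise bound collapses exactly to the ELBO:

```python
import math

# Hypothetical toy model: binary latent z with unit-variance Gaussian likelihood.
prior = [0.3, 0.7]
means = [-1.0, 2.0]

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

x = 0.5
joint = [prior[z] * normal_pdf(x, means[z]) for z in range(2)]
marginal = sum(joint)
posterior = [j / marginal for j in joint]
q = [0.6, 0.4]

# Expectation under q of the pointwise lower bound...
pointwise = sum(q[z] * (math.log(joint[z]) - math.log(q[z])
                        + 1.0 - posterior[z] / q[z]) for z in range(2))
# ...equals the ELBO exactly: the +1 and the expectation of p(z|x)/q(z) cancel.
elbo = sum(q[z] * math.log(joint[z] / q[z]) for z in range(2))
assert abs(pointwise - elbo) < 1e-12
assert elbo <= math.log(marginal)  # and the bound itself holds
```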

A few years ago, I co-founded the Talking Machines podcast with Katy Gorman and we co-hosted it together for two seasons. After those two seasons, I decided it was time to move on and let someone else host it, but it was a highly rewarding experience. Since then, I've contemplated starting a YouTube channel, motivated in part by that outreach experience and also by the amazing people on the platform whom I've enjoyed watching and learning from. I've often felt there was a hole in YouTube's coverage of machine learning: there are zillions of deep learning "explainer" videos and of course many online lectures, but there's not a lot that 1) explains the broader technical context of machine learning, and 2) provides insight into the research process as a human endeavor.

Things get in the way, though, and it's hard to make that first video and feel good about what you've made. So despite having some nice equipment and footage in the can, I had never posted anything.

Then came COVID-19, and I had to take my 100+ person course online in a hurry. As soon as the undergrads got kicked off campus, I polled the class to understand what time zones they were in and how I could create a good course experience for as many of the students as possible. There were 11 time zones represented, so I realized that live lectures on Zoom were not going to work. Instead, I would need to pre-record videos. This was the forcing function I needed to finally start making and publishing videos.

I took the plunge and learned how to use manim, the system that Grant Sanderson created for 3Blue1Brown. It's a kind of crazy blend of Python, LaTeX, and various mathematical abstractions. I developed a workflow in which I'd create a sequence of scenes for a topic and then stitch them together with a voiceover in Final Cut Pro. Each video was a large amount of work, even though they're only a few minutes long. The response from the students was positive; the videos ended up being very dense, but the format allowed for good visualizations, and students could pause and rewind easily.

The course itself --- COS 302: Mathematics for Numerical Computing and Machine Learning --- aims to provide background on linear algebra and probability, along with some basic vector calculus and optimization, to prepare students for higher-level courses in machine learning, computer vision, graphics, etc. Essentially, the course tries to close a gap: computer science majors tend not to take much continuous mathematics, but several important subareas (such as ML) depend on exactly that kind of math. We covered most of the linear algebra in the first half of the course, before the COVID-19 pandemic started, so there aren't videos about that.

Here is the first one -- explaining why we care about differentiation in machine learning: