Testing MCMC code, part 1: unit tests

This post is taken from a tutorial I am writing with David Duvenaud.

Overview

When you write a nontrivial piece of software, how often do you get it completely correct on the first try?  When you implement a machine learning algorithm, how thorough are your tests?  If your answers are “rarely” and “not very,” stop and think about the implications.

There’s a large literature on testing the convergence of optimization algorithms and MCMC samplers, but I want to talk about a more basic problem here: how to test if your code correctly implements the mathematical specification of an algorithm. Continue reading “Testing MCMC code, part 1: unit tests”

JIT compilation in MATLAB

A few years ago MATLAB introduced a Just-In-Time (JIT) accelerator under the hood. Because the JIT acceleration runs behind the scenes, it is easy to miss (in fact, MathWorks seems to intentionally hide it so that users do not change their coding style, probably because the JIT accelerator is changed regularly). I just wanted to briefly mention what a JIT accelerator is and what it does in MATLAB. Continue reading “JIT compilation in MATLAB”

The Gumbel-Max Trick for Discrete Distributions

[latexpage]It often comes up in neural networks, generalized linear models, topic models and many other probabilistic models that one wishes to parameterize a discrete distribution in terms of an unconstrained vector of numbers, i.e., a vector that is not confined to the simplex, might be negative, etc. A very common way to address this is to use the “softmax” transformation:
\begin{align*}
\pi_k &= \frac{\exp\{x_k\}}{\sum_{k’=1}^K\exp\{x_{k’}\}}
\end{align*}
where the $x_k$ are unconstrained in $\mathbb{R}$, but the $\pi_k$ live on the simplex, i.e., $\pi_k \geq 0$ and $\sum_{k}\pi_k=1$. The $x_k$ parameterize a discrete distribution (not uniquely) and we can generate data by performing the softmax transformation and then doing the usual thing to draw from a discrete distribution. Interestingly, it turns out that there is an alternative way to arrive at such discrete samples, that doesn’t actually require constructing the discrete distribution.

Continue reading “The Gumbel-Max Trick for Discrete Distributions”

Pseudo-marginal MCMC

[latexpage]This post gives a brief introduction to the pseudo-marginal approach to MCMC. A very nice explanation, with examples, is available here. Frequently, we are given a density function $\pi({\bf x})$, with ${\bf x} \in \mathcal X$, and we use Markov chain Monte Carlo (MCMC) to generate samples from the corresponding probability distribution. For simplicity, suppose we are performing Metropolis-Hastings with a spherical proposal distribution. Then, we move from the current state ${\bf x}$ to a proposed state ${\bf x}’$ with probability $\min(1, \pi({\bf x}’)/\pi({\bf x}))$ .

But what if we cannot evaluate $\pi({\bf x})$ exactly? Such a situation might arise if we are given a joint density function $\pi({\bf x}, {\bf z})$, with ${\bf z} \in \mathcal Z$, and we must marginalize out ${\bf z}$ in order to compute $\pi({\bf x})$. In this situation, we may only be able to approximate

\[ \pi({\bf x}) = \int \pi({\bf x},{\bf z}) \, \mathrm d{\bf z} ,\] Continue reading “Pseudo-marginal MCMC”

The Alias Method: Efficient Sampling with Many Discrete Outcomes

[latexpage]
When implementing algorithms for inference and learning with probabilistic models, it commonly comes up that one needs to sample from a discrete distribution. That is, from a multinomial distribution with parameter $\boldsymbol{\pi}\in\mathbb{R}^K$, such that $\pi_k\geq 0$ and $\sum_{k}\pi_k=1$. A somewhat more common occurrence is that we have a $\boldsymbol{\phi}\in\mathbb{R}^K$ where $\phi_k\geq 0$, but we don’t know the normalization constant. That is, our $\boldsymbol{\phi}$ is only proportional to the multinomial parameter $\boldsymbol{\pi}$. We want to rapidly generate a variate according to $\boldsymbol{\pi}$, given $\boldsymbol{\pi}$, something easily done with (Matlab) code such as this (paraphrased from Tom Minka‘s Lightspeed Toolbox):

cdf = cumsum(phi);
samp_k = sum(cdf < rand()*cdf(end)) + 1;

This is nice and simple, but you'll notice that it has $\mathcal{O}(K)$ time complexity for setup (computing the CDF) and $\mathcal{O}(K)$ time complexity per sample. The per-sample complexity could probably be reduced to $\mathcal{O}(\log K)$ with a better data structure for finding the threshold. It turns out, however, that we can do better and get $\mathcal{O}(1)$ for the sampling, while still being $\mathcal{O}(K)$.

Continue reading "The Alias Method: Efficient Sampling with Many Discrete Outcomes"

A Parallel Gamma Sampling Implementation

[latexpage]

I don’t have a favorite distribution, but if I had to pick one, I’d say the gamma.  Why not the Gaussian? Because everyone loves the Gaussian! But when you want a prior distribution for the mean of your Poisson, or the variance of your Normal, who’s there to pick up the mess when the Gaussian lets you down? The gamma. When you’re trying to actually sample that Dirichlet that makes such a nice prior distribution for categorical distributions over your favorite distribution (how about that tongue twister), who’s there to help you?  You guessed it, the gamma. But if you want a distribution that you can sample millions of times during each iteration of your MCMC algorithm, well, now the Gaussian is looking pretty good, but let’s not give up hope on the gamma just yet.

Continue reading “A Parallel Gamma Sampling Implementation”

Getting above the fray with lifted inference

Hi, I’m Jon. In my series of posts, I’ll be writing about how we can use the modern Bayesian toolkit to efficiently make decisions, solve problems, and formulate plans (the providence of AI), rather than restrict ourselves to approximating posteriors (the providence of statistics and much of machine learning).

Here’s a simple example of how AI can help out machine learning. What was the first graphical model you were exposed to? There’s a good chance it was Pearl’s famous “Sprinkler, Rain, Wet grass” graphical model[1]. Continue reading “Getting above the fray with lifted inference”

What is the Computational Capacity of the Brain?

One big recent piece of news from across the Atlantic is that the European Commission is funding a brain simulation project to the tune of a billion Euros. Roughly, the objective is to simulate the entire human brain using a supercomputer. Needless to say, many people are skeptical and there are lots of reasons that one might think this project is unlikely to yield useful results. One criticism centers on whether even a supercomputer can simulate the complexity of the brain. A first step towards simulating a brain is thinking about how many FLOP/s (floating point operations per second) would be necessary to implement similar functionality in conventional computer hardware. Here I will discuss two back-of-the-envelope estimates of this computational capacity. (Don’t take these too seriously, they’re a little crazy and just shooting for orders of magnitude.)
Continue reading “What is the Computational Capacity of the Brain?”

The Natural Gradient

[latexpage]

A common activity in statistics and machine learning is optimization. For instance, finding maximum likelihood and maximum a posteriori estimates require maximizing the likilihood function and posterior distribution respectively. Another example, and the motivating example for this post, is using variational inference to approximate a posterior distribution. Suppose we are interested in a posterior distribution, $p$, that we cannot compute analytically. We will approximate $p$ with the variational distribution $q(\phi)$ that is parameterized by the variational parameters $\phi$. Variational inference then proceeds to minimize the KL divergence from $q$ to $p$, $KL(q||p)$. The dominant assumption in machine learning for the form of $q$ is a product distribution, that is $q = \prod_k q_k(\phi_k)$ (where we assume there are $K$ variational parameters). It can be shown that minimizing $KL(q||p)$ is equivalent to maximizing the evidence lower bound [2], denoted $L(\phi)$.

In this post we will consider optimizing the parameters of a probability distribution and will see that using the gradient as in basic calculus can follow suboptimal directions. This can cause slow convergence or convergence to inferior local modes in non-convex problems. Along the way we will introduce some concepts from differential geometry and derive more efficient gradient directions to follow. Continue reading “The Natural Gradient”

Aversion of Inversion

[latexpage]

In the spirit of Ryan’s most recent post, I will discuss a fundamental snippet from numerical linear algebra that facilitates computation for the same price of not facilitating it. In our everyday lives, we often come across theoretical expressions that involve matrix inverses stapled to vectors, such as $\Omega^{-1}\mathbf{x}$ with $\Omega\in\mathbb{R}^{n\times n}, \mathbf{x}\in\mathbb{R}^n$. When we proceed to code this up, it is very tempting to first compute $\Omega^{-1}$. Resist doing this! There are several points for why there is no point to actually find an explicit, tangible inverse. Continue reading “Aversion of Inversion”