## Pseudo-marginal MCMC

[latexpage]This post gives a brief introduction to the pseudo-marginal approach to MCMC. A very nice explanation, with examples, is available here. Frequently, we are given a density function $\pi({\bf x})$, with ${\bf x} \in \mathcal X$, and we use Markov chain Monte Carlo (MCMC) to generate samples from the corresponding probability distribution. For simplicity, suppose we are performing Metropolis-Hastings with a spherical proposal distribution. Then, we move from the current state ${\bf x}$ to a proposed state ${\bf x}’$ with probability $\min(1, \pi({\bf x}’)/\pi({\bf x}))$ .

But what if we cannot evaluate $\pi({\bf x})$ exactly? Such a situation might arise if we are given a joint density function $\pi({\bf x}, {\bf z})$, with ${\bf z} \in \mathcal Z$, and we must marginalize out ${\bf z}$ in order to compute $\pi({\bf x})$. In this situation, we may only be able to approximate

$\pi({\bf x}) = \int \pi({\bf x},{\bf z}) \, \mathrm d{\bf z} ,$ Continue reading “Pseudo-marginal MCMC”

## Chernoff’s bound

[latexpage]

If you have some randomness in your life, chances are that you want to try Chernoff’s bound. The most common way to understand randomness is a 2-step combo: find the average behavior, and show that the reality is unlikely to differ too much from the expectation (via Chernoff’s bound or its cousins).

My favorite form of Chernoff’s bound is: for $X_1, …, X_n$ independent binary random variables, and $X=\sum_{i=1}^n X_i$, $\mu = E[X]$ and $\delta > 2e-1$, then

$P(X > (1+\delta)\mu) < 2^{-\delta \mu}.$

Note that $X_i’s$ are not necessarily identically distributed, they just have to be independent. In practice, we often care about significant deviation from the mean, so $\delta$ is typically larger than $2e-1$.

In the standard applications, the stochastic system has size $N$ and an event of interest, $X$, has expectation $\mu=O(\log N)$. The probability that any one event deviates significantly is inverse polynomial in the size of the whole system

$P(X>(1+\delta)\mu) < \frac{1}{N^{\delta}}.$

This is important since the total number of events is polynomial in $N$ for many settings. By simple Union Bound,

$P(\mbox{any } X>(1+\delta)\mu) = O(\frac{1}{N}).$

So the chance of any deviation in any event is small!

## Upcoming Conferences

We’re just about to hit conference season, so I thought I would post a public service announcement identifying various upcoming events for folks into machine learning and Bayesian modeling.

## Variational Inference (part 1)

[latexpage]

I will dedicate the next few posts to variational inference methods as a way to organize my own understanding – this first one will be pretty basic.

The goal of variational inference is to approximate an intractable probability distribution, $p$, with a tractable one, $q$, in a way that makes them as ‘close’ as possible. Let’s unpack that statement a bit.

## Learning Theory: What Next?

In my previous post, I wrote about why I find learning theory to be a worthwhile endeavor. In this post I want to discuss a few open/under-explored areas in learning theory that I think are particularly interesting. If you know of progress in these areas, please email me or post in the comments. Continue reading “Learning Theory: What Next?”

I thought it would be fun to quickly demonstrate how to write a stateless stochastic memoizer in Haskell. The idea behind stochastic memoization is to take some base generative process and transform it into another process which, on each sample, either draws from the original process or draws from the distribution of previously drawn samples. This allows you, for example, to introduce memory or smoothness into your sampling distribution [1, 2]. Continue reading “Stochastic memoization in Haskell”

## Data compression and unsupervised learning, Part 2

[latexpage]

This is a continuation of my last post about data compression and machine learning. In this post, I will start to address the question:

Does “good” compression generally lead to “good” unsupervised learning?

What is a “good” compression algorithm?

## An Auxiliary Variable Trick for MCMC

[latexpage]I recently uploaded the paper “Parallel MCMC with Generalized Elliptical Slice Sampling” to the arXiv. I’d like to highlight one trick that we used, but first I’ll give some background. Markov chain Monte Carlo (MCMC) is a class of algorithms for generating samples from a specified probability distribution $\pi({\bf x})$ (in the continuous setting, the distribution is generally specified by its density function). Elliptical slice sampling is an MCMC algorithm that can be used to sample distributions of the form

\pi({\bf x}) \propto \mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma) L({\bf x}),

where $\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)$ is a multivariate Gaussian prior with mean $\boldsymbol\mu$ and covariance matrix $\boldsymbol\Sigma$, and $L({\bf x})$ is a likelihood function. Suppose we want to generalize this algorithm to sample from arbitrary continuous probability distributions. We could simply factor the distribution $\pi({\bf x})$ as

\pi({\bf x}) = \mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma) \cdot \frac{\pi({\bf x})}{\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)},

## The Correct Birth/Death Jacobian for Mixture Models

[latexpage]
Reversible jump Markov Chain Monte Carlo (RJMCMC) [1] is an extension of the Metropolis-Hastings algorithm that allows sampling from a distribution over models with potentially different numbers of parameters. In this post we are interested in determining the number of components to use when modeling data with a mixture model. The number of components corresponds to the dimension of the space we are walking through. The point of this post is to clear up a common error seen in the literature involving computing a Jacobian that arises in the algorithm.
Continue reading “The Correct Birth/Death Jacobian for Mixture Models”