Pseudo-marginal MCMC

This post gives a brief introduction to the pseudo-marginal approach to MCMC. A very nice explanation, with examples, is available here. Frequently, we are given a density function $\pi(\mathbf{x})$, with $\mathbf{x} \in \mathcal X$, and we use Markov chain Monte Carlo (MCMC) to generate samples from the corresponding probability distribution. For simplicity, suppose we are performing Metropolis-Hastings with a spherical proposal distribution. Then, we move from the current state $\mathbf{x}$ to a proposed state $\mathbf{x}'$ with probability $\min(1, \pi(\mathbf{x}')/\pi(\mathbf{x}))$ .

But what if we cannot evaluate $\pi(\mathbf{x})$ exactly? Such a situation might arise if we are given a joint density function $\pi(\mathbf{x}, \mathbf{z})$, with $\mathbf{z} \in \mathcal Z$, and we must marginalize out $\mathbf{z}$ in order to compute $\pi(\mathbf{x})$. In this situation, we may only be able to approximate

$$ \pi(\mathbf{x}) = \int \pi(\mathbf{x},\mathbf{z}) \, \mathrm d\mathbf{z} ,$$

for instance with importance sampling. If we draw i.i.d. variables $\{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$ from a distribution with density function $q(\mathbf{z})$, then our importance sampling estimate will be

$$ \tilde{\pi}(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{\pi(\mathbf{x}, \mathbf{z}_n)}{q(\mathbf{z}_n)} . $$
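
As a concrete illustration, here is a minimal Python sketch of this estimator, computed in log space for numerical stability; the joint, the proposal, and the toy Gaussian check are placeholders chosen just for this example.

import numpy as np

def log_marginal_estimate(log_joint, log_q, sample_q, x, N=100, rng=None):
    # Importance-sampling estimate of log pi(x) = log int pi(x, z) dz,
    # using N draws z_1, ..., z_N from the proposal q.
    rng = np.random.default_rng() if rng is None else rng
    zs = [sample_q(rng) for _ in range(N)]
    log_w = np.array([log_joint(x, z) - log_q(z) for z in zs])
    m = log_w.max()                        # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_w - m)))

# Toy check: pi(x, z) = N(z; 0, 1) N(x; z, 1) with q = N(0, 1), so pi(x) = N(x; 0, 2).
log_norm = lambda v, mu, var: -0.5 * ((v - mu) ** 2 / var + np.log(2 * np.pi * var))
est = log_marginal_estimate(
    lambda x, z: log_norm(z, 0.0, 1.0) + log_norm(x, z, 1.0),
    lambda z: log_norm(z, 0.0, 1.0),
    lambda rng: rng.standard_normal(),
    x=0.5, N=5000)
print(est, log_norm(0.5, 0.0, 2.0))        # estimate vs. the exact log marginal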

What happens if we go ahead and run Metropolis-Hastings on the estimated density function $\tilde{\pi}(\mathbf{x})$? If we generate new variables $\{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$ once for each proposal, then we can view this as performing a random walk through $\mathcal X \times \mathcal Z^N$. To be clear, if we are at the current state $(\mathbf{x},\mathbf{z}) = (\mathbf{x}, \mathbf{z}_1, \ldots, \mathbf{z}_N)$, we propose the state $(\mathbf{x}',\mathbf{z}') = ({\bf x'}, \mathbf{z}_1', \ldots, \mathbf{z}_N')$ by drawing ${\bf x'}$ from the original spherical proposal distribution centered on $\mathbf{x}$ and by drawing each $\mathbf{z}_n'$ from $q(\mathbf{z})$. We then accept the proposed state with probability

Read More

Chernoff's bound

If you have some randomness in your life, chances are that you will want to try Chernoff's bound. The most common way to understand randomness is a two-step combo: find the average behavior, and then show that reality is unlikely to differ too much from that expectation (via Chernoff's bound or one of its cousins).

My favorite form of Chernoff's bound is the following: if $X_1, \ldots, X_n$ are independent binary random variables, $X=\sum_{i=1}^n X_i$, $\mu = E[X]$, and $\delta < 2e-1$, then

Read More

Upcoming Conferences

We're just about to hit conference season, so I thought I would post a public service announcement identifying various upcoming events for folks into machine learning and Bayesian modeling.

Read More

Variational Inference (part 1)

I will dedicate the next few posts to variational inference methods as a way to organize my own understanding - this first one will be pretty basic.

The goal of variational inference is to approximate an intractable probability distribution, $p$, with a tractable one, $q$, in a way that makes them as 'close' as possible. Let's unpack that statement a bit.

  1. Intractable $p$: a motivating example is the posterior distribution of a Bayesian model, i.e. given some observations $x = (x_1, x_2, \dots, x_n)$ and some model $p(x | \theta)$ parameterized by $\theta = (\theta_1, \dots, \theta_d)$, we often want to evaluate the distribution over parameters
    \begin{align*}
    p(\theta | x) = \frac{ p(x | \theta) p(\theta) }{ \int p(x | \theta) p(\theta) d \theta }
    \end{align*}
    For a lot of interesting models this distribution is intractable to deal with because of the integral in the denominator. We can evaluate the posterior up to a constant, but we can't compute the normalization constant. Applying variational inference to posterior distributions is sometimes called variational Bayes.
  2. Tractable $q$: a tractable distribution is one whose normalizing integral we can evaluate (and under which we can therefore take expectations). One way to achieve this is to make each $\theta_i$ independent:
    \begin{align*}
    q(\theta | \lambda) = \prod_{i=1}^d q_i(\theta_i | \lambda_i)
    \end{align*}
    for some marginal distributions $q_i$ parameterized by $\lambda_i$. This is called the 'mean field' approximation.
  3. We can make two distributions 'close' in the Kullback-Leibler sense:
    \begin{align*}
    KL(q || p) &\equiv \int q(\theta) \log \frac{ q(\theta) }{ p(\theta | x) } d\theta \\
    &= E_q \left(\log \frac{ q(\theta) }{ p(\theta | x) } \right)
    \end{align*}
    Notice that we chose $KL(q||p)$ and not $KL(p||q)$ because of the intractability - we're assuming you cannot evaluate $E_p(\cdot)$.

In order to minimize the KL divergence $KL(q||p)$, note the following decomposition
\begin{align*}
\log p(x) &= \log \frac{ p(\theta | x) p(x) }{ p(\theta | x)} \\
&= \log \frac{ q(\theta) }{ p(\theta | x) } + \log \frac{p(x, \theta)}{q(\theta)} \\
E_q(\log p(x)) &= E_q \left( \log \frac{ q(\theta) }{ p(\theta | x) } + \log \frac{p(x, \theta)}{q(\theta)} \right) \\
\log p(x) &= KL(q || p) + E_q \left( \log \frac{p(x, \theta)}{q(\theta)} \right)
\end{align*}

where $\log p(x)$ falls out of the expectation with respect to $q(\theta)$. This is convenient because $\log p(x)$ (the log evidence) is going to be constant with respect to the variational distribution $q$ and the model parameters $\theta$.

Because the KL divergence is always nonnegative, the second term, $E_q \left( \log \frac{p(x, \theta)}{q(\theta)} \right)$, is a lower bound on $\log p(x)$, also known as the evidence lower bound (ELBO). And since the two terms sum to the constant $\log p(x)$, minimizing the first term, $KL(q || p)$, is equivalent to maximizing the second.
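
As a quick sanity check of this bound, here is a small Python sketch (a toy example of my own, not from the post): a conjugate Gaussian model whose evidence is known in closed form, so we can confirm that the Monte Carlo ELBO never exceeds $\log p(x)$ and is tight when $q$ equals the exact posterior.

import numpy as np

rng = np.random.default_rng(0)
log_norm = lambda v, mu, var: -0.5 * ((v - mu) ** 2 / var + np.log(2 * np.pi * var))

# Toy conjugate model: p(theta) = N(0, 1), p(x | theta) = N(theta, 1), observed x = 1.3,
# so the evidence p(x) = N(x; 0, 2) is available in closed form.
x = 1.3
log_evidence = log_norm(x, 0.0, 2.0)

def elbo(m, s2, n_samples=100_000):
    # Monte Carlo estimate of E_q[log p(x, theta) - log q(theta)] for q(theta) = N(m, s2).
    theta = m + np.sqrt(s2) * rng.standard_normal(n_samples)
    log_joint = log_norm(theta, 0.0, 1.0) + log_norm(x, theta, 1.0)
    return np.mean(log_joint - log_norm(theta, m, s2))

print(log_evidence)        # exact log p(x)
print(elbo(x / 2, 0.5))    # q = exact posterior N(x/2, 1/2): equals log p(x)
print(elbo(0.0, 1.0))      # mismatched q: falls below log p(x) by KL(q || p)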

One way to optimize over the choice of $q(\theta) = \prod_i q_i(\theta_i)$ is to consider the ELBO with respect to a single factor $q_j(\theta_j)$, separating it from the expectation with respect to all other variables, $E_{-q_j}(\cdot)$ (note that $\int_{-\theta_j}$ denotes the integral over all $\theta_i$ with $i \neq j$):
\begin{align*}
ELBO(q_j) &= \int \prod_i q_i(\theta_i) \left( \log p(x, \theta) - \sum_k \log q_k(\theta_k) \right) d\theta \\
&= \int_{\theta_j} q_j(\theta_j) \int_{-\theta_j} \prod_{i \neq j} q_i(\theta_i) \left( \log p(x, \theta) - \sum_k \log q_k(\theta_k) \right) d\theta_{-j} \, d\theta_j \\
&= \int_{\theta_j} q_j(\theta_j) E_{-q_j} [ \log p(x, \theta) ] \, d\theta_j \\
&~~ - \int_{\theta_j} q_j(\theta_j) \log q_j(\theta_j) \, d\theta_j + const. \\
&= -KL(q_j || \tilde{q}_j) + const.
\end{align*}
where
\begin{align*}
\tilde{q}_j \propto \exp\left( E_{-q_j}[\log p(x,\theta)] \right)
\end{align*}
This motivates a particular updating scheme: iterate over the marginals $q_j$, maximizing the ELBO at each step with $\tilde{q}_j$, and repeat.

These statements remain pretty general. We haven't specified the functional form of $q_j(\theta_j)$, but it falls out of $E_{-q_j}[\log p(x,\theta)]$ for specific models (a good, simple example is the normal-gamma model from Murphy's MLAPP or Bishop's PRML). This form then defines the variational parameters $\lambda_i$, and the iterative algorithm provides a way to compute them.
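
To make this concrete, here is a rough Python sketch of the coordinate-ascent updates for the normal-gamma example mentioned above: a Gaussian with unknown mean $\mu$ and precision $\tau$, prior $p(\mu \mid \tau) = \mathcal{N}(\mu_0, (\lambda_0\tau)^{-1})$, $p(\tau) = \mathrm{Gamma}(a_0, b_0)$, and factorized $q(\mu)q(\tau)$. The hyperparameter values and data below are placeholders, and the updates follow the standard derivation (see Bishop's PRML).

import numpy as np

def cavi_normal_gamma(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iters=50):
    # Coordinate-ascent updates for q(mu) q(tau) approximating the posterior of a Gaussian
    # with unknown mean mu and precision tau; hyperparameter values are arbitrary placeholders.
    x = np.asarray(x)
    N, xbar = len(x), np.mean(x)
    a_n = a0 + (N + 1) / 2.0                  # fixed by the model; only b_n needs iterating
    b_n = b0
    mu_n, lam_n = xbar, 1.0
    for _ in range(n_iters):
        e_tau = a_n / b_n                     # E_q[tau]
        mu_n = (lam0 * mu0 + N * xbar) / (lam0 + N)        # update q(mu) = N(mu_n, 1/lam_n)
        lam_n = (lam0 + N) * e_tau
        e_sq = np.sum((x - mu_n) ** 2) + N / lam_n         # E_q[ sum_n (x_n - mu)^2 ]
        b_n = b0 + 0.5 * (e_sq + lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n))   # update q(tau)
    return mu_n, lam_n, a_n, b_n

data = np.random.default_rng(1).normal(2.0, 0.5, size=200)   # true mu = 2, true tau = 4
mu_n, lam_n, a_n, b_n = cavi_normal_gamma(data)
print(mu_n, a_n / b_n)                        # posterior means of mu and tau under q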

Some fun properties of the mean field approximation:

  • The objective is not guaranteed to be convex (Wainwright and Jordan)
  • The optimization procedure is pretty much out of the box for exponential family models (maybe a future blog post will be dedicated to exploring this)
  • This variational approximation underestimates uncertainty (a consequence of the direction of the KL divergence, $KL(q || p)$ as opposed to $KL(p || q)$).

This raises the question: can you do better with richer $q$ distributions? Yes! In future posts, I will take a look at some more complicated variational distributions and their performance.

References:

  1. Kevin Murphy. Machine Learning: A Probabilistic Perspective.
  2. Wainwright and Jordan. Graphical Models, Exponential Families, and Variational Inference.
Read More

Geometric means of distributions

Annealed importance sampling [1] is a widely used algorithm for inference in probabilistic models, as well as computing partition functions. I'm not going to talk about AIS itself here, but rather one aspect of it: geometric means of probability distributions, and how they (mis-)behave.

When we use AIS, we have some (unnormalized) probability distribution $p(x)$ we're interested in sampling from, such as the posterior of a probabilistic model given observations. We also have a distribution $q(x)$ which is easy to sample from, such as the prior. We need to choose a sequence of unnormalized distributions $f_t(x)$, indexed by a continuous parameter $t \in [0, 1]$, which smoothly interpolate between $q$ and $p$. (I use $f_t$ because later on, $p$ and $q$ will correspond to directed graphical models, but the intermediate distributions will not.)

A common choice is to take a weighted geometric mean of the two distributions. That is, $f_t(x) = q(x)^{1-t} p(x)^t$. This is motivated by examples like the following, where we gradually "anneal" from a uniform distribution to a multimodal one.
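
A quick way to get a feel for these intermediate distributions is to evaluate $f_t$ on a grid for a few values of $t$. The sketch below is a toy example of my own (a broad Gaussian for $q$ and a bimodal mixture for $p$), not anything from the AIS paper, but it shows the kind of geometric-mean path being discussed.

import numpy as np

# Toy endpoints (both unnormalized): a broad "prior-like" q and a bimodal "posterior-like" p.
xs = np.linspace(-6, 6, 1201)
dx = xs[1] - xs[0]
q = np.exp(-xs ** 2 / (2 * 3.0 ** 2))
p = np.exp(-(xs - 2) ** 2 / (2 * 0.3 ** 2)) + np.exp(-(xs + 2) ** 2 / (2 * 0.3 ** 2))

for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    f_t = q ** (1 - t) * p ** t                  # weighted geometric mean of the endpoints
    f_t /= f_t.sum() * dx                        # normalize on the grid for comparison
    print(t, f_t[np.abs(xs) < 1].sum() * dx)     # mass between the modes vanishes quickly as t grows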

Read More

Learning Theory: What Next?

In my previous post, I wrote about why I find learning theory to be a worthwhile endeavor. In this post I want to discuss a few open/under-explored areas in learning theory that I think are particularly interesting. If you know of progress in these areas, please email me or post in the comments.

1. Humans seem to operate well with very little data. What assumptions are we as humans making that differ from those in learning theory? These assumptions are probably something stronger than distribution independence (as assumed in traditional PAC learning and statistical learning theory) but much weaker than assuming a uniform distribution (which is, for example, also studied in the PAC framework). Defining such in-between, "natural" distributions is very challenging. Luckily, the computational complexity theory community deals with similar issues, so some very smart people have thought about this. Complexity theorists have defined various notions of "average case complexity" and the concept of "smoothed analysis," which attempts to eliminate "pathological cases" (essentially those only occurring when operating with infinite precision). A big triumph of smoothed analysis was showing that the simplex algorithm for solving linear programs runs in polynomial time under reasonable smoothing conditions. It would be interesting to develop similar results for learning, perhaps showing that it is possible to learn (up to some necessary error) certain classes of noisified concepts.

2. There is a lot of theory for supervised learning and quite a bit for online learning and density estimation. However, there isn't much theory for unsupervised/representation learning* (though see, e.g., Nina Balcan's thesis). Indeed, as Roger has discussed, it's hard to even formulate the problem we are solving in the unsupervised setting.

3. There are various results in Bayesian theory regarding consistency and convergence of Bayesian inference. There is also an area somewhat within statistical learning theory called PAC-Bayesian theory. PAC-Bayes theory provides some very tight generalization bounds, applicable in supervised and unsupervised learning, density estimation, and reinforcement learning. It's a beautiful area of work which integrates the Bayesian concept of a prior distribution over the hypothesis class, which neither computational learning theory nor standard statistical learning theory does. But while it has a Bayesian flavor, it's not strictly concerned with the classical conception of Bayesian inference. It would be interesting to have a "Bayesian learning theory", though it's possible such a thing is doable using existing PAC-Bayesian results.**

 

* Now the obligatory disclaimer regarding the subtlety of this distinction, with a link to Roger's post.

** I am not a PAC-Bayes guru by any means, so any corrections to this portion from an expert are appreciated.

Read More

Stochastic memoization in Haskell

I thought it would be fun to quickly demonstrate how to write a stateless stochastic memoizer in Haskell. The idea behind stochastic memoization is to take some base generative process and transform it into another process which, on each sample, either draws from the original process or draws from the distribution of previously drawn samples. This allows you, for example, to introduce memory or smoothness into your sampling distribution [1, 2].

The code:

import System.Random (RandomGen, randomR, next)

-- Produce an infinite stream of CRP draws whose table values come from the base sampler f.
crpMem :: RandomGen g => g -> Double -> (g -> a) -> [a]
crpMem g0 alpha f = go g0 []
  where
    go g customers = if newTable then makeNewTableAndGo else chooseOldTableAndGo
      where
        (r, g')  = randomR (0, 1) g                 -- get r in (0,1) and the next random gen
        n        = fromIntegral (length customers)
        newTable = r < alpha / (n + alpha)          -- CRP probability of seating at a new table
        makeNewTableAndGo =
            let b        = f g'                     -- fresh draw from the base distribution
                (_, g'') = next g'
            in  b : go g'' (b : customers)
        chooseOldTableAndGo =
            let (i, g'') = randomR (0, length customers - 1) g'
                c        = customers !! i           -- reuse a previously drawn sample
            in  c : go g'' (c : customers)

The function crpMem takes a random generator, a value alpha greater than 0, and a function of a single argument (a random generator); this function is the sampler for the base probability distribution. crpMem generates an infinite stream of samples from the associated Chinese Restaurant Process [3], whose table values are sampled from the base probability distribution. The function is relatively straightforward: the loop go takes a new generator and the history of samples (customers) and flips a coin to decide whether to draw uniformly from the past samples or to draw a new sample. The new sample is tacked onto a recursive call to go with a new random generator and an updated history of samples.

Citations
-------------

1) Goodman, N. D., Mansinghka, V. K., Roy, D., Bonawitz, K., and Tenenbaum, J. B. (2008). Church: A language for generative models. In Uncertainty in Artificial Intelligence, Helsinki, Finland. AUAI Press.

2) Johnson, M., Griffiths, T. L., and Goldwater, S. (2007a). Adaptor Grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems 19, Cambridge, MA. MIT Press.

3) http://en.wikipedia.org/wiki/Chinese_restaurant_process

Read More

Data compression and unsupervised learning, Part 2

This is a continuation of my last post about data compression and machine learning. In this post, I will start to address the question:

Does “good” compression generally lead to “good” unsupervised learning?

To answer this question, we need to start with another question:

What is a “good” compression algorithm?

Sadly, I will have to break this up even further, into two sub-questions:

1) What is a “good” lossless compression algorithm?
2) What is a "good" lossy compression algorithm?

-----------------------------------------------------------------

1) What is a "good" lossless compression algorithm?

The quick and tempting answer is one that achieves Shannon's entropy bound. However, this requires knowing the probability distribution of the data. In the types of situations I am discussing, we do not know the underlying distribution of the data. In a sense, if we did, then the problem (both unsupervised learning and compression) would already be solved.

If we do not know the probability distribution of the data, but have access to some data, then we say that a good lossless compression algorithm is one that works well empirically. This does not mean the algorithm that best compresses the available data -- that can lead to overfitting! (For example, each datum could be given an integer code, and then the message would be of size $\lg N$, where $N$ is the number of data points, regardless of the size of the data vectors. This compression is useless for generalization to unseen data.) I think the empirical evaluation tools useful here are the same as those used in machine learning: held-out validation sets, etc. My feeling is that, in most practical compression situations, trying to model the density of the examples directly is not likely to be useful (e.g., trying to come up with a normalized density over images). Thus, the answer to question (1) is: whatever algorithm empirically compresses well. Note that we suffer from problems similar to those in machine learning (like overfitting), but instead of arbitrarily choosing an objective function, we are given one naturally, namely the compression ratio.
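
As a small illustration of "works well empirically," here is a Python sketch (with made-up strings standing in for real train and held-out splits) that fits a character-frequency model on one chunk of text and measures the ideal code length, in bits per character, on both that chunk and a held-out one; the held-out number is the honest one.

import numpy as np
from collections import Counter

def bits_per_char(text, probs):
    # Average ideal code length, in bits per character, under a character model `probs`.
    return -np.mean([np.log2(probs[c]) for c in text])

# Placeholder strings standing in for real train / held-out splits.
train = "the quick brown fox jumps over the lazy dog " * 200
heldout = "pack my box with five dozen liquor jugs " * 200

counts = Counter(train)
alphabet = set(train) | set(heldout)
total = len(train) + len(alphabet)                      # add-one (Laplace) smoothing
probs = {c: (counts.get(c, 0) + 1) / total for c in alphabet}

print(bits_per_char(train, probs))      # optimistic: measured on the data the model was fit to
print(bits_per_char(heldout, probs))    # the honest, held-out estimate (usually larger)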

2) What is a “good” lossy compression algorithm?

Now things get more complicated. Again I am assuming we do not have access to the data distribution.

In the realm of lossy compression, some human-specified objective functions are needed. In particular, we need:
* a distortion function $D(x)$ that assigns a distortion to an output (given an input)
* a loss function $L(R(x), D(x))$ that assigns a score to each point on the rate-distortion curve (the trade-off between compression ratio and distortion)

Situation 1: We have $D(x)$ and $L(R,D)$
If we have a loss function over the rate-distortion space, then a good lossy compression algorithm is one that minimizes the total loss averaged over plausible inputs. In other words, the evaluation would be the same as in the lossless case, except that instead of trying to minimize bits we try to minimize the loss function that takes into account both size and distortion.

Situation 2: What if we don't have these loss functions?
If we do not have $D(x)$, we really cannot do an honest evaluation. Needless to say, I don't advocate blindly using Euclidean distance as the distortion metric. There are some proxies for $L(R,D)$, however. For example, we can fix $R$, the compression ratio, and try to minimize $D$, the distortion.
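
One concrete version of "fix $R$, minimize $D$" is vector quantization: quantize each data point to one of $k$ codewords, so the rate is fixed at $\log_2 k$ bits per point, and then try to minimize distortion. Below is a minimal sketch of my own (Lloyd's algorithm on synthetic 1-D data, implicitly assuming squared error as the distortion, which is exactly the kind of unstated assumption discussed above).

import numpy as np

def kmeans_quantize(x, k, n_iters=50, seed=0):
    # Lloyd's algorithm in 1-D: quantize each point to one of k codewords (log2(k) bits per point).
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(n_iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == j].mean() if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return centers, labels

x = np.random.default_rng(1).normal(size=10_000)
for k in [2, 4, 8, 16]:                                 # rate R = log2(k) bits per sample
    centers, labels = kmeans_quantize(x, k)
    distortion = np.mean((x - centers[labels]) ** 2)
    print(np.log2(k), distortion)                       # distortion falls as the rate grows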

Other approaches to finding a "good" lossy compression algorithm implicitly define these objective functions without stating them outright. I endorse clearly stating one's assumptions, typically in the form of objective functions. As David MacKay writes in his book, "you can't do inference - or data compression - without making assumptions." To reiterate, if you know the underlying distribution of the data, then you can attempt to construct an ideal lossless compression algorithm without making assumptions. But in any other (real) situation, assumptions (encoded in these loss functions) are required. Even in the lossless case, where you do not need to explicitly choose a loss function, you are still making assumptions - for example, the assumption that your validation set, training set, and unseen data come from the same distribution.

This post attempts to discuss the challenges of data compression and how to start thinking about what makes a "good" data compression algorithm. Some ties to machine learning have emerged, such as the issue of overfitting and the necessity for making assumptions (whether explicit or implicit). In the next post I will attempt to tie this back into similarities with unsupervised learning.

Read More

An Auxiliary Variable Trick for MCMC

I recently uploaded the paper "Parallel MCMC with Generalized Elliptical Slice Sampling" to the arXiv. I'd like to highlight one trick that we used, but first I'll give some background. Markov chain Monte Carlo (MCMC) is a class of algorithms for generating samples from a specified probability distribution $\pi({\bf x})$ (in the continuous setting, the distribution is generally specified by its density function). Elliptical slice sampling is an MCMC algorithm that can be used to sample distributions of the form

Read More

The Correct Birth/Death Jacobian for Mixture Models

Reversible jump Markov Chain Monte Carlo (RJMCMC) [1] is an extension of the Metropolis-Hastings algorithm that allows sampling from a distribution over models with potentially different numbers of parameters. In this post we are interested in determining the number of components to use when modeling data with a mixture model. The number of components corresponds to the dimension of the space we are walking through. The point of this post is to clear up a common error seen in the literature involving computing a Jacobian that arises in the algorithm.

Unfortunately, simply adding and removing dimensions does not produce a Markov chain with the correct stationary distribution. The key insight of RJMCMC [1] is dimension matching the models we are jumping between so that the number of dimensions is always the same. This means that we pad the model with fewer parameters with extra variables drawn from some distribution until the dimensions match, and perform any necessary transformations of these variables to create the model we are interested in. An important aspect of RJMCMC is that the moves must be reversible, that is, if the sampler can move from one model to another, then with positive probability it can make the reverse move.

The simplest type of RJMCMC moves for mixture models are birth and death moves. A birth move consists of adding a component and the corresponding parameters, and a death move involves removing a component and its parameters. A death move is the reverse of a birth move, and so the pair of moves defines a valid RJMCMC algorithm. I will focus on the case of a birth move; death moves can be derived similarly. See [1] for details of the algorithm, but note that in order to compute the Metropolis-Hastings acceptance ratio, the probabilities need to be computed on the same space. This requires transforming a probability distribution on one space into another and involves the Jacobian determinant of the parameters of one model with respect to those of the other.

For a birth move we introduce a new component and generate its parameters from the prior distribution. Say we are jumping from $K$ components to $K+1$ components. Complications arise when we generate the probability that this component is selected by observations ($w_k$ in the notation of [1]). First we generate a beta random variable for the new component, $w_*$, and rescale the existing component probabilities so that the new set of probabilities sums to $1$. The Jacobian of the transformation from the smaller model to the model with an extra component has $(1-w_*)$ on the diagonal for the $K$ original components and $1$'s on the diagonal for the new component probability and model parameters. Naively, it seems that the determinant is then $(1-w_*)^K$, and many papers and books have been published using this as the determinant. In fact, the original paper had this error [1], which was later noted in [2].

Since the original $w_k$'s formed a probability measure, the last component, $w_K$, is completely determined by all the others. Therefore, it does not show up as a variable that is differentiated with respect to in the Jacobian. The correct Jacobian is therefore $(1-w_*)^{K-1}$. This may seem obvious, but the mistake has been made enough in published papers and books that I think it's worth clearing up.
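
If you don't trust the counting argument, a numerical check is easy: write the birth move as a map from the free weight parameters of the smaller model, $(w_1, \ldots, w_{K-1}, w_*)$, to the free weights of the larger one (the new component's other parameters contribute $1$'s on the diagonal and don't change the determinant), and compute the Jacobian by finite differences. The Python sketch below, with arbitrary values of $K$ and $w_*$ chosen just for illustration, recovers $(1-w_*)^{K-1}$ rather than $(1-w_*)^K$.

import numpy as np

def birth_map(params):
    # Free parameters (w_1, ..., w_{K-1}, w_star) -> ((1 - w_star) w_1, ..., (1 - w_star) w_{K-1}, w_star).
    # w_K is determined by the sum-to-one constraint, so it is not a free variable.
    w, w_star = params[:-1], params[-1]
    return np.concatenate([(1 - w_star) * w, [w_star]])

def numerical_jacobian(f, x, eps=1e-6):
    cols = [(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(len(x))]
    return np.stack(cols, axis=1)

K, w_star = 5, 0.3
params = np.concatenate([np.full(K - 1, 1.0 / K), [w_star]])     # K-1 free weights plus w_star
det = np.linalg.det(numerical_jacobian(birth_map, params))
print(det, (1 - w_star) ** (K - 1), (1 - w_star) ** K)           # matches (1 - w_star)^(K-1), not ^K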

In practice this change does not seem to affect the results very much [2], though it could be significant for some models. Another issue with using the incorrect Jacobian is that, technically, the chain converges to the wrong stationary distribution. I think the moral of this post is that mistakes propagate when methods are used as presented without being completely understood. Don't believe everything you read, and if you're going to use a method, make sure you can derive all of the equations.$^*$

There is much more to RJMCMC including what to actually do with the Jacobian determinant described above. The underlying theory is interesting as well, in particular the general RJMCMC algorithm explains how to do Metropolis-Hastings on general measures. Additionally, the types of moves can be much more clever (and complicated), for instance splitting and merging components. See [1] and [3] for detailed treatments. Both [4] and [5] contain very readable presentations as well (and show the correct Jacobian for birth and death moves).

[1] Richardson and Green, On Bayesian Analysis of Mixtures with an Unknown Number of Components, 1997.
[2] Richardson and Green, Corrigendum: On Bayesian analysis of mixtures with an unknown number of components, 1998.
[3] Handbook of Markov Chain Monte Carlo, Section 1.17.
[4] Marin and Robert, Bayesian Core.
[5] Robert and Casella, Monte Carlo Statistical Methods.

$^*$ As another quick example of mistakes propagating, one of the authors of a paper that used the wrong Jacobian has another paper where he used the reciprocal of the Metropolis-Hastings acceptance probability for his MCMC algorithm. I contacted him about this error, and he pointed me to a paper on MCMC that presented the acceptance probability incorrectly. I tried to convince him that this was an error, but because of this one reference with a typo, I could not sway him.

Read More