# Pseudo-marginal MCMC

This post gives a brief introduction to the pseudo-marginal approach to MCMC. A very nice explanation, with examples, is available here. Frequently, we are given a density function $\pi(\mathbf{x})$, with $\mathbf{x} \in \mathcal X$, and we use Markov chain Monte Carlo (MCMC) to generate samples from the corresponding probability distribution. For simplicity, suppose we are performing Metropolis-Hastings with a spherical proposal distribution. Then, we move from the current state $\mathbf{x}$ to a proposed state $\mathbf{x}'$ with probability $\min(1, \pi(\mathbf{x}')/\pi(\mathbf{x}))$ .

But what if we cannot evaluate $\pi(\mathbf{x})$ exactly? Such a situation might arise if we are given a joint density function $\pi(\mathbf{x}, \mathbf{z})$, with $\mathbf{z} \in \mathcal Z$, and we must marginalize out $\mathbf{z}$ in order to compute $\pi(\mathbf{x})$. In this situation, we may only be able to approximate

$$\pi(\mathbf{x}) = \int \pi(\mathbf{x},\mathbf{z}) \, \mathrm d\mathbf{z} ,$$

for instance with importance sampling. If we draw i.i.d. variables $\{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$ from the distribution the density function $q(\mathbf{z})$, then our importance sampling estimate will be

$$\tilde{\pi}(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{\pi(\mathbf{x}, \mathbf{z}_n)}{q(\mathbf{z}_n)} .$$

What happens if we go ahead and run Metropolis-Hastings on the estimated density function $\tilde{\pi}(\mathbf{x})$? If we generate new variables $\{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$ once for each proposal, then we can view this as performing a random walk through $\mathcal X \times \mathcal Z^N$. To be clear, if we are at the current state $(\mathbf{x},\mathbf{z}) = (\mathbf{x}, \mathbf{z}_1, \ldots, \mathbf{z}_N)$, we propose the state $(\mathbf{x}',\mathbf{z}') = ({\bf x'}, \mathbf{z}_1', \ldots, \mathbf{z}_N')$ by drawing ${\bf x'}$ from the original spherical proposal distribution centered on $\mathbf{x}$ and by drawing each $\mathbf{z}_n'$ from $q(\mathbf{z})$. We then accept the proposed state with probability

# Chernoff's bound

If you have some randomness in your life, chances are that you want to try Chernoff's bound. The most common way to understand randomness is a 2-step combo: find the average behavior, and show that the reality is unlikely to differ too much from the expectation (via Chernoff's bound or its cousins).

My favorite form of Chernoff's bound is: for $X_1, ..., X_n$ independent binary random variables, and $X=\sum_{i=1}^n X_i$, $\mu = E[X]$ and $\delta < 2e-1$, then

# Upcoming Conferences

We're just about to hit conference season, so I thought I would post a public service announcement identifying various upcoming events for folks into machine learning and Bayesian modeling.

# Variational Inference (part 1)

I will dedicate the next few posts to variational inference methods as a way to organize my own understanding - this first one will be pretty basic.

The goal of variational inference is to approximate an intractable probability distribution, $p$, with a tractable one, $q$, in a way that makes them as 'close' as possible. Let's unpack that statement a bit.

1. Intractable $p$: a motivating example is the posterior distribution of a Bayesian model, i.e. given some observations $x = (x_1, x_2, \dots, x_n)$ and some model $p(x | \theta)$ parameterized by $\theta = (\theta_1, \dots, \theta_d)$, we often want to evaluate the distribution over parameters
\begin{align*}
p(\theta | x) = \frac{ p(x | \theta) p(\theta) }{ \int p(x | \theta) p(\theta) d \theta }
\end{align*}
For a lot of interesting models this distribution is intractable to deal with because of the integral in the denominator. We can evaluate the posterior up to a constant, but we can't compute the normalization constant. Applying variational inference to posterior distributions is sometimes called variational Bayes. </p>
2. A tractable posterior distribution is one for which we can evaluate the integral (and therefore take expectations with it). One way to achieve this is by making each $\theta_i$ independent:
\begin{align*}
q(\theta | \lambda) = \prod_{i=1}^d q_i(\theta_i | \lambda_i)
\end{align*}
for some marginal distributions $q_i$ parameterized by $\lambda_i$. This is called the 'mean field' approximation. </p>
3. We can make two distributions 'close' in the Kullback-Leibler sense:
\begin{align*}
KL(q || p) &\equiv \int q(\theta) \log \frac{ q(\theta) }{ p(\theta | x) } d\theta \\
&= E_q \left(\log \frac{ q(\theta) }{ p(\theta | x) } \right)
\end{align*}
Notice that we chose $KL(q||p)$ and not $KL(p||q)$ because of the intractability - we're assuming you cannot evaluate $E_p(\cdot)$.

In order to minimize the KL divergence $KL(q||p)$, note the following decomposition
\begin{align*}
\log p(x) &= \log \frac{ p(\theta | x) p(x) }{ p(\theta | x)} \\
&= \log \frac{ q(\theta) }{ p(\theta | x) } + \log \frac{p(x, \theta)}{q(\theta)} \\
E_q(\log p(x)) &= E_q \left( \log \frac{ q(\theta) }{ p(\theta | x) } + \log \frac{p(x, \theta)}{q(\theta)} \right) \\
\log p(x) &= KL(q || p) + E_q \left( \log \frac{p(x, \theta)}{q(\theta)} \right)
\end{align*}

where $\log p(x)$ falls out of the expectation with respect to $q(\theta)$. This is convenient because $\log p(x)$ (the log evidence) is going to be constant with respect to the variational distribution $q$ and the model parameters $\theta$.

And because KL divergence is strictly nonnegative, the second term $E_q \left( \log \frac{p(x, \theta)}{q(\theta)} \right)$ is a lower bound for $\log p(x)$, also known as the evidence lower bound (ELBO). In order to minimize the first term, $KL(q || p)$, it suffices to maximize the second.

One way to optimize over the choice of $q(\theta) = \prod_i q_i(\theta_i)$ is to consider the ELBO with respect some $q_j(\theta_j)$, separating it from the expectation with respect to all other variables, $E_{-q_j}(\cdot)$ (note that $\int_{-\theta_j}$ is the integral over all $\theta_i$ with $i \neq j$):
\begin{align*}
ELBO(q_j) &= \int \prod_i q_i(\theta_i) \left( \log p(x, \theta) - \sum_k \log q_k(\theta_k) \right)\\
&= \int_{\theta_j} q_j(\theta_j) \int_{-\theta_j} \prod_{i \neq j} q_i(\theta_i) \times \\
&~~ \left( \log p(x, \theta) - \sum_k \log q_k(\theta_k) \right) \\
&= \int_{\theta_j} q_j(\theta_j) E_{-q_j} [ \log p(x, \theta) ] \\
&~~ - \int_{\theta_j} q_j(\theta_j) \log q_j(\theta_j) + const. \\
&= -KL(q_j || \tilde{q}_j) + const.
\end{align*}
where
\begin{align*}
\tilde{q_j} \propto \exp\left( E_{-q_j}[\log p(x,\theta)] \right)
\end{align*}
This motivates a particular updating scheme: iterate over the marginals $q_j$, maximizing the ELBO at each step with $\tilde{q}_j$, and repeat.

These statements remain pretty general. We haven't specified the functional form of $q_j(\theta_j)$, but it will fall out of the $E_{-q_j}[\log p(x,\theta)]$ for specific models (a good, simple example is the normal-gamma model from Murphy's MLAPP or Bishop's PRML). This form will then define the variational parameters $\lambda_i$, and the iterative algorithm will provide a way to compute them.

Some fun properties of the mean field approximation:

• The optimization function is not guaranteed to be convex (wainwright and jordan)
• The optimization procedure is pretty much out of the box for exponential family models (maybe a future blog post will be dedicated to exploring this)
• This variational approximation underestimates uncertainty (a consequence that pops out as a result of the KL divergence ordering, $KL(q || p)$ as opposed to $KL(p || q)$).

This begs the question, can you do better with richer $q$ distributions? Yes! In future posts, I will take a look at some more complicated variational distributions and their performance.

references:

1. Kevin Murphy. Machine learning: A probabilistic perspective
2. Wainwright and Jordan. Graphical Models, Exponential Families, and Variational Inference

# Geometric means of distributions

Annealed importance sampling [1] is a widely used algorithm for inference in probabilistic models, as well as computing partition functions. I'm not going to talk about AIS itself here, but rather one aspect of it: geometric means of probability distributions, and how they (mis-)behave.

When we use AIS, we have some (unnormalized) probability distribution $p(x)$ we're interested in sampling from, such as the posterior of a probabilistic model given observations. We also have a distribution $q(x)$ which is easy to sample from, such as the prior. We need to choose a sequence of unnormalized distributions $f_t(x)$ indexed a continuous parameter $t \in [0, 1]$ which smoothly interpolate between $q$ and $p$. (I use $f_t$ because later on, $p$ and $q$ will correspond to directed graphical models, but the intermediate distributions will not.)

A common choice is to take a weighted geometric mean of the two distributions. That is, $f_t(x) = q(x)^{1-t} p(x)^t$. This is motivated by examples like the following, where we gradually "anneal" from a uniform distribution to a multimodal one.

# Learning Theory: What Next?

In my previous post, I wrote about why I find learning theory to be a worthwhile endeavor. In this post I want to discuss a few open/under-explored areas in learning theory that I think are particularly interesting. If you know of progress in these areas, please email me or post in the comments.

1. Humans seem to operate well with very little data. What assumptions are we as humans making that differ from those in learning theory? These assumptions are probably something stronger than distribution-independent (e.g. in traditional PAC learning and statistical learning theory) but much weaker than assuming uniform distribution (which is, for example, studied in the PAC framework). Defining such in-between, "natural" distributions is very challenging. Luckily, those in the computational complexity theory community deal with these issues, so some very smart people have thought about this. Complexity theorists have defined various notions of "average case complexity" and the concept of "smoothed analysis," which attempts to eliminate "pathological cases" (essentially those only occurring when operating with infinite precision). A big triumph of smoothed analysis was showing that the simplex algorithm for solving linear programs is runs in polynomial time under reasonable smoothing conditions. It would be interesting to develop similar results for learning, perhaps showing that its possible to learn (up to necessary error), certain classes of noisified concepts

2. There is a lot of theory for supervised learning and quite a bit for online learning and density estimation. However, there isn't much theory for unsupervised/representation learning* (though see, e.g., Nina Balcan's thesis). Indeed, as Roger has discussed, it's hard to even formulate the problem we are solving in the unsupervised setting.

3. There are various results in Bayesian theory regarding consistency and convergence of Bayesian inference. There is also an area somewhat within statistical learning theory called PAC-Bayesian theory. PAC-Bayes theory provides some very tight generalization bounds, applicable in supervised and unsupervised learning, density estimation, and reinforcement learning. It's a beautiful area of work which integrates the Bayesian concept of having a prior distribution over the hypothesis class, which neither computational learning theory nor standard statistical learning do. But while it has a Bayesian flavor, it's not strictly concerned with the classical conception of Bayesian inference. It would be interesting to have a "Bayesian learning theory", though it's possible such a thing is doable using existing PAC-Bayesian results.**

* Now obligatory disclaimer regarding the subtly of this distinction, with a link to Roger's post.

** I am not a PAC-Bayes guru by any means, so any corrections to this portion from an expert are appreciated.

crpMem g0 alpha f = go	g0 []