An Auxiliary Variable Trick for MCMC

Robert Nishihara Machine Learning, Probability, Recent work 1 Comment

[latexpage]I recently uploaded the paper “Parallel MCMC with Generalized Elliptical Slice Sampling” to the arXiv. I’d like to highlight one trick that we used, but first I’ll give some background. Markov chain Monte Carlo (MCMC) is a class of algorithms for generating samples from a specified probability distribution $\pi({\bf x})$ (in the continuous setting, the distribution is generally specified by its density function). Elliptical slice sampling is an MCMC algorithm that can be used to sample distributions of the form

\begin{equation}

\pi({\bf x}) \propto \mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma) L({\bf x}),

\end{equation}

where $\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)$ is a multivariate Gaussian prior with mean $\boldsymbol\mu$ and covariance matrix $\boldsymbol\Sigma$, and $L({\bf x})$ is a likelihood function. Suppose we want to generalize this algorithm to sample from arbitrary continuous probability distributions. We could simply factor the distribution $\pi({\bf x})$ as

\begin{equation}

\pi({\bf x}) = \mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma) \cdot \frac{\pi({\bf x})}{\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)},

\end{equation}

for any Gaussian $\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)$. In this setting, $\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)$ won’t be a prior, and $\frac{\pi({\bf x})}{\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)}$ won’t be a likelihood function, but this is enough to apply elliptical slice sampling.

Will this work well? Probably not. Elliptical slice sampling works because it makes use of the structure induced by the Gaussian prior. For an arbitrarily chosen Gaussian, this won’t be the case. On the other hand, if we choose a Gaussian that closely approximates $\pi({\bf x})$, then elliptical slice sampling will probably work well.

This reasoning motivates us to build a Gaussian approximation to $\pi({\bf x})$. The better our approximation, the better we can expect the sampling algorithm to work. Gaussians all have light tails (the density function diminishes exponentially  as ${\bf x} \to \infty$). So, in order to give ourselves more flexibility, we broaden our class of approximations to the class of multivariate $t$ distributions. A multivariate $t$ distribution with parameters $\nu$, ${\boldsymbol\mu}$, and ${\boldsymbol\Sigma}$ (referred to as the degrees of freedom, mean, and covariance parameters respectively) can be written (somewhat suggestively) as

\begin{equation}

\mathcal T_{\nu} ({\bf x}; \boldsymbol\mu, \boldsymbol\Sigma) = \int_0^{\infty} \textnormal{IG}(s;\tfrac{\nu}{2},\tfrac{\nu}{2}) \, \mathcal N({\bf x}, \boldsymbol\mu, s\boldsymbol\Sigma) \, \mathrm{d}s ,

\end{equation}

where $\textnormal{IG}(s;\alpha, \beta)$ is the density function of an inverse-gamma distribution. Now, using a multivariate $t$ approximation, we can write our original density function as

\begin{equation}

\pi({\bf x}) = \mathcal T_{\nu}({\bf x};\boldsymbol\mu,\boldsymbol\Sigma) \cdot \frac{\pi({\bf x})}{\mathcal T_{\nu}({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)} .

\end{equation}

Packaging up $\frac{\pi({\bf x})}{\mathcal T_{\nu}({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)}$ as a likelihood function $L({\bf x})$, we are left with the task of generating samples from the probability distribution given by

\begin{equation}

\pi({\bf x}) = \int_0^{\infty} L({\bf x}) \, \textnormal{IG}(s;\tfrac{\nu}{2},\tfrac{\nu}{2}) \, \mathcal N({\bf x}, \boldsymbol\mu, s\boldsymbol\Sigma) \, \mathrm{d}s .

\end{equation}

How can this be done? Here’s the trick. Define a joint distribution

\begin{equation}

p({\bf x},s) = L({\bf x}) \, \textnormal{IG}(s;\tfrac{\nu}{2},\tfrac{\nu}{2}) \, \mathcal N({\bf x}, \boldsymbol\mu, s\boldsymbol\Sigma) ,

\end{equation}

so we have

\begin{equation}

\pi({\bf x}) = \int_0^{\infty} p({\bf x},s) \, \mathrm{d}s .

\end{equation}

This tells us that the marginal distribution of ${\bf x}$ in $p({\bf x},s)$ is our target distribution $\pi({\bf x})$. Therefore, to generate samples from $\pi({\bf x})$, it suffices to generate samples from $p({\bf x},s)$ and then to disregard the $s$ coordinate. In order to generate samples from $\pi({\bf x})$, we can alternately sample the conditional distributions

\begin{equation}

p({\bf x} \,|\, s) \propto L({\bf x}) \,  \mathcal N({\bf x}, \boldsymbol\mu, s\boldsymbol\Sigma)

\end{equation}

and

\begin{equation}

p(s \,|\, {\bf x}) \propto \textnormal{IG}(s;\tfrac{\nu}{2},\tfrac{\nu}{2}) \, \mathcal N({\bf x}, \boldsymbol\mu, s\boldsymbol\Sigma) .

\end{equation}

The first conditional can be sampled with elliptical slice sampling. The second conditional is still an inverse gamma distribution (because the inverse gamma turns out to be a conjugate prior), and so it can be sampled exactly. This approach allows us to generalize elliptical slice sampling to arbitrary continuous distributions.

Comments 1

  1. Looks great. Are there certain kinds of continuous distributions that you think will work better than others with this method? What ultimately determines the performance you can expect from this scheme vs some other ‘advanced’ MCMC method (HMC, etc.)?

Leave a Reply