A Continuous Approach to Discrete MCMC

Robert Nishihara Machine Learning, Statistics Leave a Comment

[latexpage]Continuous problems are often simpler to solve than discrete problems. This is true in many optimization problems (for instance, linear programming versus integer linear programming). In the case of Markov chain Monte Carlo (MCMC), sampling continuous distributions has some advantages over sampling discrete distributions due to the availability of gradient information in the continuous case. The paper “Continuous Relaxations for Discrete Hamiltonian Monte Carlo” by Yichuan Zhang, Charles Sutton, Amos Storkey, and Zoubin Ghahramani explores the idea of performing inference in the discrete setting by deriving and sampling a related continuous distribution. Here I describe the approach taken in this paper.

We consider binary Markov random fields, i.e. discrete distributions over binary vectors ${\bf s} = (s_1, \ldots, s_N)$ of the form

\begin{align}

p({\bf s}) = \frac{1}{Z} \exp\left( {\bf a}^{\mathsf T} {\bf s} + \frac{1}{2} {\bf s}^{\mathsf T} {\bf W} {\bf s} \right),

\end{align}

where the vector ${\bf a}$ and the matrix ${\bf W}$ parametrize the distribution. The authors actually begin with discrete distributions described by more general undirected graphical models, but they ultimately coerce such distributions into the form of equation (1). Then, we apply the “Gaussian Integral Trick,” in which we define the auxiliary variable ${\bf x} \in \mathbb R^N$ to be Gaussian conditioned on ${\bf s}$. This produces a very manageable joint distribution $p({\bf x}, {\bf s})$. There are multiple ways to do this, but one way is to define something like

\begin{align}

p({\bf x} \, | \, {\bf s}) = \mathcal N({\bf x}; {\bf W}{\bf s}, {\bf W}) .

\end{align}

In practice, ${\bf W}$ may not be positive definite, so you have to add a diagonal matrix to it, but equation (2) contains the essence of the trick. Expanding and simplifying the joint distribution, we get

\begin{align}

p({\bf x},{\bf s}) = p({\bf x}\,|\,{\bf s})\, p({\bf s}) \propto \exp\left( -\frac{1}{2}  {\bf x}^{\mathsf T}{\bf W}^{-1}{\bf x} + {\bf s}^{\mathsf T} {\bf x} + {\bf a}^{\mathsf T} {\bf s} \right) .

\end{align}

Things simplify so that the cross terms from ${\bf s}^{\mathsf T}W{\bf s}$, which were present in equation (1), disappear in the joint distribution. Each component ${\bf s}_i$ is a binary variable, so summing out each ${\bf s}_i$, we are left with the continuous marginal

\begin{align*}

p({\bf x}) \propto \exp\left( – \frac{1}{2} {\bf x}^{\mathsf T}{\bf W}^{-1}{\bf x} \right) \prod_i \left(1+\exp\left( {\bf x}_i + {\bf a}_i \right) \right).

\end{align*}

Thus, we have produced a continuous distribution, $p({\bf x})$, from the discrete distribution $p({\bf s})$. These distributions are intimately linked. For instance, the distribution over ${\bf x}$ can be thought of as a mixture of $2^N$ Gaussians, one for each state ${\bf s}$. Now, we can use continuous MCMC methods to sample $p({\bf x})$. The authors use Hamiltonian Monte Carlo because of the ready availability of gradient information, but it is likely that other powerful methods for sampling continuous distributions can be exploited for this task.

Once we have generated samples ${\bf x}^{(1)}, \ldots, {\bf x}^{(M)}$ from $p({\bf x})$, they can be used to perform standard inference tasks like computing the marginals of $p({\bf s})$ or evaluating the normalizing constant $Z$ from equation (1). To illustrate the first example, observe that we can write the marginal distribution of the component ${\bf s}_i$ as

\begin{align*}

p({\bf s}_i) = \int p({\bf s}_i \,|\, {\bf x}) \, p({\bf x}) \, d{\bf x} \approx \frac{1}{M} \sum_{i=1}^M p({\bf s}_i \, |\, {\bf x}^{(i)}).

\end{align*}

The conditional $p({\bf s}_i \, |\, {\bf x}^{(i)})$ can be computed easily from equation (3). This approach bypasses the need to ever produce samples from $p({\bf s})$, so that we may focus our attention on methods in the continuous domain.

Leave a Reply