An Auxiliary Variable Trick for MCMC

I recently uploaded the paper “Parallel MCMC with Generalized Elliptical Slice Sampling” to the arXiv. I’d like to highlight one trick that we used, but first I’ll give some background. Markov chain Monte Carlo (MCMC) is a class of algorithms for generating samples from a specified probability distribution $\pi({\bf x})$ (in the continuous setting, the distribution is generally specified by its density function). Elliptical slice sampling is an MCMC algorithm that can be used to sample distributions of the form

\begin{equation}
\pi({\bf x}) \propto \mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma) L({\bf x}),
\end{equation}

where $\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)$ is a multivariate Gaussian prior with mean $\boldsymbol\mu$ and covariance matrix $\boldsymbol\Sigma$, and $L({\bf x})$ is a likelihood function. Suppose we want to generalize this algorithm to sample from arbitrary continuous probability distributions. We could simply factor the distribution $\pi({\bf x})$ as

\begin{equation}
\pi({\bf x}) = \mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma) \cdot \frac{\pi({\bf x})}{\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)},
\end{equation}

Continue reading “An Auxiliary Variable Trick for MCMC”
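To make the factorization concrete, here is a rough numpy sketch of how it could be used: run standard elliptical slice sampling (Murray, Adams & MacKay, 2010) with the ratio $\pi({\bf x})/\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)$ playing the role of the likelihood $L({\bf x})$. The function names and the scipy helper below are my own choices for illustration, and this is only the naive factorization, not the paper’s full generalized scheme.

```python
import numpy as np
from scipy.stats import multivariate_normal

def elliptical_slice(x, log_lik, mu, Sigma, rng):
    """One elliptical slice sampling transition (Murray, Adams & MacKay, 2010)."""
    nu = rng.multivariate_normal(np.zeros_like(mu), Sigma)  # auxiliary draw, nu ~ N(0, Sigma)
    log_y = log_lik(x) + np.log(rng.uniform())              # log height of the slice
    theta = rng.uniform(0.0, 2.0 * np.pi)
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        x_prop = (x - mu) * np.cos(theta) + nu * np.sin(theta) + mu
        if log_lik(x_prop) > log_y:
            return x_prop                                   # accepted point on the ellipse
        if theta < 0.0:                                     # otherwise shrink the bracket
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

def pseudo_log_lik(x, log_pi, mu, Sigma):
    """Treat pi(x) / N(x; mu, Sigma) as the 'likelihood' for an arbitrary log density log_pi."""
    return log_pi(x) - multivariate_normal.logpdf(x, mean=mu, cov=Sigma)
```

With rng = np.random.default_rng(), one update of the chain for a target log density log_pi is then elliptical_slice(x, lambda z: pseudo_log_lik(z, log_pi, mu, Sigma), mu, Sigma, rng).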

A Geometric Intuition for Markov’s Inequality

As the title of the post suggests, this week I will discuss a geometric intuition for Markov’s inequality, which, for a nonnegative random variable $X$ and any $a > 0$, states
$$
P(X \geq a) \leq E[X]/a.
$$
This is a simple result in basic probability that still felt surprising every time I used it… until very recently. (Warning: basic measure-theoretic probability lies ahead. If this post is confusing and you are sufficiently motivated, these notes look like they provide sufficient background!)
Continue reading “A Geometric Intuition for Markov’s Inequality”
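For reference, the usual algebraic derivation, as opposed to the geometric picture the post develops, is a one-liner: for nonnegative $X$ and $a > 0$,
$$
E[X] \;\geq\; E\big[X \, \mathbf{1}\{X \geq a\}\big] \;\geq\; a \, E\big[\mathbf{1}\{X \geq a\}\big] = a \, P(X \geq a),
$$
and dividing through by $a$ gives the inequality.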

Correlation and Mutual Information

Mutual information quantifies the dependence between random variables. It is sometimes contrasted with linear correlation, since mutual information also captures nonlinear dependence. In this short note I will discuss the relationship between these quantities in the case of a bivariate Gaussian distribution, and I will explore two implications of that relationship.
Continue reading “Correlation and Mutual Information”
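For a bivariate Gaussian with correlation coefficient $\rho$ (using natural logarithms, so information is measured in nats), the relationship in question reduces to the standard closed form
$$
I(X;Y) = -\tfrac{1}{2}\log\left(1-\rho^2\right),
$$
so the mutual information is zero exactly when $\rho = 0$ and grows without bound as $|\rho| \to 1$.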

Aversion of Inversion

In the spirit of Ryan’s most recent post, I will discuss a fundamental snippet from numerical linear algebra that facilitates computation at no greater cost than not facilitating it. In our everyday lives, we often come across theoretical expressions that involve matrix inverses stapled to vectors, such as $\Omega^{-1}\mathbf{x}$ with $\Omega\in\mathbb{R}^{n\times n}$ and $\mathbf{x}\in\mathbb{R}^n$. When we proceed to code this up, it is very tempting to first compute $\Omega^{-1}$. Resist doing this! There are several reasons why there is no point in actually forming an explicit, tangible inverse.
Continue reading “Aversion of Inversion”
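As a rough numpy/scipy sketch (the matrix and sizes here are invented for illustration), solving the linear system $\Omega \mathbf{y} = \mathbf{x}$ directly, or via a Cholesky factorization when $\Omega$ is symmetric positive definite, gives the same result as forming $\Omega^{-1}$ explicitly, and does so faster and more stably:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n))
Omega = A @ A.T + n * np.eye(n)        # a symmetric positive definite matrix
x = rng.standard_normal(n)

# Tempting, but wasteful and numerically worse:
y_inv = np.linalg.inv(Omega) @ x

# Better: solve the linear system Omega y = x directly ...
y_solve = np.linalg.solve(Omega, x)

# ... or factor once and reuse the Cholesky factor for many right-hand sides.
c_and_lower = cho_factor(Omega)
y_chol = cho_solve(c_and_lower, x)

print(np.allclose(y_inv, y_solve), np.allclose(y_solve, y_chol))
```

The factor-once, solve-many pattern is the one that matters most in practice, for example when the same covariance matrix appears in many Gaussian density evaluations.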

Computing Log-Sum-Exp

This post is about a computational trick that everyone should know, but that doesn’t tend to be explicitly taught in machine learning courses.  Imagine that we have a set of $N$ values, $\{x_n\}^N_{n=1}$ and we want to compute the quantity

\begin{align}
z = \log \sum_{n=1}^N \exp\{x_n\}.
\end{align}

This comes up all the time when you want to parameterize a multinomial distribution using a softmax, e.g., when doing logistic regression and you have more than two unordered categories. If you want to compute the log likelihood, you’ll find such an expression due to the normalization constant. Computing this naively can be a recipe for disaster, due to underflow or overflow, depending on the scale of the $x_n$. Consider a simple example, with the vector $[0\;1\;0]$. This seems pretty straightforward, and we get about $1.55$. Now what about $[1000\;1001\;1000]$? This seems like it should also be straightforward, but instead our computer gives us back $\inf$. If we try $[-1000\;-999\;-1000]$, we get $-\inf$. What’s happening here? Well, in your typical 64-bit double, $\exp\{1000\}=\inf$ and $\exp\{-1000\}=0$ due to overflow and underflow, respectively. Even though the log would make the numbers reasonably scaled again with infinite precision, it doesn’t work on a real computer with typical floating point operations. What to do?
Continue reading “Computing Log-Sum-Exp”
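The standard remedy, which I take to be the trick the post goes on to describe, is to subtract the maximum before exponentiating, so that $z = m + \log \sum_{n=1}^N \exp\{x_n - m\}$ with $m = \max_n x_n$. A minimal numpy sketch (the function name is mine):

```python
import numpy as np

def logsumexp(x):
    """Compute log(sum(exp(x))) while avoiding overflow and underflow."""
    x = np.asarray(x, dtype=float)
    m = np.max(x)                            # shift so the largest exponent is 0
    return m + np.log(np.sum(np.exp(x - m)))

print(logsumexp([0, 1, 0]))                  # ~1.5514
print(logsumexp([1000, 1001, 1000]))         # ~1001.5514, no overflow
print(logsumexp([-1000, -999, -1000]))       # ~-998.4486, no underflow
```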