I recently uploaded the paper “Parallel MCMC with Generalized Elliptical Slice Sampling” to the arXiv. I’d like to highlight one trick that we used, but first I’ll give some background. Markov chain Monte Carlo (MCMC) is a class of algorithms for generating samples from a specified probability distribution $\pi({\bf x})$ (in the continuous setting, the distribution is generally specified by its density function). Elliptical slice sampling is an MCMC algorithm that can be used to sample from distributions of the form \begin{equation} \pi({\bf x}) \propto \mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma) L({\bf x}), \end{equation} where $\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)$ is a multivariate Gaussian prior with mean $\boldsymbol\mu$ and covariance matrix $\boldsymbol\Sigma$, and $L({\bf x})$ is a likelihood function. Suppose we want to generalize this algorithm to sample …
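For readers who have not seen the basic (non-generalized) algorithm, here is a minimal sketch of one standard elliptical slice sampling update for a target of the form $\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma) L({\bf x})$. The function name, its arguments, and the one-dimensional toy likelihood are assumptions made for illustration; this is not code from the paper, and it does not include the generalization the post goes on to discuss.

```python
import numpy as np


def elliptical_slice_step(x, log_lik, mu, chol, rng):
    """One elliptical slice sampling update (illustrative sketch).

    x       : current state, 1-D array
    log_lik : function returning log L(x)
    mu      : prior mean
    chol    : Cholesky factor of the prior covariance (Sigma = chol @ chol.T)
    """
    nu = mu + chol @ rng.standard_normal(x.shape)      # auxiliary draw from the prior
    log_y = log_lik(x) + np.log(1.0 - rng.uniform())   # log slice height, u in (0, 1]
    theta = rng.uniform(0.0, 2.0 * np.pi)              # initial angle on the ellipse
    lo, hi = theta - 2.0 * np.pi, theta                # bracket that always contains 0
    while True:
        proposal = mu + (x - mu) * np.cos(theta) + (nu - mu) * np.sin(theta)
        if log_lik(proposal) > log_y:
            return proposal                            # proposal lies on the slice
        # Shrink the bracket towards theta = 0 (the current state) and retry.
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)


# Toy check (assumed example): prior N(0, 1), likelihood N(1; x, 1),
# so the posterior is N(0.5, 0.5).
rng = np.random.default_rng(0)
mu, chol = np.zeros(1), np.eye(1)
log_lik = lambda x: -0.5 * np.sum((x - 1.0) ** 2)

x = np.zeros(1)
samples = []
for _ in range(5000):
    x = elliptical_slice_step(x, log_lik, mu, chol, rng)
    samples.append(x[0])
samples = np.array(samples)
print(samples.mean(), samples.var())
```

Note that the update has no step size or other tuning parameter: the bracket on the angle shrinks towards the current state until a point on the slice is found.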

## A Geometric Intuition for Markov’s Inequality

As the title of the post suggests, this week I will discuss a geometric intuition for Markov’s inequality, which states that for a nonnegative random variable $X$ and any $a > 0$, $$ P(X \geq a) \leq E[X]/a. $$ This is a simple result in basic probability that still felt surprising every time I used it… until very recently. (Warning: Basic measure theoretic probability lies ahead. These notes look like they provide sufficient background if this post is confusing and you are sufficiently motivated!)
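As a quick sanity check of the inequality itself (not of the geometric argument), the sketch below compares the empirical tail probability with the bound $E[X]/a$ for a few thresholds. The exponential distribution, the sample size, and the thresholds are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # nonnegative samples, E[X] = 2

thresholds = (1.0, 4.0, 10.0)
for a in thresholds:
    tail = np.mean(x >= a)   # empirical P(X >= a)
    bound = x.mean() / a     # Markov bound using the sample mean
    print(f"a = {a:5.1f}:  P(X >= a) ~ {tail:.4f}  <=  E[X]/a ~ {bound:.4f}")
```

The bound is loose for this distribution, and for small $a$ it can exceed 1, in which case it says nothing.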

## Correlation and Mutual Information

Mutual information quantifies the dependence between random variables. It is sometimes contrasted with linear correlation, since mutual information also captures nonlinear dependence. In this short note I will discuss the relationship between these quantities in the case of a bivariate Gaussian distribution, and I will explore two implications of that relationship.
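For reference, the standard closed form for a bivariate Gaussian with correlation coefficient $\rho$ is $I(X;Y) = -\tfrac{1}{2}\log(1-\rho^2)$ (in nats). The excerpt above does not spell this out, so treat the sketch below as a textbook identity rather than as the note's own derivation; the function name is made up for illustration.

```python
import numpy as np


def gaussian_mutual_information(rho):
    """Mutual information (in nats) of a bivariate Gaussian with correlation rho."""
    rho = np.asarray(rho, dtype=float)
    # log1p(-rho**2) computes log(1 - rho**2) accurately when rho is near zero.
    return -0.5 * np.log1p(-rho ** 2)


for rho in (0.0, 0.5, 0.9, 0.99):
    print(f"rho = {rho:4.2f}  ->  I = {gaussian_mutual_information(rho):.4f} nats")
```

The value is zero when $\rho = 0$, depends only on $|\rho|$, and grows without bound as $|\rho| \to 1$.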

## Aversion of Inversion

In the spirit of Ryan’s most recent post, I will discuss a fundamental snippet from numerical linear algebra that facilitates computation for the same price as not facilitating it. In our everyday lives, we often come across theoretical expressions that involve matrix inverses stapled to vectors, such as $\Omega^{-1}\mathbf{x}$ with $\Omega\in\mathbb{R}^{n\times n}, \mathbf{x}\in\mathbb{R}^n$. When we proceed to code this up, it is very tempting to first compute $\Omega^{-1}$. Resist doing this! There are several points as to why there is no point in actually finding an explicit, tangible inverse.
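A minimal NumPy sketch of the advice: compute $\Omega^{-1}\mathbf{x}$ by solving the linear system $\Omega\mathbf{y} = \mathbf{x}$ instead of forming $\Omega^{-1}$. The random, well-conditioned symmetric positive definite test matrix and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
omega = A @ A.T + n * np.eye(n)   # symmetric positive definite, well conditioned
x = rng.standard_normal(n)

# Tempting: build the explicit inverse, then multiply.
y_inv = np.linalg.inv(omega) @ x

# Preferred: solve omega @ y = x directly, with no explicit inverse.
y_solve = np.linalg.solve(omega, x)

print(np.allclose(y_inv, y_solve))           # both agree on this easy problem
print(np.linalg.norm(omega @ y_solve - x))   # residual of the solve
```

On a well-conditioned matrix like this one the two results agree; the differences in cost and accuracy show up for large or ill-conditioned $\Omega$. When $\Omega$ is known to be symmetric positive definite, a Cholesky-based solve is a common further refinement.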

## Computing Log-Sum-Exp

This post is about a computational trick that everyone should know, but that doesn’t tend to be explicitly taught in machine learning courses. Imagine that we have a set of $N$ values, $\{x_n\}^N_{n=1}$, and we want to compute the quantity \begin{align} z = \log \sum_{n=1}^N \exp\{x_n\}. \end{align} This comes up all the time when you want to parameterize a multinomial distribution using a softmax, e.g., when doing logistic regression and you have more than two unordered categories. If you want to compute the log likelihood, you’ll find such an expression due to the normalization constant. Computing this naively can be a recipe for disaster, due to underflow or overflow, depending on the scale of the $x_n$. Consider a simple example, …
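Since the excerpt is cut off, here is a separate illustration (not the post's own example) of the overflow problem together with the usual remedy of subtracting the maximum, which relies on the identity $\log \sum_n \exp\{x_n\} = m + \log \sum_n \exp\{x_n - m\}$ with $m = \max_n x_n$. The function name and the test values are assumptions.

```python
import numpy as np


def logsumexp(x):
    """Stable log(sum(exp(x))) via the max-shift identity (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    m = np.max(x)
    if not np.isfinite(m):
        # All entries are -inf, or some entry is +inf or nan: return m as is.
        return m
    # After the shift the largest exponent is exactly 0, so exp cannot overflow.
    return m + np.log(np.sum(np.exp(x - m)))


x = np.array([1000.0, 1000.0])
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(x)))   # exp(1000) overflows to inf
stable = logsumexp(x)                   # 1000 + log(2)
print(naive, stable)
```

The same shift handles underflow: for values around $-1000$ the naive version returns $-\infty$, while the shifted version returns a finite answer. SciPy ships a library version of this as `scipy.special.logsumexp`.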