The Central Limit Theorem

The proof and intuition presented here come from this excellent writeup by Yuval Filmus, which in turn draws upon ideas in this book by Fumio Hiai and Denes Petz.

Suppose that we have a sequence of real-valued random variables

\begin{equation}
X_1, X_2, \ldots .
\end{equation}

Define the random variable

\begin{equation}
A_N = \frac{X_1 + \cdots + X_N}{\sqrt{N}}
\end{equation}

to be a scaled sum of the first $N$ variables in the sequence. Now, we would like to make interesting statements about the sequence

\begin{equation}
A_1, A_2, \ldots .
\end{equation}
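
To see the theorem in action before proving anything, here is a quick simulation (a sketch, assuming the $X_i$ are i.i.d. with mean zero and unit variance, the setting in which $A_N$ converges to a standard normal):

% Empirical check of the CLT: scaled sums of centered, unit-variance
% uniforms should look Gaussian for large N.
N = 1000; trials = 10000;
X = sqrt(12) * (rand(N, trials) - 0.5);     % uniform on [-1/2, 1/2], scaled to variance 1
A = sum(X, 1) / sqrt(N);                    % one draw of A_N per column
histogram(A, 50, 'Normalization', 'pdf');   % should resemble the standard normal pdf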

Optimal Spatial Prediction with Kriging

Suppose we are modeling a spatial process (for instance, the amount of rainfall around the world, the distribution of natural resources, or the population density of an endangered species). We’ve measured the latent function $Z$ at some locations ${\bf s}_1, \ldots, {\bf s}_N$, and we’d like to predict the function’s value at some new location ${\bf s}_0$. Kriging is a technique for extrapolating our measurements to arbitrary locations. For an in-depth discussion, see Cressie and Wikle (2011). Here I derive kriging in a simplified case.

I will assume that $Z$ is an intrinsically stationary process. In other words, there exists some semivariogram $\gamma({\bf h})$ such that

\[ \text{var}[Z({\bf s}+{\bf h}) - Z({\bf s})] = 2\gamma({\bf h}) . \]
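
As a concrete example, one common parametric family (written here so that it depends only on $\|{\bf h}\|$, anticipating the isotropy assumption below) is the exponential semivariogram,

\[ \gamma({\bf h}) = \sigma^2 \left( 1 - e^{-\|{\bf h}\|/\rho} \right) , \]

where the sill $\sigma^2$ is the limiting value of $\gamma$ at large separations and $\rho$ sets the length scale of the spatial dependence.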

Furthermore, I will assume that the process is isotropic, i.e., that $\gamma({\bf h})$ is a function only of $\|{\bf h}\|$. As Andy described here, the existence of a covariance function implies intrinsic stationarity. In addition, I will assume that the process has a constant mean, $\mathbb E[Z({\bf s})] = \mu$. We would like to estimate $Z({\bf s}_0)$ with a linear combination of our current observations; that is, our estimator will be

\[ \hat Z({\bf s}_0) = \sum_{i=1}^N \lambda_i Z({\bf s}_i) \]

for some weights $\lambda_1, \ldots, \lambda_N$.

Fisher information


I first heard about Fisher information in a statistics class, where it was given in terms of the following formulas, which I still find a bit mysterious and hard to reason about:

\begin{align*}
{\bf F}_\theta &= {\mathbb E}_x[\nabla_\theta \log p(x;\theta) (\nabla_\theta \log p(x;\theta))^T] \\
&= {\rm Cov}_x[ \nabla_\theta \log p(x;\theta) ] \\
&= {\mathbb E}_x[ -\nabla^2_\theta \log p(x; \theta) ].
\end{align*}

It was motivated in terms of computing confidence intervals for your maximum likelihood estimates. But this sounds a bit limited, especially in machine learning, where we’re trying to make predictions, not present someone with a set of parameters. It doesn’t really explain why Fisher information seems so ubiquitous in our field: natural gradient, Fisher kernels, Jeffreys priors, and so on.

This is how Fisher information is generally presented in machine learning textbooks. But I would choose a different starting point: Fisher information is the second derivative of KL divergence.
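
To make that claim precise: expand the KL divergence between two infinitesimally separated distributions. The zeroth-order term vanishes because the divergence of a distribution from itself is zero, the first-order term vanishes because $\mathbb E_x[\nabla_\theta \log p(x;\theta)] = 0$, and the leading surviving term is quadratic, with the Fisher information as its Hessian:

\[ D_{\mathrm{KL}}\big( p(x;\theta) \,\big\|\, p(x;\theta + \mathrm{d}\theta) \big) \approx \tfrac{1}{2}\, \mathrm{d}\theta^{\mathsf T} {\bf F}_\theta \, \mathrm{d}\theta . \]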

Pseudo-marginal MCMC

This post gives a brief introduction to the pseudo-marginal approach to MCMC. A very nice explanation, with examples, is available here. Frequently, we are given a density function $\pi({\bf x})$, with ${\bf x} \in \mathcal X$, and we use Markov chain Monte Carlo (MCMC) to generate samples from the corresponding probability distribution. For simplicity, suppose we are performing Metropolis-Hastings with a spherical proposal distribution. Then, we move from the current state ${\bf x}$ to a proposed state ${\bf x}'$ with probability $\min(1, \pi({\bf x}')/\pi({\bf x}))$.

But what if we cannot evaluate $\pi({\bf x})$ exactly? Such a situation might arise if we are given a joint density function $\pi({\bf x}, {\bf z})$, with ${\bf z} \in \mathcal Z$, and we must marginalize out ${\bf z}$ in order to compute $\pi({\bf x})$. In this situation, we may only be able to approximate

\[ \pi({\bf x}) = \int \pi({\bf x},{\bf z}) \, \mathrm d{\bf z} . \]
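
The pseudo-marginal trick is to run Metropolis-Hastings with an unbiased estimate of $\pi({\bf x})$ plugged into the acceptance ratio, reusing the stored estimate for the current state; remarkably, the resulting chain still targets $\pi$ exactly. Here is a minimal sketch with a made-up joint density, $\pi(x, z) = \mathcal N(x; z, 1)\, \mathcal N(z; 0, 1)$, chosen so that the true marginal $\mathcal N(x; 0, 2)$ is known and the output can be checked:

% Pseudo-marginal Metropolis-Hastings with a toy joint density.
joint = @(x, z) exp(-0.5 * (x - z).^2 - 0.5 * z.^2) / (2*pi);  % pi(x, z)
M   = 50;                                  % Monte Carlo samples per estimate
est = @(x) mean(joint(x, randn(M, 1)));    % unbiased estimate of pi(x), z ~ N(0,1)

n_iter = 5000;
xs = zeros(n_iter, 1);
x = 0;
p_hat = est(x);                            % estimate at the current state
for t = 1:n_iter
    x_prop = x + randn();                  % spherical (here 1-D) proposal
    p_prop = est(x_prop);
    if rand() < p_prop / p_hat             % the usual ratio, with estimates
        x = x_prop;
        p_hat = p_prop;                    % keep the accepted estimate; never
    end                                    % re-estimate at the current state
    xs(t) = x;
end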

Variational Inference (part 1)


I will dedicate the next few posts to variational inference methods as a way to organize my own understanding – this first one will be pretty basic.

The goal of variational inference is to approximate an intractable probability distribution, $p$, with a tractable one, $q$, in a way that makes them as ‘close’ as possible. Let’s unpack that statement a bit.
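
The usual way to formalize ‘close’ is to fix a tractable family $\mathcal Q$ and minimize the KL divergence from $q$ to $p$:

\[ q^{*} = \mathop{\mathrm{arg\,min}}_{q \in \mathcal Q} \; \mathrm{KL}(q \,\|\, p) = \mathop{\mathrm{arg\,min}}_{q \in \mathcal Q} \; \mathbb E_{q}\left[ \log \frac{q(x)}{p(x)} \right] . \]

Note the direction of the divergence: the expectation is taken under $q$, which is what keeps the optimization tractable even when expectations under $p$ are not.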


The Alias Method: Efficient Sampling with Many Discrete Outcomes

When implementing algorithms for inference and learning with probabilistic models, it commonly comes up that one needs to sample from a discrete distribution. That is, from a multinomial distribution with parameter $\boldsymbol{\pi}\in\mathbb{R}^K$, such that $\pi_k\geq 0$ and $\sum_{k}\pi_k=1$. A somewhat more common occurrence is that we have a $\boldsymbol{\phi}\in\mathbb{R}^K$ where $\phi_k\geq 0$, but we don’t know the normalization constant. That is, our $\boldsymbol{\phi}$ is only proportional to the multinomial parameter $\boldsymbol{\pi}$. We want to rapidly generate a variate according to $\boldsymbol{\pi}$, given $\boldsymbol{\phi}$, something easily done with (Matlab) code such as this (paraphrased from Tom Minka‘s Lightspeed Toolbox):

cdf = cumsum(phi);                        % (unnormalized) cumulative sums of the weights
samp_k = sum(cdf < rand()*cdf(end)) + 1;  % first index whose cumulative sum exceeds a uniform draw

This is nice and simple, but you'll notice that it has $\mathcal{O}(K)$ time complexity for setup (computing the CDF) and $\mathcal{O}(K)$ time complexity per sample. The per-sample cost could be reduced to $\mathcal{O}(\log K)$ by binary searching the (sorted) CDF. It turns out, however, that we can do better and get $\mathcal{O}(1)$ per sample, while keeping the setup at $\mathcal{O}(K)$.
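
For a preview of how that works, here is a sketch of the table construction and the constant-time sampling step (one common variant, due to Vose; phi is the same unnormalized weight vector as above):

% Build the alias table: O(K) setup.
K = numel(phi);
p = phi(:) * K / sum(phi);          % scale weights so the average entry is 1
prob  = zeros(K, 1);
alias = (1:K)';
small = find(p < 1);                % bins with less than their fair share
large = find(p >= 1);               % bins with at least their fair share
while ~isempty(small) && ~isempty(large)
    s = small(end); small(end) = [];
    l = large(end); large(end) = [];
    prob(s)  = p(s);                % bin s keeps mass p(s) for outcome s...
    alias(s) = l;                   % ...and hands the remainder to outcome l
    p(l) = p(l) - (1 - p(s));       % l has donated mass to top up bin s
    if p(l) < 1
        small(end+1) = l;
    else
        large(end+1) = l;
    end
end
prob(small) = 1; prob(large) = 1;   % leftovers are (numerically) full bins

% Draw one sample: O(1), independent of K.
k = randi(K);                       % pick a bin uniformly at random
if rand() >= prob(k)
    k = alias(k);                   % biased coin: take the bin's alias instead
end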

Continue reading "The Alias Method: Efficient Sampling with Many Discrete Outcomes"

High-Dimensional Probability Estimation with Deep Density Models


Ryan Adams and I just uploaded to the arXiv our paper “High-Dimensional Probability Estimation with Deep Density Models”. In this work, we introduce the deep density model (DDM), a new approach for density estimation.

A Parallel Gamma Sampling Implementation


I don’t have a favorite distribution, but if I had to pick one, I’d say the gamma.  Why not the Gaussian? Because everyone loves the Gaussian! But when you want a prior distribution for the mean of your Poisson, or the variance of your Normal, who’s there to pick up the mess when the Gaussian lets you down? The gamma. When you’re trying to actually sample that Dirichlet that makes such a nice prior distribution for categorical distributions over your favorite distribution (how about that tongue twister), who’s there to help you?  You guessed it, the gamma. But if you want a distribution that you can sample millions of times during each iteration of your MCMC algorithm, well, now the Gaussian is looking pretty good, but let’s not give up hope on the gamma just yet.
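
The Dirichlet trick alluded to above fits in a few lines (a sketch using gamrnd from Matlab’s Statistics Toolbox, with made-up concentration parameters):

% A Dirichlet(alpha) draw is a vector of independent Gamma(alpha_k, 1)
% variates, normalized to sum to one.
alpha = [2, 5, 1];        % example concentration parameters
g = gamrnd(alpha, 1);     % independent Gamma(alpha_k, 1) draws
x = g / sum(g);           % x is one draw from Dirichlet(alpha)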


Exponential Families and Maximum Entropy

An exponential family parametrized by $\boldsymbol\theta \in \mathbb R^d$ is the set of probability distributions that can be expressed as

\[ p({\bf x} \,|\, \boldsymbol\theta) =\frac{1}{Z(\boldsymbol\theta)} h({\bf x}) \exp\left( \boldsymbol\theta^{\mathsf T}\boldsymbol\phi({\bf x}) \right) ,\]

for given functions $Z(\boldsymbol\theta)$ (the partition function), $h({\bf x})$, and $\boldsymbol\phi({\bf x})$ (the vector of sufficient statistics). Exponential families can be discrete or continuous, and examples include Gaussian distributions, Poisson distributions, and gamma distributions. Exponential families have a number of desirable properties. For instance, they have conjugate priors and they can summarize arbitrary amounts of data using a fixed-size vector of sufficient statistics. But in addition to their convenience, their use is theoretically justified.
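
As a concrete instance of the definition above, the Poisson distribution with rate $\lambda$ fits this template with natural parameter $\theta = \log \lambda$:

\[ p(x \,|\, \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{Z(\theta)}\, h(x) \exp\left( \theta\, \phi(x) \right), \qquad h(x) = \frac{1}{x!}, \quad \phi(x) = x, \quad Z(\theta) = \exp(e^\theta) . \]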

Correlation and Mutual Information


Mutual information quantifies the dependence between random variables. It is sometimes contrasted with linear correlation, since mutual information also captures nonlinear dependence. In this short note I will discuss the relationship between these quantities in the case of a bivariate Gaussian distribution, and I will explore two implications of that relationship.
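
For reference, the relationship in question: if $(X, Y)$ is bivariate Gaussian with correlation coefficient $\rho$, then

\[ I(X; Y) = -\tfrac{1}{2} \log\left( 1 - \rho^2 \right) , \]

so the mutual information is zero exactly when $\rho = 0$ and diverges as $|\rho| \to 1$.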