Optimal Spatial Prediction with Kriging

Suppose we are modeling a spatial process (for instance, the amount of rainfall around the world, the distribution of natural resources, or the population density of an endangered species). We’ve measured the latent function $Z$ at some locations ${\bf s}_1, \ldots, {\bf s}_N$, and we’d like to predict the function’s value at some new location ${\bf s}_0$. Kriging is a technique for interpolating our measurements to arbitrary locations. For an in-depth discussion, see Cressie and Wikle (2011). Here I derive Kriging in a simplified case.

I will assume that $Z$ is an intrinsically stationary process. In other words, there exists some semivariogram $\gamma({\bf h})$ such that

\[ \text{var}[Z({\bf s}+{\bf h}) - Z({\bf s})] = 2\gamma({\bf h}) . \]

Furthermore, I will assume that the process is isotropic, i.e., that $\gamma({\bf h})$ is a function only of $\|{\bf h}\|$. As Andy described here, the existence of a covariance function implies intrinsic stationarity. In addition, I will assume that the process has a constant mean, $\mathbb E[Z({\bf s})] = \mu$. We would like to estimate $Z({\bf s}_0)$ with a linear combination of our current observations; that is, our estimator will be of the form $\hat Z({\bf s}_0) = \sum_{n=1}^N \lambda_n Z({\bf s}_n)$.
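To make the setup concrete, here is a minimal NumPy sketch of ordinary kriging under these assumptions. The function names, the toy data, and the spherical semivariogram are illustrative choices of mine, not part of the derivation: the weights $\lambda_n$ solve a linear system in which a Lagrange multiplier enforces $\sum_n \lambda_n = 1$, the condition that makes the estimator unbiased under the constant-mean assumption.

```python
import numpy as np

def spherical_semivariogram(h, sill=1.0, r=1.0):
    # One common parametric family; any valid isotropic semivariogram works.
    h = np.asarray(h, dtype=float)
    g = sill * (1.5 * h / r - 0.5 * (h / r) ** 3)
    return np.where(h < r, g, sill)

def ordinary_kriging(S, z, s0, gamma=spherical_semivariogram):
    """Predict Z(s0) = sum_n lambda_n Z(s_n), where the weights solve a
    linear system built from the semivariogram; a Lagrange multiplier
    enforces sum(lambda) = 1, which makes the estimator unbiased."""
    N = len(S)
    D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)  # pairwise distances
    A = np.zeros((N + 1, N + 1))
    A[:N, :N] = gamma(D)
    A[:N, N] = 1.0  # Lagrange multiplier column
    A[N, :N] = 1.0  # constraint row: weights sum to one
    b = np.append(gamma(np.linalg.norm(S - s0, axis=-1)), 1.0)
    lam = np.linalg.solve(A, b)[:N]
    return lam @ z

# Toy usage with made-up observations:
S = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
z = np.array([1.0, 2.0, 0.5])
print(ordinary_kriging(S, z, np.array([0.25, 0.25])))
```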

Fisher information


I first heard about Fisher information in a statistics class, where it was given in terms of the following formulas, which I still find a bit mysterious and hard to reason about:

\begin{align*}
{\bf F}_\theta &= {\mathbb E}_x[\nabla_\theta \log p(x;\theta) (\nabla_\theta \log p(x;\theta))^T] \\
&= {\rm Cov}_x[ \nabla_\theta \log p(x;\theta) ] \\
&= {\mathbb E}_x[ -\nabla^2_\theta \log p(x; \theta) ].
\end{align*}

It was motivated in terms of computing confidence intervals for your maximum likelihood estimates. But this sounds a bit limited, especially in machine learning, where we’re trying to make predictions, not present someone with a set of parameters. It doesn’t really explain why Fisher information seems so ubiquitous in our field: natural gradient, Fisher kernels, Jeffreys priors, and so on.

This is how Fisher information is generally presented in machine learning textbooks. But I would choose a different starting point: Fisher information is the second derivative of KL divergence.
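To preview that claim in one line: expanding the KL divergence between nearby distributions in a perturbation ${\bf d}$ of the parameters, the zeroth- and first-order terms vanish, leaving the Fisher information as the Hessian:

\[ {\rm KL}\big( p(x;\theta) \,\big\|\, p(x;\theta + {\bf d}) \big) = \frac{1}{2} {\bf d}^T {\bf F}_\theta {\bf d} + O(\|{\bf d}\|^3). \]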

The Gumbel-Max Trick for Discrete Distributions

It often comes up in neural networks, generalized linear models, topic models and many other probabilistic models that one wishes to parameterize a discrete distribution in terms of an unconstrained vector of numbers, i.e., a vector that is not confined to the simplex, might be negative, etc. A very common way to address this is to use the “softmax” transformation:
\begin{align*}
\pi_k &= \frac{\exp\{x_k\}}{\sum_{k'=1}^K\exp\{x_{k'}\}}
\end{align*}
where the $x_k$ are unconstrained in $\mathbb{R}$, but the $\pi_k$ live on the simplex, i.e., $\pi_k \geq 0$ and $\sum_{k}\pi_k=1$. The $x_k$ parameterize a discrete distribution (not uniquely), and we can generate data by performing the softmax transformation and then doing the usual thing to draw from a discrete distribution. Interestingly, it turns out that there is an alternative way to arrive at such discrete samples that doesn’t actually require constructing the discrete distribution.
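Here is a minimal NumPy sketch of both routes (the function names are mine); the post explains why the perturb-with-Gumbel-noise-and-argmax version produces exactly the same distribution as the softmax:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_sample(x):
    # The usual route: construct pi explicitly, then draw from it.
    p = np.exp(x - x.max())
    p /= p.sum()
    return rng.choice(len(x), p=p)

def gumbel_max_sample(x):
    # The trick: perturb each x_k with independent Gumbel(0, 1) noise
    # and take the argmax -- no normalization, no explicit pi.
    g = -np.log(-np.log(rng.uniform(size=len(x))))
    return np.argmax(x + g)

x = np.array([0.5, -1.0, 2.0])  # arbitrary unconstrained scores
print(np.bincount([gumbel_max_sample(x) for _ in range(10000)]) / 10000)
```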


Pseudo-marginal MCMC

This post gives a brief introduction to the pseudo-marginal approach to MCMC. A very nice explanation, with examples, is available here. Frequently, we are given a density function $\pi({\bf x})$, with ${\bf x} \in \mathcal X$, and we use Markov chain Monte Carlo (MCMC) to generate samples from the corresponding probability distribution. For simplicity, suppose we are performing Metropolis-Hastings with a spherical (hence symmetric) proposal distribution. Then, we move from the current state ${\bf x}$ to a proposed state ${\bf x}'$ with probability $\min(1, \pi({\bf x}')/\pi({\bf x}))$.

But what if we cannot evaluate $\pi({\bf x})$ exactly? Such a situation might arise if we are given a joint density function $\pi({\bf x}, {\bf z})$, with ${\bf z} \in \mathcal Z$, and we must marginalize out ${\bf z}$ in order to compute $\pi({\bf x})$. In this situation, we may only be able to approximate

\[ \pi({\bf x}) = \int \pi({\bf x},{\bf z}) \, \mathrm d{\bf z}. \]
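To preview where this is going: the pseudo-marginal approach replaces $\pi({\bf x})$ in the acceptance ratio with an unbiased Monte Carlo estimate, and, crucially, keeps the current state’s estimate rather than recomputing it; remarkably, the resulting chain still targets $\pi$ exactly. A rough sketch, assuming a user-supplied vectorized joint density `joint(x, z)` and, for concreteness, a standard normal importance distribution over ${\bf z}$ (all names here are mine):

```python
import numpy as np

rng = np.random.default_rng(1)

def pi_hat(x, joint, M=200):
    # Unbiased importance-sampling estimate of pi(x) = int pi(x, z) dz,
    # drawing z from a standard normal proposal q(z).
    z = rng.standard_normal(M)
    q = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return np.mean(joint(x, z) / q)

def pseudo_marginal_mh(joint, x0, n_steps=5000, step=0.5):
    x, p = x0, pi_hat(x0, joint)
    chain = []
    for _ in range(n_steps):
        x_prop = x + step * rng.standard_normal()
        p_prop = pi_hat(x_prop, joint)    # fresh estimate at the proposal
        if rng.uniform() < min(1.0, p_prop / p):
            x, p = x_prop, p_prop         # keep the estimate with the state
        chain.append(x)
    return np.array(chain)
```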

Chernoff’s bound


If you have some randomness in your life, chances are that you want to try Chernoff’s bound. The most common way to understand randomness is a 2-step combo: find the average behavior, and then show that reality is unlikely to differ too much from the expectation (via Chernoff’s bound or one of its cousins).

My favorite form of Chernoff’s bound is the following: if $X_1, \ldots, X_n$ are independent binary random variables, $X=\sum_{i=1}^n X_i$, $\mu = E[X]$, and $\delta > 2e-1$, then

\[ P(X > (1+\delta)\mu) < 2^{-\delta \mu}. \]

Note that the $X_i$ are not necessarily identically distributed; they just have to be independent. In practice, we often care about significant deviations from the mean, so $\delta$ is typically larger than $2e-1$.
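As a quick sanity check with some arbitrary illustrative numbers, we can compare the exact binomial tail against the bound:

```python
from math import comb

# Arbitrary illustrative numbers: n coins with success probability p.
n, p, delta = 100, 0.05, 5.0   # delta > 2e - 1 (about 4.44)
mu = n * p                     # mu = E[X] = 5
t = int((1 + delta) * mu)      # threshold (1 + delta) * mu = 30

# Exact tail of X ~ Binomial(n, p) versus the Chernoff bound.
tail = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(t + 1, n + 1))
bound = 2.0 ** (-delta * mu)
print(f"exact  P(X > {t}) = {tail:.3e}")
print(f"bound  2^(-delta*mu) = {bound:.3e}")
```

The exact tail comes out far below the bound here; Chernoff’s bound trades tightness for its clean exponential form.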

In the standard applications, the stochastic system has size $N$, and a quantity of interest, $X$, has expectation $\mu=O(\log N)$ (say $\mu \geq \log_2 N$, so the step below goes through). The probability that $X$ deviates significantly from its mean is then inverse polynomial in the size of the whole system:

\[ P(X>(1+\delta)\mu) < \frac{1}{N^{\delta}}. \]

This is important since the total number of events of interest is polynomial in $N$ for many settings. By a simple union bound (taking $\delta$ large enough that the polynomial number of events times $N^{-\delta}$ is $O(1/N)$),

\[ P(\mbox{any } X>(1+\delta)\mu) = O\left(\frac{1}{N}\right). \]

So the chance of any deviation in any event is small!


Upcoming Conferences

We’re just about to hit conference season, so I thought I would post a public service announcement identifying various upcoming events for folks into machine learning and Bayesian modeling.

Variational Inference (part 1)


I will dedicate the next few posts to variational inference methods as a way to organize my own understanding – this first one will be pretty basic.

The goal of variational inference is to approximate an intractable probability distribution, $p$, with a tractable one, $q$, in a way that makes them as ‘close’ as possible. Let’s unpack that statement a bit.
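Concretely, ‘close’ is usually measured by the KL divergence (I’ll take this standard formulation as the starting point for the series): we fix a tractable family $\mathcal Q$ and solve

\[ q^* = \operatorname*{arg\,min}_{q \in \mathcal Q} \; {\rm KL}(q \,\|\, p), \qquad {\rm KL}(q \,\|\, p) = \int q(x) \log \frac{q(x)}{p(x)} \, \mathrm dx. \]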


Geometric means of distributions

Annealed importance sampling [1] is a widely used algorithm for inference in probabilistic models, as well as for computing partition functions. I’m not going to talk about AIS itself here, but rather one aspect of it: geometric means of probability distributions, and how they (mis-)behave.
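For context, the standard construction (my summary, not a quote from [1]): AIS bridges an initial distribution $p_0$ and the target $p_1$ through intermediate distributions given by geometric means of the densities,

\[ p_\beta(x) \propto p_0(x)^{1-\beta} \, p_1(x)^{\beta}, \qquad 0 \le \beta \le 1, \]

and how these intermediate distributions behave is exactly what’s at issue here.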


Stochastic memoization in Haskell

I thought it would be fun to quickly demonstrate how to write a stateless stochastic memoizer in Haskell. The idea behind stochastic memoization is to take some base generative process and transform it into another process which, on each sample, either draws from the original process or draws from the distribution of previously drawn samples. This allows you, for example, to introduce memory or smoothness into your sampling distribution [1, 2].
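The post builds this statelessly in Haskell; purely to pin down the idea first, here is a stateful Python sketch of one common scheme, a Chinese-restaurant-style urn in which `alpha` controls how often we draw fresh (the names and the particular reuse rule are my illustrative choices):

```python
import random

def stochastically_memoize(base_sample, alpha=1.0):
    """Wrap a sampler so each call either draws fresh from the base process
    (with probability alpha / (alpha + n), after n previous draws) or
    reuses a uniformly chosen earlier sample -- a simple Polya-urn scheme."""
    history = []
    def sample():
        if random.random() < alpha / (alpha + len(history)):
            x = base_sample()           # fresh draw from the base process
        else:
            x = random.choice(history)  # reuse a previous sample
        history.append(x)
        return x
    return sample

# A memoized uniform sampler soon starts repeating values:
draw = stochastically_memoize(random.random)
print([round(draw(), 3) for _ in range(10)])
```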