Suppose we are modeling a spatial process (for instance, the amount of rainfall around the world, the distribution of natural resources, or the population density of an endangered species). We’ve measured the latent function at some locations , and we’d like to predict the function’s value at some new location . Kriging is a technique for extrapolating our measurements to arbitrary locations. For an in-depth discussion, see Cressie and Wikle (2011). Here I derive Kriging in a simplified case. I will assume that is an intrinsically stationary process. In other words, there exists some semivariogram such that Furthermore, I will assume that the process is isotropic, (i.e. that is a function only of ). As Andy described here, the existence … Read More
I first heard about Fisher information in a statistics class, where it was given in terms of the following formulas, which I still find a bit mysterious and hard to reason about: It was motivated in terms of computing confidence intervals for your maximum likelihood estimates. But this sounds a bit limited, especially in machine learning, where we’re trying to make predictions, not present someone with a set of parameters. It doesn’t really explain why Fisher information seems so ubiquitous in our field: natural gradient, Fisher kernels, Jeffreys priors, and so on. This is how Fisher information is generally presented in machine learning textbooks. But I would choose a different starting point: Fisher information is the second derivative … Read More
It often comes up in neural networks, generalized linear models, topic models and many other probabilistic models that one wishes to parameterize a discrete distribution in terms of an unconstrained vector of numbers, i.e., a vector that is not confined to the simplex, might be negative, etc. A very common way to address this is to use the “softmax” transformation: where the are unconstrained in , but the live on the simplex, i.e., and . The parameterize a discrete distribution (not uniquely) and we can generate data by performing the softmax transformation and then doing the usual thing to draw from a discrete distribution. Interestingly, it turns out that there is an alternative way to arrive at such … Read More
This post gives a brief introduction to the pseudo-marginal approach to MCMC. A very nice explanation, with examples, is available here. Frequently, we are given a density function , with , and we use Markov chain Monte Carlo (MCMC) to generate samples from the corresponding probability distribution. For simplicity, suppose we are performing Metropolis-Hastings with a spherical proposal distribution. Then, we move from the current state to a proposed state with probability . But what if we cannot evaluate exactly? Such a situation might arise if we are given a joint density function , with , and we must marginalize out in order to compute . In this situation, we may only be able to approximate
If you have some randomness in your life, chances are that you want to try Chernoff’s bound. The most common way to understand randomness is a 2-step combo: find the average behavior, and show that the reality is unlikely to differ too much from the expectation (via Chernoff’s bound or its cousins). My favorite form of Chernoff’s bound is: for independent binary random variables, and , and , then Note that are not necessarily identically distributed, they just have to be independent. In practice, we often care about significant deviation from the mean, so is typically larger than . In the standard applications, the stochastic system has size and an event of interest, , has expectation . The … Read More
We’re just about to hit conference season, so I thought I would post a public service announcement identifying various upcoming events for folks into machine learning and Bayesian modeling. International Conference on Artificial Intelligence and Statistics (AISTATS) in Scottsdale, AZ: April 29 – May 1, 2013 New England Machine Learning Day at Microsoft Research New England: May 1, 2013 First International Conference on Learning Representations (ICLR) in Scottsdale, AZ: May 2-4, 2013 Conference on Bayesian Nonparametrics, Amsterdam: June 10-14, 2013 Conference on Learning Theory (COLT), Princeton, NJ: June 12-14, 2013 International Conference on Machine Learning (ICML), Atlanta, GA: June 16-21, 2013 Conference on Uncertainty in Artificial Intelligence (UAI), Bellvue, WA: July 11-15, 2013 Joint Statistical Meetings (JSM), Montreal: August 3-8, … Read More
I will dedicate the next few posts to variational inference methods as a way to organize my own understanding – this first one will be pretty basic. The goal of variational inference is to approximate an intractable probability distribution, , with a tractable one, , in a way that makes them as ‘close’ as possible. Let’s unpack that statement a bit.
Annealed importance sampling  is a widely used algorithm for inference in probabilistic models, as well as computing partition functions. I’m not going to talk about AIS itself here, but rather one aspect of it: geometric means of probability distributions, and how they (mis-)behave.
In my previous post, I wrote about why I find learning theory to be a worthwhile endeavor. In this post I want to discuss a few open/under-explored areas in learning theory that I think are particularly interesting. If you know of progress in these areas, please email me or post in the comments.
I thought it would be fun to quickly demonstrate how to write a stateless stochastic memoizer in Haskell. The idea behind stochastic memoization is to take some base generative process and transform it into another process which, on each sample, either draws from the original process or draws from the distribution of previously drawn samples. This allows you, for example, to introduce memory or smoothness into your sampling distribution [1, 2].