The proof and intuition presented here come from this excellent writeup by Yuval Filmus, which in turn draws upon ideas in this book by Fumio Hiai and Denes Petz. Suppose that we have a sequence of real-valued random variables $X_1, X_2, \ldots$. Define the random variable $A_N$ to be a scaled sum of the first $N$ variables in the sequence. Now, we would like to make interesting statements about the sequence $A_1, A_2, \ldots$.
Suppose we are modeling a spatial process (for instance, the amount of rainfall around the world, the distribution of natural resources, or the population density of an endangered species). We’ve measured the latent function at some locations, and we’d like to predict the function’s value at some new location. Kriging is a technique for extrapolating our measurements to arbitrary locations. For an in-depth discussion, see Cressie and Wikle (2011). Here I derive Kriging in a simplified case. I will assume that the latent function is an intrinsically stationary process. In other words, there exists some semivariogram $\gamma$ such that half the variance of the difference between the function’s values at two locations depends only on their separation $h$ and equals $\gamma(h)$. Furthermore, I will assume that the process is isotropic (i.e., that $\gamma$ is a function only of the distance $\|h\|$). As Andy described here, the existence …
I first heard about Fisher information in a statistics class, where it was given in terms of the following formulas, which I still find a bit mysterious and hard to reason about: It was motivated in terms of computing confidence intervals for your maximum likelihood estimates. But this sounds a bit limited, especially in machine learning, where we’re trying to make predictions, not present someone with a set of parameters. It doesn’t really explain why Fisher information seems so ubiquitous in our field: natural gradient, Fisher kernels, Jeffreys priors, and so on. This is how Fisher information is generally presented in machine learning textbooks. But I would choose a different starting point: Fisher information is the second derivative …
This post gives a brief introduction to the pseudo-marginal approach to MCMC. A very nice explanation, with examples, is available here. Frequently, we are given a density function , with , and we use Markov chain Monte Carlo (MCMC) to generate samples from the corresponding probability distribution. For simplicity, suppose we are performing Metropolis-Hastings with a spherical proposal distribution. Then, we move from the current state to a proposed state with probability . But what if we cannot evaluate exactly? Such a situation might arise if we are given a joint density function , with , and we must marginalize out in order to compute . In this situation, we may only be able to approximate
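As a rough illustration of the situation described above, here is a minimal Python sketch of pseudo-marginal Metropolis-Hastings on a toy problem. Everything specific in it is an assumption made for the example and not taken from the post: the toy joint density (z ~ N(0, 1), x | z ~ N(z, 1), so the exact marginal of x is N(0, 2)), the function names, the number of auxiliary draws, and the step size. The one essential detail is that the noisy density estimate for the current state is stored and reused instead of being recomputed at every step.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def estimate_marginal(x, rng, num_aux=30):
    """Unbiased Monte Carlo estimate of p(x), the integral of p(x, z) over z.

    Toy joint (hypothetical, for illustration only): z ~ N(0, 1) and
    x | z ~ N(z, 1). Averaging p(x | z) over draws of z is unbiased for p(x).
    """
    total = 0.0
    for _ in range(num_aux):
        z = rng.gauss(0.0, 1.0)
        total += normal_pdf(x, z, 1.0)
    return total / num_aux


def pseudo_marginal_mh(num_steps, rng, step_size=1.5):
    """Metropolis-Hastings with a symmetric Gaussian proposal, using noisy
    but unbiased density estimates in place of the exact marginal."""
    x = 0.0
    p_hat = estimate_marginal(x, rng)  # estimate attached to the current state
    samples = []
    for _ in range(num_steps):
        x_new = x + rng.gauss(0.0, step_size)
        p_hat_new = estimate_marginal(x_new, rng)
        # Accept with probability min(1, p_hat_new / p_hat).
        if rng.random() * p_hat < p_hat_new:
            # Keep the estimate together with the state; do not refresh it later.
            x, p_hat = x_new, p_hat_new
        samples.append(x)
    return samples
```

On this toy problem the sample mean and variance of a long chain should land near 0 and 2, which gives a quick sanity check on the sampler.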
I will dedicate the next few posts to variational inference methods as a way to organize my own understanding – this first one will be pretty basic. The goal of variational inference is to approximate an intractable probability distribution, $p$, with a tractable one, $q$, in a way that makes them as ‘close’ as possible. Let’s unpack that statement a bit.
When implementing algorithms for inference and learning with probabilistic models, it commonly comes up that one needs to sample from a discrete distribution. That is, from a multinomial distribution with parameter $\pi$, such that $\pi_k \geq 0$ and $\sum_k \pi_k = 1$. A somewhat more common occurrence is that we have a $\phi$ where $\phi_k \geq 0$, but we don’t know the normalization constant. That is, our $\phi$ is only proportional to the multinomial parameter $\pi$. We want to rapidly generate a variate according to $\pi$, given $\phi$, something easily done with (Matlab) code such as this (paraphrased from Tom Minka’s Lightspeed Toolbox):

    cdf = cumsum(phi);
    samp_k = sum(cdf < rand()*cdf(end)) + 1;

This is nice and simple, but you’ll notice that it has linear time complexity for setup (computing the …
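For readers who do not use Matlab, the same inverse-CDF idea can be sketched in Python with only the standard library. The function name `sample_unnormalized` and the optional `rng` argument are choices made for this example; `phi` plays the same role as in the Matlab snippet. Note that this version returns a 0-based index and uses `bisect_right`, so it differs from the Matlab one-liner only when the uniform draw lands exactly on a cumulative boundary.

```python
import bisect
import itertools
import random


def sample_unnormalized(phi, rng=random):
    """Draw a 0-based index k with probability phi[k] / sum(phi).

    phi holds nonnegative weights with a positive sum; it does not need
    to be normalized.
    """
    # Running sums of the weights: an unnormalized CDF.
    cdf = list(itertools.accumulate(phi))
    # Uniform draw on [0, total) instead of [0, 1), which avoids normalizing.
    u = rng.random() * cdf[-1]
    # First index whose cumulative weight exceeds u (binary search).
    k = bisect.bisect_right(cdf, u)
    # Guard against floating-point rounding at the upper edge.
    return min(k, len(phi) - 1)
```

A call such as `sample_unnormalized([1.0, 2.0, 7.0])` should return index 2 roughly 70% of the time. Building the cumulative sums still costs time linear in the number of outcomes on every call; the binary search only speeds up the lookup once those sums exist.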
Ryan Adams and I just uploaded to the arXiv our paper “High-Dimensional Probability Estimation with Deep Density Models”. In this work, we introduce the deep density model (DDM), a new approach for density estimation.
I don’t have a favorite distribution, but if I had to pick one, I’d say the gamma. Why not the Gaussian? Because everyone loves the Gaussian! But when you want a prior distribution for the mean of your Poisson, or the variance of your Normal, who’s there to pick up the mess when the Gaussian lets you down? The gamma. When you’re trying to actually sample that Dirichlet that makes such a nice prior distribution for categorical distributions over your favorite distribution (how about that tongue twister), who’s there to help you? You guessed it, the gamma. But if you want a distribution that you can sample millions of times during each iteration of your MCMC algorithm, well, now the …
An exponential family parametrized by is the set of probability distributions that can be expressed as for given functions (the partition function), , and (the vector of sufficient statistics). Exponential families can be discrete or continuous, and examples include Gaussian distributions, Poisson distributions, and gamma distributions. Exponential families have a number of desirable properties. For instance, they have conjugate priors and they can summarize arbitrary amounts of data using a fixed-size vector of sufficient statistics. But in addition to their convenience, their use is theoretically justified.
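For reference, one common way to write such a family is shown here. The symbol names ($\theta$ for the parameter, $Z$ for the partition function, $h$ for the base measure, and $\phi$ for the sufficient statistics) are assumed for this sketch and may differ from those used in the full post:

$$p(x \mid \theta) = \frac{1}{Z(\theta)} \, h(x) \, \exp\left\{ \theta^\top \phi(x) \right\}, \qquad Z(\theta) = \int h(x) \, \exp\left\{ \theta^\top \phi(x) \right\} \, dx,$$

with the integral replaced by a sum when $x$ is discrete.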
Mutual information is a quantification of the dependency between random variables. It is sometimes contrasted with linear correlation since mutual information captures nonlinear dependence. In this short note I will discuss the relationship between these quantities in the case of a bivariate Gaussian distribution, and I will explore two implications of that relationship.
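For concreteness, the standard closed form for that relationship is given below. The notation is assumed here and not taken from the post: $X$ and $Y$ are jointly Gaussian with correlation coefficient $\rho$, and the logarithm is natural, so the result is in nats:

$$I(X; Y) = -\frac{1}{2} \log\left(1 - \rho^2\right).$$

The mutual information is zero when $\rho = 0$ and grows without bound as $|\rho| \to 1$.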