Machine Learning Glossary

Michael Gelbart · Blog, Machine Learning

I often have a hard time understanding the terminology in machine learning, even after almost three years in the field. For example, what is a Deep Belief Network? I attended a whole summer school on Deep Learning, but I’m still not quite sure. I decided to take a leap of faith and assume this is not just because the Deep Belief Networks in my brain are not functioning properly (although I am sure this is a factor). So, I created a Machine Learning Glossary to try to define some of these terms. The glossary can be found here. I have tried to write in an unpretentious style, defining things systematically and leaving no “exercises to the reader”. I also have … Read More

Optimal Spatial Prediction with Kriging

Robert Nishihara · Blog, Machine Learning, Statistics

Suppose we are modeling a spatial process (for instance, the amount of rainfall around the world, the distribution of natural resources, or the population density of an endangered species). We've measured the latent function $Z$ at some locations $s_1, \ldots, s_n$, and we'd like to predict the function's value at some new location $s_0$. Kriging is a technique for extrapolating our measurements to arbitrary locations. For an in-depth discussion, see Cressie and Wikle (2011). Here I derive Kriging in a simplified case. I will assume that $Z$ is an intrinsically stationary process. In other words, there exists some semivariogram $\gamma$ such that

$$\gamma(h) = \tfrac{1}{2}\,\mathrm{Var}\big(Z(s + h) - Z(s)\big).$$

Furthermore, I will assume that the process is isotropic (i.e. that $\gamma(h)$ is a function only of $\|h\|$). As Andy described here, the existence … Read More
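The prediction step the excerpt alludes to can be sketched in a few lines: ordinary Kriging solves a small linear system built from the semivariogram to get interpolation weights. This is a minimal sketch, not code from the post; the exponential semivariogram and its `sill`/`rng` parameters are assumptions chosen for illustration.

```python
import numpy as np

def gamma(h, sill=1.0, rng=2.0):
    # Hypothetical exponential semivariogram (an illustrative assumption,
    # not the one derived in the post).
    return sill * (1.0 - np.exp(-h / rng))

def ordinary_kriging(locs, z, s0):
    """Predict the process value at s0 from measurements z at locs (n x d)."""
    n = len(z)
    # Pairwise distances between measurement locations
    d = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
    G = gamma(d)
    # Augment with a Lagrange-multiplier row/column enforcing sum(w) = 1,
    # which makes the predictor unbiased under intrinsic stationarity.
    A = np.block([[G, np.ones((n, 1))],
                  [np.ones((1, n)), np.zeros((1, 1))]])
    b = np.append(gamma(np.linalg.norm(locs - s0, axis=1)), 1.0)
    w = np.linalg.solve(A, b)[:n]
    return w @ z

# Kriging is an exact interpolator: predicting at a measured
# location returns the measurement itself.
locs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z = np.array([1.0, 2.0, 3.0, 4.0])
print(ordinary_kriging(locs, z, np.array([0.0, 0.0])))  # ~1.0
```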

Fisher information

Roger Grosse · Blog, Statistics

I first heard about Fisher information in a statistics class, where it was given in terms of the following formulas, which I still find a bit mysterious and hard to reason about:

$$\mathcal{I}(\theta) = \mathbb{E}\!\left[\left(\frac{\partial}{\partial \theta} \log p(x; \theta)\right)^{2}\right] = -\,\mathbb{E}\!\left[\frac{\partial^{2}}{\partial \theta^{2}} \log p(x; \theta)\right].$$

It was motivated in terms of computing confidence intervals for your maximum likelihood estimates. But this sounds a bit limited, especially in machine learning, where we're trying to make predictions, not present someone with a set of parameters. It doesn't really explain why Fisher information seems so ubiquitous in our field: natural gradient, Fisher kernels, Jeffreys priors, and so on. This is how Fisher information is generally presented in machine learning textbooks. But I would choose a different starting point: Fisher information is the second derivative … Read More
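The two classical formulas above can be checked concretely on a small example. The following sketch (mine, not the post's) evaluates both expectations exactly for a Bernoulli($\theta$) distribution, where the Fisher information has the closed form $1/(\theta(1-\theta))$:

```python
import numpy as np

def fisher_score_form(theta):
    # E[(d/dtheta log p(x; theta))^2], exact sum over x in {0, 1}.
    # score(x) = x/theta - (1 - x)/(1 - theta)
    scores = np.array([-1.0 / (1.0 - theta), 1.0 / theta])  # x = 0, x = 1
    probs = np.array([1.0 - theta, theta])
    return probs @ scores**2

def fisher_curvature_form(theta):
    # -E[d^2/dtheta^2 log p(x; theta)], exact sum over x in {0, 1}.
    # second derivative: -x/theta^2 - (1 - x)/(1 - theta)^2
    second = np.array([-1.0 / (1.0 - theta) ** 2, -1.0 / theta**2])
    probs = np.array([1.0 - theta, theta])
    return -(probs @ second)

theta = 0.3
closed_form = 1.0 / (theta * (1.0 - theta))
print(fisher_score_form(theta), fisher_curvature_form(theta), closed_form)
```

Both expectations agree with the closed form, illustrating that the squared-score and negative-curvature definitions coincide.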