I just started a course on spatial statistics, so I’ve got covariance functions and variograms on the mind. This post is mostly for me to work through their intuition and relationship. Say you have some spatio-temporal process observed at specific locations, with the process taking some value at each of those points. For concreteness, these locations could be latitude and longitude and the field could be the outdoor temperature. Or maybe the locations are the space-time coordinates of a player on a basketball court and the field is her shot percentage or scoring efficiency from that point.
I recently read the paper “Variational Inference for Crowdsourcing,” by Qiang Liu, Jian Peng, and Alexander Ihler. They present an approach using belief propagation to deal with reliability when using crowdsourcing to collect labeled data. This post is based on their exposition. Crowdsourcing (via services such as Amazon Mechanical Turk) has been used as a cheap way to amass large quantities of labeled data. However, the labels are likely to be noisy. To deal with this, a common strategy is to employ redundancy: each task is labeled by multiple workers. For simplicity, suppose there is a fixed number of tasks and of workers, and assume a fixed set of possible labels. Define the label matrix so that each entry is the label given to a task by a worker (or …
A common activity in statistics and machine learning is optimization. For instance, finding maximum likelihood and maximum a posteriori estimates requires maximizing the likelihood function and the posterior distribution, respectively. Another example, and the motivating example for this post, is using variational inference to approximate a posterior distribution. Suppose we are interested in a posterior distribution that we cannot compute analytically. We will approximate it with a variational distribution that is parameterized by a set of variational parameters. Variational inference then proceeds to minimize the KL divergence between the variational distribution and the posterior. The dominant assumption in machine learning for the form of the variational distribution is a product distribution, that is, a product of factors, one per variational parameter. It can be shown that minimizing this KL divergence is equivalent …
In my last blog post I wrote about the asymptotic equipartition principle. This week I will write about something completely unrelated. This blog post evolved from a discussion with Brendan O’Connor about science and evidence. The back story is as follows.
Jeff Miller and Matthew Harrison at Brown (go Bears!) have recently explored the posterior consistency of Dirichlet process mixture (DPM) models, emphasizing one particular drawback. For setup, say you have some observed data from a mixture of two normals. In this case, the number of clusters is two, and one would imagine that as the number of observations grows, the posterior distribution of the number of clusters would concentrate on 2. However, this is not true if you model the data with a DPM (or, more generally, if you model the mixing measure as a Dirichlet process).
In machine learning, we often want to evaluate how well a model describes a dataset. In an unsupervised setting, we might use one of two criteria: (1) marginal likelihood, or Bayes factor: the probability of the data, with all parameters and latent variables integrated out; (2) held-out likelihood, or the probability of held-out test data under the parameters learned on the training set. Both of these criteria can require computing difficult high-dimensional sums or integrals, which I’ll refer to here as the partition function. In most cases, it’s infeasible to solve these integrals exactly, so we rely on approximation techniques, such as variational inference or sampling. Often you’ll hear claims that such-and-such an algorithm is an unbiased estimator of the partition function. …
Continuous problems are often simpler to solve than discrete problems. This is true in many optimization problems (for instance, linear programming versus integer linear programming). In the case of Markov chain Monte Carlo (MCMC), sampling continuous distributions has some advantages over sampling discrete distributions due to the availability of gradient information in the continuous case. The paper “Continuous Relaxations for Discrete Hamiltonian Monte Carlo” by Yichuan Zhang, Charles Sutton, Amos Storkey, and Zoubin Ghahramani explores the idea of performing inference in the discrete setting by deriving and sampling a related continuous distribution. Here I describe the approach taken in this paper.
Bayesian nonparametrics allow the construction of statistical models whose complexity is determined by the observed data. This is accomplished by specifying priors over infinite-dimensional distributions. The most widely used Bayesian nonparametric priors in machine learning are the Dirichlet process and the beta process, along with their corresponding marginal processes, the Chinese restaurant process and the Indian buffet process, respectively. The Dirichlet process provides a prior for the mixing measure of an infinite mixture model, and the beta process can be used as a prior for feature popularity in a latent feature model. The hierarchical Dirichlet process (HDP) also appears frequently in machine learning, underlying topic models with an infinite number of topics. A main selling point of Bayesian nonparametrics has been that …
The Asymptotic Equipartition Property/Principle (AEP) is a well-known result that is likely covered in any introductory information theory class. Nevertheless, when I first learned about it in such a course, I did not appreciate the implications of its general form. In this post I will review this beautiful, classic result and offer the mental picture I have of its implications. I will frame my discussion in terms of Markov chains with discrete state spaces, but note that the AEP holds even more generally. My treatment will be relatively informal, and I will assume basic familiarity with Markov chains. See the references for more details.
Much of what we do when we analyze data and invent algorithms is think about estimators for unknown quantities, even when we don’t directly phrase things this way. One type of estimator that we commonly encounter is the Monte Carlo estimator, which approximates expectations via the sample mean. That is, many problems in which we are interested involve a distribution on some space, where we wish to calculate the expectation of a function under that distribution. Averaging the function over samples drawn from the distribution is very nice because it gives you an unbiased estimator of that expectation. That is, the expectation of this estimator is the desired quantity. However, one issue that comes up very often is that we want to find an unbiased estimator of a …