Roger Grosse’s post on the need for a “solid theoretical framework” for “representation learning” is very intriguing. The term representation is ubiquitous in machine learning (for instance, it showed up in at least eight previous posts in this blog) and computational neuroscience (how are objects and concepts represented within the brain). My personal fascination with the topic started after watching David Krakauer’s talk on evolution of intelligence on earth, where he listed representation- in additions to inference, strategy, and Competition- as one of the tenets of intelligence; suggesting that our representations are tightly connected to the goals we aim to accomplish, how we infer hidden causes, what strategy we take on, and what competitive forces we have to deal with. … Read More
When you take a machine learning class, there’s a good chance it’s divided into a unit on supervised learning and a unit on unsupervised learning. We certainly care about this distinction for a practical reason: often there’s orders of magnitude more data available if we don’t need to collect ground-truth labels. But we also tend to think it matters for more fundamental reasons. In particular, the following are some common intuitions:
I recently read the paper “Variational Inference for Crowdsourcing,” by Qiang Liu, Jian Peng, and Alexander Ihler. They present an approach using belief propagation to deal with reliability when using crowdsourcing to collect labeled data. This post is based on their exposition. Crowdsourcing (via services such as Amazon Mechanical Turk) has been used as a cheap way to amass large quantities of labeled data. However, the labels are likely to be noisy. To deal with this, a common strategy is to employ redundancy: each task is labeled by multiple workers. For simplicity, suppose there are tasks and workers, and assume that the possible labels are . Define the matrix so that is the label given to task by worker (or … Read More
A common activity in statistics and machine learning is optimization. For instance, finding maximum likelihood and maximum a posteriori estimates require maximizing the likilihood function and posterior distribution respectively. Another example, and the motivating example for this post, is using variational inference to approximate a posterior distribution. Suppose we are interested in a posterior distribution, , that we cannot compute analytically. We will approximate with the variational distribution that is parameterized by the variational parameters . Variational inference then proceeds to minimize the KL divergence from to , . The dominant assumption in machine learning for the form of is a product distribution, that is (where we assume there are variational parameters). It can be shown that minimizing is equivalent … Read More
Developing efficient (i.e. polynomial time) algorithms with guaranteed performance is a central goal in computer science (perhaps the central goal). In machine learning, inference algorithms meeting these requirements are much rarer than we would like: often, an algorithm is either efficient but doesn’t perform optimally or vice versa. A number of results from the 1990’s demonstrate the challenges of, but also the potential for, efficient Bayesian inference. These results were carried out in the context of Bayesian networks.
I just attended a fun event, Celebrating 100 Years of Markov Chains, at the Institute for Applied Computational Science. There were three talks and they were taped, so hopefully you will be able to find the videos through the IACS website in the near future. Below, I will review some highlights of the first two talks by Brian Hayes and Ryan Adams; I’m skipping the last one because it was more of a review of concepts building up to and surrounding Markov chain Monte Carlo (MCMC). The first talk was intriguingly called “First Links in the Markov Chain: Poetry and Probability”
Jeff Miller and Matthew Harrison at Brown (go Bears!) have recently explored the posterior consistency of Dirichlet process mixture (DPM) models, emphasizing one particular drawback. For setup, say you have some observed data from a mixture of two normals, such as In this case, the number of clusters, , is two, and one would imagine that as grows, the posterior distribution of would converge to 2, i.e. . However, this is not true if you model the data with a DPM (or more generally, modeling the mixing measure as a Dirichlet process, ).
In machine learning, we often want to evaluate how well a model describes a dataset. In an unsupervised setting, we might use one of two criteria: marginal likelihood, or Bayes factor: the probability of the data, with all parameters and latent variables integrated out held-out likelihood, or the probability of held-out test data under the parameters learned on the training set Both of these criteria can require computing difficult high-dimensional sums or integrals, which I’ll refer to here as the partition function. In most cases, it’s infeasible to solve these integrals exactly, so we rely on approximation techniques, such as variational inference or sampling. Often you’ll hear claims that such-and-such an algorithm is an unbiased estimator of the partition function. … Read More
In my previous post I suggested that models of neural computation can be expressed as prior distributions over functional and effective connectivity, and with this common specification we can compare models by their posterior probability given neural recordings. I would like to explore this idea in more detail by first describing functional and effective connectivity and then considering how various models could be expressed in this framework. Functional and effective connectivity are concepts originating in neuroimaging and spike train analysis. Functional connectivity measures the correlation between neurophysiological events (e.g. spikes on neurons or BOLD signal in fMRI voxels), whereas effective connectivity is a statement about the causal nature of a system. Effective connectivity captures the influence one neurophysiological event has … Read More
Continuous problems are often simpler to solve than discrete problems. This is true in many optimization problems (for instance, linear programming versus integer linear programming). In the case of Markov chain Monte Carlo (MCMC), sampling continuous distributions has some advantages over sampling discrete distributions due to the availability of gradient information in the continuous case. The paper “Continuous Relaxations for Discrete Hamiltonian Monte Carlo” by Yichuan Zhang, Charles Sutton, Amos Storkey, and Zoubin Ghahramani explores the idea of performing inference in the discrete setting by deriving and sampling a related continuous distribution. Here I describe the approach taken in this paper.