Bayesian nonparametrics allow the contruction of statistical models whose complexity is determined by the observed data. This is accomplished by specifying priors over infinite dimensional distributions. The most widely used Bayesian nonparametric priors in machine learning are the Dirichlet process, the beta process and their corresponding marginal processes the Chinese restaurant process and the Indian buffet process respectively. The Dirichlet process provides a prior for the mixing measure of an infinite mixture model and the beta process can be used as a prior for feature popularity in a latent feature model. The hierarchical Dirichlet process (HDP) also appears frequently in machine learning underlying topic models with an infinite number of topics. A main selling point of Bayesian nonparametrics has been that … Read More
One of the questions in the air at NIPS 2012 was, how do we make machine learning algorithms scale to large datasets? There are two main approaches: (1) developing parallelizable ML algorithms and integrating them with large parallel systems and (2) developing more efficient algorithms. More often than not, the latter approach requires some sort of relaxation of an underlying task. Hashing, streaming algorithms and sketching are increasingly employed to achieve efficient approximate algorithms that arise in ML tasks. Below, I highlight a few examples, mostly from NIPS 2012, with several coming from the Big Learning workshop. Nearest neighbor search (or similarity search) appears in many “meta” ML tasks such as information retrieval and near-duplicate detection. Many approximate approches are … Read More
Before diving into more technical posts, I want to briefly touch on some basic questions, and the big picture behind unsupervised learning. I also want to do a bit of handwaving on sparsity—a topic that has gotten a lot of attention recently. Let’s say we are given observations . These points are assumed to contain some underlying structure, which we seek to capture in order to perform tasks such as classification or compression. We can apply our algorithms on the data in their raw form—which carries unidentified redundancy—and hope for the best. However, a more sensible approach would be to first
Say you have a set of -dimensional iid samples drawn from some unknown continuous distribution that you want to estimate with an undirected graphical model. You can sometimes get away with assuming the ‘s are drawn from a multivariate normal (MVN), and from there you can use a host of methods for estimating the covariance matrix , and thus the graph structure (perhaps imposing sparsity constraints for inferring structure in high dimensional data, ). In other cases the Gaussian assumption is too restrictive (e.g. when marginals exhibit multimodal behavior). One way to augment the expressivity of the MVN while maintaining some of the desirable properties is to assume that some function of the data is MVN.
Often the goal of inference and learning is to use the inferred marginal distributions for prediction or classification purposes. In such scenarios, finding the correct “model structure” or the true “model parameters”, via maximum-likelihood (ML) estimation or (generalized) expectation-maximization (EM), is secondary to the final objective of minimizing a prediction or a classification cost function. Recently, I came across a few interesting papers on learning and inference in graphical models by direct optimization of a cost function of the inferred marginal distributions (or normalized beliefs) [1, 2, 3, 4]: , where f is a differentiable function that maps the beliefs (bs) to the outcomes/labels of interest, is a set of model parameters, and C is a differentiable cost function that penalizes for incorrect … Read More
The method of moments is a simple idea for estimating the parameters of a model. Suppose the observed data are sampled iid from a distribution , our goal is to estimate the parameter . If we have enough data, we can get very close to the true mean of the distribution, , which is a function of : . We know the form of since we know the form of . For simple distributions, knowing just the mean is enough to invert to obtain . In general, we need to calculate the higher moments, which are also known functions of : . We then try to invert the systems of equations to obtain . In practice, we typically only have … Read More
Some of the common complaints I hear about (learning) theoretical work run along the lines of “those bounds are meaningless in practice,” “that result doesn’t apply to any algorithm someone would actually use,” and “you lost me as soon as martingales/Banach spaces/measure-theoretic niceties/… got involved.” I don’t have a good answer for the latter concern, but a very nice paper by Sasha Rakhlin, Ohad Shamir, and Karthik Sridharan at NIPS this year goes some ways toward address the first two criticisms. Their paper, “Relax and Randomize: From Value to Algorithms,” (extended version here) is concerned with transforming non-constructive online regret bounds into useful algorithms.
My aim in this introductory post is to provide context for future contributions by sharing my thoughts on the role of computer scientists in the study of brain, particularly in the field of computational neuroscience. For a field with “computation” in the name, it seems that computer scientists are underrepresented. I believe there are significant open questions in neuroscience which are best addressed by those who study the theory of computation, learning, and algorithms, and the systems upon which they are premised. In my opinion “computational neuroscience” has two definitions: the first, from Marr, is the study of the computational capabilities of the brain, their algorithmic details, and their implementation in neural circuits; the second, stemming from machine learning, is … Read More
While at NIPS, I came across the paper Deep Learning of Invariant Features via Simulated Fixations in Video by Will Zou, Shenghuo Zhu, Andrew Ng, and Kai Yu. It proposes a particularly appealing unsupervised method for using videos to learn image features. Their method appears to be somewhat inspired by the human visual system. For instance, people have access to video data, not static images. They also attempt to mimic the human tendency to fixate on particular objects. They track objects through successive frames in order to provide more coherent data to the learning algorithm. The authors use a stacked architecture, where each layer is trained by optimizing an embedding into a feature space. As usual, the optimization problem involves a reconstruction … Read More