[latexpage] Before diving into more technical posts, I want to briefly touch on some basic questions, and the big picture behind unsupervised learning. I also want to do a bit of handwaving on sparsity—a topic that has gotten a lot of attention recently. Let’s say we are given observations $\mathbf{y}_1,\ldots,\mathbf{y}_N\in\mathbb{R}^D$. These points are assumed to contain some underlying structure, which we seek to capture in order to perform tasks such as classification or compression. We can apply our algorithms on the data in their raw form—which carries unidentified redundancy—and hope for the best. However, a more sensible approach would be to first

## Nonparanormal Activity

[latexpage] Say you have a set of $n$ $p$-dimensional iid samples $\{ \textbf x_i \}_{i=1}^n$ drawn from some unknown continuous distribution that you want to estimate with an undirected graphical model. You can sometimes get away with assuming the $\textbf x_i$’s are drawn from a multivariate normal (MVN), and from there you can use a host of methods for estimating the covariance matrix $\Sigma$, and thus the graph structure $\Omega = \Sigma^{-1}$ (perhaps imposing sparsity constraints for inferring structure in high dimensional data, $n<<p$). In other cases the Gaussian assumption is too restrictive (e.g. when marginals exhibit multimodal behavior). One way to augment the expressivity of the MVN while maintaining some of the desirable properties is to assume that some …

## Discriminative (supervised) Learning

Often the goal of inference and learning is to use the inferred marginal distributions for prediction or classification purposes. In such scenarios, finding the correct “model structure” or the true “model parameters”, via maximum-likelihood (ML) estimation or (generalized) expectation-maximization (EM), is secondary to the final objective of minimizing a prediction or a classification cost function. Recently, I came across a few interesting papers on learning and inference in graphical models by direct optimization of a cost function of the inferred marginal distributions (or normalized beliefs) [1, 2, 3, 4]: , where f is a differentiable function that maps the beliefs (bs) to the outcomes/labels of interest, is a set of model parameters, and C is a differentiable cost function that penalizes for incorrect …

## Method of moments

[latexpage] The method of moments is a simple idea for estimating the parameters of a model. Suppose the observed data are sampled iid from a distribution $p(x|\theta)$, our goal is to estimate the parameter $\theta$. If we have enough data, we can get very close to the true mean of the distribution, $E[x]$, which is a function of $\theta$: $E[x]=f_1(\theta)$. We know the form of $f_1$ since we know the form of $p(x|\theta)$. For simple distributions, knowing just the mean is enough to invert $f_1$ to obtain $\theta$. In general, we need to calculate the higher moments, which are also known functions of $\theta$: $E[x^2] = f_2(\theta), …, E[x^n]=f_n(\theta)$. We then try to invert the systems of equations $f_1, …, f_n$ …

## Turning Theory into Algorithms

[latexpage] Some of the common complaints I hear about (learning) theoretical work run along the lines of “those bounds are meaningless in practice,” “that result doesn’t apply to any algorithm someone would actually use,” and “you lost me as soon as martingales/Banach spaces/measure-theoretic niceties/… got involved.” I don’t have a good answer for the latter concern, but a very nice paper by Sasha Rakhlin, Ohad Shamir, and Karthik Sridharan at NIPS this year goes some ways toward address the first two criticisms. Their paper, “Relax and Randomize: From Value to Algorithms,” (extended version here) is concerned with transforming non-constructive online regret bounds into useful algorithms.

## The “Computation” in Computational Neuroscience

My aim in this introductory post is to provide context for future contributions by sharing my thoughts on the role of computer scientists in the study of brain, particularly in the field of computational neuroscience. For a field with “computation” in the name, it seems that computer scientists are underrepresented. I believe there are significant open questions in neuroscience which are best addressed by those who study the theory of computation, learning, and algorithms, and the systems upon which they are premised. In my opinion “computational neuroscience” has two definitions: the first, from Marr, is the study of the computational capabilities of the brain, their algorithmic details, and their implementation in neural circuits; the second, stemming from machine learning, is …

## Learning Image Features from Video

While at NIPS, I came across the paper Deep Learning of Invariant Features via Simulated Fixations in Video by Will Zou, Shenghuo Zhu, Andrew Ng, and Kai Yu. It proposes a particularly appealing unsupervised method for using videos to learn image features. Their method appears to be somewhat inspired by the human visual system. For instance, people have access to video data, not static images. They also attempt to mimic the human tendency to fixate on particular objects. They track objects through successive frames in order to provide more coherent data to the learning algorithm. The authors use a stacked architecture, where each layer is trained by optimizing an embedding into a feature space. As usual, the optimization problem involves a reconstruction …