## Data compression and unsupervised learning, Part 2

This is a continuation of my last post about data compression and machine learning. In this post, I will start to address the question:

Does “good” compression generally lead to “good” unsupervised learning?

What is a “good” compression algorithm?

## An Auxiliary Variable Trick for MCMC

I recently uploaded the paper “Parallel MCMC with Generalized Elliptical Slice Sampling” to the arXiv. I’d like to highlight one trick that we used, but first I’ll give some background. Markov chain Monte Carlo (MCMC) is a class of algorithms for generating samples from a specified probability distribution $\pi({\bf x})$ (in the continuous setting, the distribution is generally specified by its density function). Elliptical slice sampling is an MCMC algorithm that can be used to sample from distributions of the form

\[ \pi({\bf x}) \propto \mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma) L({\bf x}), \]

where $\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)$ is a multivariate Gaussian prior with mean $\boldsymbol\mu$ and covariance matrix $\boldsymbol\Sigma$, and $L({\bf x})$ is a likelihood function. Suppose we want to generalize this algorithm to sample from arbitrary continuous probability distributions. We could simply factor the distribution $\pi({\bf x})$ as

\[ \pi({\bf x}) = \mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma) \cdot \frac{\pi({\bf x})}{\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)}, \]
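
For readers who prefer code, here is a minimal sketch of the standard elliptical slice sampling update together with the naive factorization above. It is not the generalized algorithm from the paper. The function names (`elliptical_slice_step`, `residual_log_lik`) and the use of a Cholesky factor `chol_sigma` of $\boldsymbol\Sigma$ are assumptions made for illustration.

```python
import numpy as np


def elliptical_slice_step(x, log_lik, mu, chol_sigma, rng):
    """One elliptical slice sampling update for pi(x) ∝ N(x; mu, Sigma) L(x).

    chol_sigma is assumed to be a lower-triangular Cholesky factor of Sigma.
    """
    # Auxiliary draw from the zero-mean Gaussian N(0, Sigma); defines the ellipse.
    nu = chol_sigma @ rng.standard_normal(x.shape[0])
    # Slice threshold: log(u) with u ~ Uniform(0, 1) is minus an Exp(1) draw.
    log_y = log_lik(x) - rng.exponential()
    # Initial proposal angle and the bracket that will be shrunk.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        x_new = mu + (x - mu) * np.cos(theta) + nu * np.sin(theta)
        if log_lik(x_new) > log_y:
            return x_new
        # Rejected: shrink the bracket toward theta = 0 (the current state).
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)


def residual_log_lik(log_pi, mu, chol_sigma):
    """Return log[pi(x) / N(x; mu, Sigma)] up to an additive constant."""
    def log_lik(x):
        z = np.linalg.solve(chol_sigma, x - mu)
        log_gauss = -0.5 * (z @ z)  # Gaussian log density, constant dropped
        return log_pi(x) - log_gauss
    return log_lik
```

As a usage example, take a hypothetical one-dimensional target with `log_pi = lambda x: -np.sum((x - 0.5) ** 2)`, build `log_lik = residual_log_lik(log_pi, np.zeros(1), np.eye(1))`, and call `elliptical_slice_step` repeatedly starting from `np.zeros(1)` with `rng = np.random.default_rng(0)`. Every call returns an accepted state, since the bracket shrinks toward the current point.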

## What is representation learning?

In my last post, I argued that a major distinction in machine learning is between predictive learning and representation learning. Now I’ll take a stab at summarizing what representation learning is about. Or, at least, what I think of as the first principal component of representation learning.

## High-Dimensional Probability Estimation with Deep Density Models

Ryan Adams and I just uploaded to the arXiv our paper “High-Dimensional Probability Estimation with Deep Density Models”. In this work, we introduce the deep density model (DDM), a new approach for density estimation.

## Data compression and unsupervised learning

Data compression and unsupervised learning are two concepts whose relationship is perhaps underappreciated. Compression and unsupervised learning are both about finding patterns in data, but does the similarity go any further? I argue that it does.

## Learning Theory: Purely Theoretical?

What’s learning theory good for, anyway? As I mentioned in my earlier blog post, I not infrequently get into conversations with people in machine learning and related fields who don’t see the benefit of learning theory (that is, theory of learning). While that post offered one specific piece of evidence of how work seemingly only relevant in pure theory could lead to practical algorithms, I thought I would talk in more general terms about why I see learning theory as a worthwhile endeavor.

There are two main flavors of learning theory: statistical learning theory (StatLT) and computational learning theory (CompLT). StatLT originated with Vladimir Vapnik, while the canonical example of CompLT, PAC learning, was formulated by Leslie Valiant. StatLT, in line with its “statistical” descriptor, focuses on asymptotic questions (though generally based on useful non-asymptotic bounds). It is less concerned with computational efficiency, which is where CompLT comes in. Computer scientists are all about efficient algorithms (which for the purposes of theory essentially means polynomial vs. super-polynomial time). Generally, StatLT results apply to a wider variety of hypothesis classes, with few or no assumptions made about the concept class (a concept class refers to the class of functions to which the data generating mechanism belongs). CompLT results apply to very specific concept classes but have stronger performance guarantees, often using polynomial time algorithms. I’ll do my best to defend both flavors, while also mentioning some of their limitations.
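
To make “performance guarantee” a little more concrete, here is the textbook form of a PAC guarantee, given as an illustration only. The symbols are notation introduced here, not taken from the post: $\epsilon$ is the accuracy, $\delta$ the confidence, $m$ the number of i.i.d. training examples, $\mathcal H$ a finite hypothesis class assumed to contain the target concept, and $\mathrm{err}(h)$ the true error of a hypothesis $h$. Any learner that outputs an $h \in \mathcal H$ consistent with the training examples satisfies

\[ \Pr\left[\mathrm{err}(h) \le \epsilon\right] \ge 1 - \delta \quad \text{whenever} \quad m \ge \frac{1}{\epsilon}\left(\ln|\mathcal H| + \ln\frac{1}{\delta}\right). \]

The required sample size grows only polynomially in $1/\epsilon$ and $1/\delta$. The computational question is then whether a consistent hypothesis can also be found in polynomial time.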

## Getting above the fray with lifted inference

Hi, I’m Jon. In my series of posts, I’ll be writing about how we can use the modern Bayesian toolkit to efficiently make decisions, solve problems, and formulate plans (the province of AI), rather than restrict ourselves to approximating posteriors (the province of statistics and much of machine learning).

Here’s a simple example of how AI can help out machine learning. What was the first graphical model you were exposed to? There’s a good chance it was Pearl’s famous “Sprinkler, Rain, Wet grass” graphical model[1].

## What the hell is representation? *

Roger Grosse’s post on the need for a “solid theoretical framework” for “representation learning” is very intriguing. The term representation is ubiquitous in machine learning (for instance, it showed up in at least eight previous posts in this blog) and computational neuroscience (how are objects and concepts represented within the brain).

My personal fascination with the topic started after watching David Krakauer’s talk on the evolution of intelligence on Earth, where he listed representation, in addition to inference, strategy, and competition, as one of the tenets of intelligence, suggesting that our representations are tightly connected to the goals we aim to accomplish, how we infer hidden causes, what strategy we take on, and what competitive forces we have to deal with.

Professor Krakauer goes on to reason that what enabled the invention of algebra was the “efficient” representation of numbers via the Arabic numeral system (think about 3998 in Arabic numerals versus Roman numerals: MMMCMXCVIII), which allowed for easier manipulation of numbers (for example, using Arabic numerals: 42 x 133 = 5586, versus using Roman numerals: XLII x CXXXIII = MMMMMDLXXXVI). The Arabic numeral system is not only more compressed, but also lends itself to easier compositional rules (it is not clear how the second multiplication even works!).
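
The numeral example is easy to check in code. The sketch below uses two hypothetical helpers, `to_roman` and `from_roman`, and assumes the purely additive convention for thousands (so 5000 is written MMMMM, as in the example above).

```python
def to_roman(n: int) -> str:
    """Convert a positive integer to Roman numerals (thousands written as repeated M)."""
    symbols = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, symbol in symbols:
        count, n = divmod(n, value)  # how many times this symbol fits
        out.append(symbol * count)
    return "".join(out)


def from_roman(s: str) -> int:
    """Convert Roman numerals back to an integer."""
    values = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
    total = 0
    # Pair each character with its successor; a smaller value before a
    # larger one is subtracted (e.g. the X in XL).
    for ch, nxt in zip(s, s[1:] + " "):
        v = values[ch]
        total += -v if values.get(nxt, 0) > v else v
    return total


# Length of each representation of the same number.
print(len("3998"), len(to_roman(3998)))
# "Multiplying in Roman numerals" by leaving the representation and coming back.
print(to_roman(from_roman("XLII") * from_roman("CXXXIII")))
```

Note the design choice in the last line: the simplest way to multiply here is to convert to integers, multiply, and convert back, which is itself a small argument that the positional representation is the one that supports the operation.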

So, why is representation important? The short answer is that a “good” representation makes life so much easier, presumably by reducing the computational burden of doing inference/classification/prediction. How can we systematically arrive at a good representation? In general, good representations seem to keep only those features of the data that co-vary the most with the outcomes of interest. If so, then there are no “good” representations, only good representations with respect to a set of objectives, given all the computational and resource constraints.

In conclusion, I suspect that combining unsupervised learning and supervised learning into one coherent learning framework can be a good starting point (i.e., extracting latent features from the data that are maximally predictive of outcomes of interest; see Outcome Discriminative Learning and deep learning).

There are these two young fish swimming along, and they happen to meet an older fish swimming the other way, who nods at them and says, “Morning, boys, how’s the water?” And the two young fish swim on for a bit, and then eventually one of them looks over at the other and goes, “What the hell is water?” — David Foster Wallace

## Predictive learning vs. representation learning

When you take a machine learning class, there’s a good chance it’s divided into a unit on supervised learning and a unit on unsupervised learning. We certainly care about this distinction for a practical reason: there are often orders of magnitude more data available if we don’t need to collect ground-truth labels. But we also tend to think it matters for more fundamental reasons. In particular, the following are some common intuitions:

## Dealing with Reliability when Crowdsourcing

I recently read the paper “Variational Inference for Crowdsourcing,” by Qiang Liu, Jian Peng, and Alexander Ihler. They present an approach using belief propagation to deal with reliability when using crowdsourcing to collect labeled data. This post is based on their exposition. Crowdsourcing (via services such as Amazon Mechanical Turk) has been used as a cheap way to amass large quantities of labeled data. However, the labels are likely to be noisy.

To deal with this, a common strategy is to employ redundancy: each task is labeled by multiple workers. For simplicity, suppose there are $N$ tasks and $M$ workers, and assume that the possible labels are $\{\pm 1\}$. Define the $N \times M$ matrix $L$ so that $L_{ij}$ is the label given to task $i$ by worker $j$ (or $0$ if no label was provided). Let $z = (z_1, \ldots, z_N)$ be the true labels of the tasks. Given $L$, we wish to come up with an estimator $\hat{z}$ of $z$ so as to minimize the average error

\begin{align}

\frac{1}{N} \sum_{i=1}^N \text{prob}[\hat{z}_i \ne z_i] .