What the hell is representation? *

Roger Grosse’s post on the need for a “solid theoretical framework” for “representation learning” is very intriguing. The term representation is ubiquitous in machine learning (for instance, it showed up in at least eight previous posts on this blog) and computational neuroscience (where a central question is how objects and concepts are represented in the brain).

My personal fascination with the topic started after watching David Krakauer’s talk on the evolution of intelligence on Earth, where he listed representation, in addition to inference, strategy, and competition, as one of the tenets of intelligence, suggesting that our representations are tightly connected to the goals we aim to accomplish, how we infer hidden causes, what strategies we adopt, and what competitive forces we have to deal with.

Professor Krakauer goes on to reason that what enabled the invention of algebra was an “efficient” representation of numbers via the Arabic numeral system (think of 3998 in Arabic numerals versus Roman numerals: MMMCMXCVIII), which allowed for easier manipulation of numbers (for example, using Arabic numerals: 42 x 133 = 5586, versus using Roman numerals: XLII x CXXXIII = MMMMMDLXXXVI). The Arabic numeral system is not only more compressed, but also lends itself to simpler compositional rules (it is not even clear how the second multiplication works!).
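The compression gap between the two systems is easy to make concrete. Here is a minimal sketch (the function name `int_to_roman` is my own) that converts integers to Roman numerals and compares the lengths of the two representations:

```python
def int_to_roman(n):
    """Convert a positive integer to Roman numerals (thousands simply repeat M)."""
    vals = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
            (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
            (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in vals:
        count, n = divmod(n, value)   # greedily take as many of each symbol as fit
        out.append(symbol * count)
    return "".join(out)

print(int_to_roman(3998))       # MMMCMXCVIII  (11 symbols vs. 4 digits)
print(int_to_roman(42 * 133))   # MMMMMDLXXXVI
```

Note that the Roman representation of 3998 needs nearly three times as many symbols as the Arabic one, and, unlike positional notation, the symbol count grows erratically with the number’s magnitude.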

So, why is representation important? The short answer is that a “good” representation makes life much easier, presumably by reducing the computational burden of inference/classification/prediction. How can we systematically arrive at a good representation? In general, good representations seem to keep only those features of the data that co-vary the most with the outcomes of interest. If so, then there are no universally “good” representations, only representations that are good with respect to a set of objectives, given all the computational and resource constraints.
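The idea of keeping only outcome-relevant features can be illustrated with a toy sketch (all variable names here are illustrative, not from any particular method): two candidate features, one informative and one pure noise, scored by their correlation with the outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)       # co-varies with the outcome
noise = rng.normal(size=n)             # irrelevant to the outcome
y = (informative > 0).astype(float)    # outcome of interest

X = np.column_stack([noise, informative])

# Score each feature by the magnitude of its correlation with the outcome,
# and keep the highest-scoring one as the "representation".
scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
keep = int(np.argmax(scores))          # selects the informative feature (index 1)
```

The point is not the selection rule itself (correlation is the crudest possible criterion) but that “goodness” is only defined relative to the outcome `y`: swap in a different objective and a different feature may win.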

In conclusion, I suspect that combining unsupervised learning and supervised learning into one coherent framework would be a good starting point (i.e., extracting latent features from the data that are maximally predictive of the outcomes of interest; see Outcome Discriminative Learning and deep learning).
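One way such a combined framework could look, sketched under my own assumptions (a linear encoder, a logistic predictor, and a trade-off weight `lam`; none of these choices come from the post): a single objective that sums an unsupervised reconstruction term and a supervised prediction term, so the latent features must both explain the data and predict the outcome.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))            # data
y = (X[:, 0] > 0).astype(float)          # outcome of interest

def combined_loss(W, v, lam=1.0):
    Z = X @ W                            # latent features (linear encoder)
    X_hat = Z @ W.T                      # linear reconstruction of the data
    recon = np.mean((X - X_hat) ** 2)    # unsupervised term
    p = 1 / (1 + np.exp(-(Z @ v)))       # predict the outcome from the latents
    pred = -np.mean(y * np.log(p + 1e-9)
                    + (1 - y) * np.log(1 - p + 1e-9))  # supervised term
    return recon + lam * pred            # lam trades off the two objectives

W0 = 0.1 * rng.normal(size=(5, 2))
v0 = 0.1 * rng.normal(size=2)
loss = combined_loss(W0, v0)
```

Minimizing `combined_loss` over `W` and `v` (by any gradient method) would then yield latents shaped jointly by the data distribution and by the outcomes, which is exactly the coupling argued for above.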


There are these two young fish swimming along, and they happen to meet an older fish swimming the other way, who nods at them and says, “Morning, boys, how’s the water?” And the two young fish swim on for a bit, and then eventually one of them looks over at the other and goes, “What the hell is water?” — David Foster Wallace

Discriminative (supervised) Learning

Often the goal of inference and learning is to use the inferred marginal distributions for prediction or classification purposes. In such scenarios, finding the correct “model structure” or the true “model parameters”, via maximum-likelihood (ML) estimation or (generalized) expectation-maximization (EM), is secondary to the final objective of minimizing a prediction or classification cost function. Recently, I came across a few interesting papers on learning and inference in graphical models by direct optimization of a cost function of the inferred marginal distributions (or normalized beliefs) [1, 2, 3, 4]:

\( e = C(\mathrm{outcomes}, f(bs); \Theta) \),

where f is a differentiable function that maps the beliefs (bs) to the outcomes/labels of interest, \( \Theta \) is a set of model parameters, and C is a differentiable cost function that penalizes incorrect classifications or predictions.
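The simplest instance of this recipe I can think of (a toy of my own construction, not the setup of [1–4], where inference in a graphical model would produce the beliefs) takes the belief to be \( b = \sigma(\theta^\top x) \), f the identity, and C the cross-entropy, and then descends the gradient of C with respect to \( \Theta = \theta \) directly:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y = (X @ theta_true + 0.3 * rng.normal(size=200) > 0).astype(float)

theta = np.zeros(3)                          # model parameters Theta
lr = 0.5
for _ in range(200):
    b = 1 / (1 + np.exp(-(X @ theta)))       # inferred belief p(y=1 | x; theta)
    # Gradient of the cross-entropy cost C(outcomes, b; theta) w.r.t. theta:
    grad = X.T @ (b - y) / len(y)
    theta -= lr * grad                       # direct descent on the cost

b = 1 / (1 + np.exp(-(X @ theta)))
accuracy = np.mean((b > 0.5) == y)
```

Here the parameters are never fit by maximizing likelihood of a generative model; they are tuned only to make the beliefs useful for classification, which is the shift in objective the papers above advocate.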