What is representation learning?

In my last post, I argued that a major distinction in machine learning is between predictive learning and representation learning. Now I’ll take a stab at summarizing what representation learning is about. Or, at least, what I think of as the first principal component of representation learning. Continue reading “What is representation learning?”

High-Dimensional Probability Estimation with Deep Density Models


Ryan Adams and I just uploaded to the arXiv our paper “High-Dimensional Probability Estimation with Deep Density Models”. In this work, we introduce the deep density model (DDM), a new approach for density estimation. Continue reading “High-Dimensional Probability Estimation with Deep Density Models”

Data compression and unsupervised learning

Data compression and unsupervised learning are two concepts whose relationship is perhaps underappreciated. Compression and unsupervised learning are both about finding patterns in data — but, does the similarity go any further? I argue that it does. Continue reading “Data compression and unsupervised learning”

A Parallel Gamma Sampling Implementation


I don’t have a favorite distribution, but if I had to pick one, I’d say the gamma.  Why not the Gaussian? Because everyone loves the Gaussian! But when you want a prior distribution for the mean of your Poisson, or the variance of your Normal, who’s there to pick up the mess when the Gaussian lets you down? The gamma. When you’re trying to actually sample that Dirichlet that makes such a nice prior distribution for categorical distributions over your favorite distribution (how about that tongue twister), who’s there to help you?  You guessed it, the gamma. But if you want a distribution that you can sample millions of times during each iteration of your MCMC algorithm, well, now the Gaussian is looking pretty good, but let’s not give up hope on the gamma just yet.

Continue reading “A Parallel Gamma Sampling Implementation”

Exponential Families and Maximum Entropy

[latexpage]An exponential family parametrized by $\boldsymbol\theta \in \mathbb R^d$ is the set of probability distributions that can be expressed as

\[ p({\bf x} \,|\, \boldsymbol\theta) =\frac{1}{Z(\boldsymbol\theta)} h({\bf x}) \exp\left( \boldsymbol\theta^{\mathsf T}\boldsymbol\phi({\bf x}) \right) ,\]

for given functions $Z(\boldsymbol\theta)}$ (the partition function), $h({\bf x})$, and $\boldsymbol\phi({\bf x})$ (the vector of sufficient statistics). Exponential families can be discrete or continuous, and examples include Gaussian distributions, Poisson distributions, and gamma distributions. Exponential families have a number of desirable properties. For instance, they have conjugate priors and they can summarize arbitrary amounts of data using a fixed-size vector of sufficient statistics. But in addition to their convenience, their use is theoretically justified. Continue reading “Exponential Families and Maximum Entropy”

Learning Theory: Purely Theoretical?


What’s learning theory good for, anyway? As I mentioned in my earlier blog post, not infrequently get into conversations with people in machine learning and related fields who don’t see the benefit of learning theory (that is, theory of learning). While that post offered one specific piece of evidence of how work seemingly only relevant in pure theory could lead to practical algorithms, I thought I would talk in more general terms why I see learning theory as a worthwhile endeavor.

There are two main flavors of learning theory, statistical learning theory (StatLT) and computational learning (CompLT). StatLT originated with Vladimir Vapnik, while the canonical example of CompLT, PAC learning, was formulated by Leslie Valiant. StatLT, in line with its “statistical” descriptor, focuses on asymptotic questions (though generally based on useful non-asymptotic bounds). It is less concerned with computational efficiency, which is where CompLT comes in. Computer scientists are all about efficient algorithms (which for the purposes of theory essentially means polynomial vs. super-polynomial time). Generally, StatLT results apply to a wider variety of hypothesis classes, with few or no assumptions made about the concept class (a concept class refers to the class of functions to which the data generating mechanism belongs). CompLT results apply to very specific concept classes but have stronger performance guarantees, often using polynomial time algorithms. I’ll do my best to defend both flavors, while also mentioning some of their limitations.

Continue reading “Learning Theory: Purely Theoretical?”

The Fundamental Matrix of a Finite Markov Chain

The purpose of this post is to present the very basics of potential theory for finite Markov chains. This post is by no means a complete presentation but rather aims to show that there are intuitive finite analogs of the potential kernels that arise when studying Markov chains on general state spaces. By presenting a piece of potential theory for Markov chains without the complications of measure theory I hope the reader will be able to appreciate the big picture of the general theory.
Continue reading “The Fundamental Matrix of a Finite Markov Chain”

Disconnectivity graphs

I would like to briefly introduce disconnectivity graphs — striking visualizations of multidimensional energy landscapes that I had never seen before. While it’s not immediately obvious how useful they are, it should be straightforward to adapt them for visualizing probability distributions.

A quick Google search for ‘disconnectivity graph’ will turn up lots of examples. These things look like chandeliers and are meant to summarize the potential energy surface of a molecule, potentially with many degrees of freedom and many local optima Continue reading “Disconnectivity graphs”

Correlation and Mutual Information


Mutual information is a quantification of the dependency between random variables. It is sometimes contrasted with linear correlation since mutual information captures nonlinear dependence. In this short note I will discuss the relationship between these quantities in the case of a bivariate Gaussian distribution, and I will explore two implications of that relationship.
Continue reading “Correlation and Mutual Information”

Getting above the fray with lifted inference

Hi, I’m Jon. In my series of posts, I’ll be writing about how we can use the modern Bayesian toolkit to efficiently make decisions, solve problems, and formulate plans (the providence of AI), rather than restrict ourselves to approximating posteriors (the providence of statistics and much of machine learning).

Here’s a simple example of how AI can help out machine learning. What was the first graphical model you were exposed to? There’s a good chance it was Pearl’s famous “Sprinkler, Rain, Wet grass” graphical model[1]. Continue reading “Getting above the fray with lifted inference”