This is a continuation of my last post about data compression and machine learning. In this post, I will start to address the question: Does “good” compression generally lead to “good” unsupervised learning? To answer this question, we need to start with another question: What is a “good” compression algorithm?
I recently uploaded the paper “Parallel MCMC with Generalized Elliptical Slice Sampling” to the arXiv. I’d like to highlight one trick that we used, but first I’ll give some background. Markov chain Monte Carlo (MCMC) is a class of algorithms for generating samples from a specified probability distribution (in the continuous setting, the distribution is generally specified by its density function). Elliptical slice sampling is an MCMC algorithm that can be used to sample distributions of the form (1) where is a multivariate Gaussian prior with mean and covariance matrix , and is a likelihood function. Suppose we want to generalize this algorithm to sample from arbitrary continuous probability distributions. We could simply factor the distribution as (2)
Reversible jump Markov Chain Monte Carlo (RJMCMC)  is an extension of the Metropolis-Hastings algorithm that allows sampling from a distribution over models with potentially different numbers of parameters. In this post we are interested in determining the number of components to use when modeling data with a mixture model. The number of components corresponds to the dimension of the space we are walking through. The point of this post is to clear up a common error seen in the literature involving computing a Jacobian that arises in the algorithm.
As the title of the post suggests, this week I will discuss a geometric intuition for Markov’s inequality, which for a nonnegative random variable, , states This is a simple result in basic probability that still felt surprising every time I used it… until very recently. (Warning: Basic measure theoretic probability lies ahead. These notes look like they provide sufficient background if this post is confusing and you are sufficiently motivated!)
When implementing algorithms for inference and learning with probabilistic models, it commonly comes up that one needs to sample from a discrete distribution. That is, from a multinomial distribution with parameter , such that and . A somewhat more common occurrence is that we have a where , but we don’t know the normalization constant. That is, our is only proportional to the multinomial parameter . We want to rapidly generate a variate according to , given , something easily done with (Matlab) code such as this (paraphrased from Tom Minka‘s Lightspeed Toolbox): cdf = cumsum(phi); samp_k = sum(cdf < rand()*cdf(end)) + 1; This is nice and simple, but you'll notice that it has time complexity for setup (computing the ... Read More
In my last post, I argued that a major distinction in machine learning is between predictive learning and representation learning. Now I’ll take a stab at summarizing what representation learning is about. Or, at least, what I think of as the first principal component of representation learning.
Ryan Adams and I just uploaded to the arXiv our paper “High-Dimensional Probability Estimation with Deep Density Models”. In this work, we introduce the deep density model (DDM), a new approach for density estimation.
Data compression and unsupervised learning are two concepts whose relationship is perhaps underappreciated. Compression and unsupervised learning are both about finding patterns in data — but, does the similarity go any further? I argue that it does.
I don’t have a favorite distribution, but if I had to pick one, I’d say the gamma. Why not the Gaussian? Because everyone loves the Gaussian! But when you want a prior distribution for the mean of your Poisson, or the variance of your Normal, who’s there to pick up the mess when the Gaussian lets you down? The gamma. When you’re trying to actually sample that Dirichlet that makes such a nice prior distribution for categorical distributions over your favorite distribution (how about that tongue twister), who’s there to help you? You guessed it, the gamma. But if you want a distribution that you can sample millions of times during each iteration of your MCMC algorithm, well, now the … Read More
An exponential family parametrized by is the set of probability distributions that can be expressed as for given functions (the partition function), , and (the vector of sufficient statistics). Exponential families can be discrete or continuous, and examples include Gaussian distributions, Poisson distributions, and gamma distributions. Exponential families have a number of desirable properties. For instance, they have conjugate priors and they can summarize arbitrary amounts of data using a fixed-size vector of sufficient statistics. But in addition to their convenience, their use is theoretically justified.