The Gumbel-Max Trick for Discrete Distributions

[latexpage]It often comes up in neural networks, generalized linear models, topic models and many other probabilistic models that one wishes to parameterize a discrete distribution in terms of an unconstrained vector of numbers, i.e., a vector that is not confined to the simplex, might be negative, etc. A very common way to address this is to use the “softmax” transformation:
\pi_k &= \frac{\exp\{x_k\}}{\sum_{k’=1}^K\exp\{x_{k’}\}}
where the $x_k$ are unconstrained in $\mathbb{R}$, but the $\pi_k$ live on the simplex, i.e., $\pi_k \geq 0$ and $\sum_{k}\pi_k=1$. The $x_k$ parameterize a discrete distribution (not uniquely) and we can generate data by performing the softmax transformation and then doing the usual thing to draw from a discrete distribution. Interestingly, it turns out that there is an alternative way to arrive at such discrete samples, that doesn’t actually require constructing the discrete distribution.

Continue reading “The Gumbel-Max Trick for Discrete Distributions”

Upcoming Conferences

We’re just about to hit conference season, so I thought I would post a public service announcement identifying various upcoming events for folks into machine learning and Bayesian modeling.

The Alias Method: Efficient Sampling with Many Discrete Outcomes

When implementing algorithms for inference and learning with probabilistic models, it commonly comes up that one needs to sample from a discrete distribution. That is, from a multinomial distribution with parameter $\boldsymbol{\pi}\in\mathbb{R}^K$, such that $\pi_k\geq 0$ and $\sum_{k}\pi_k=1$. A somewhat more common occurrence is that we have a $\boldsymbol{\phi}\in\mathbb{R}^K$ where $\phi_k\geq 0$, but we don’t know the normalization constant. That is, our $\boldsymbol{\phi}$ is only proportional to the multinomial parameter $\boldsymbol{\pi}$. We want to rapidly generate a variate according to $\boldsymbol{\pi}$, given $\boldsymbol{\pi}$, something easily done with (Matlab) code such as this (paraphrased from Tom Minka‘s Lightspeed Toolbox):

cdf = cumsum(phi);
samp_k = sum(cdf < rand()*cdf(end)) + 1;

This is nice and simple, but you'll notice that it has $\mathcal{O}(K)$ time complexity for setup (computing the CDF) and $\mathcal{O}(K)$ time complexity per sample. The per-sample complexity could probably be reduced to $\mathcal{O}(\log K)$ with a better data structure for finding the threshold. It turns out, however, that we can do better and get $\mathcal{O}(1)$ for the sampling, while still being $\mathcal{O}(K)$.

Continue reading "The Alias Method: Efficient Sampling with Many Discrete Outcomes"

What is the Computational Capacity of the Brain?

One big recent piece of news from across the Atlantic is that the European Commission is funding a brain simulation project to the tune of a billion Euros. Roughly, the objective is to simulate the entire human brain using a supercomputer. Needless to say, many people are skeptical and there are lots of reasons that one might think this project is unlikely to yield useful results. One criticism centers on whether even a supercomputer can simulate the complexity of the brain. A first step towards simulating a brain is thinking about how many FLOP/s (floating point operations per second) would be necessary to implement similar functionality in conventional computer hardware. Here I will discuss two back-of-the-envelope estimates of this computational capacity. (Don’t take these too seriously, they’re a little crazy and just shooting for orders of magnitude.)
Continue reading “What is the Computational Capacity of the Brain?”

Computing Log-Sum-Exp

[latexpage]This post is about a computational trick that everyone should know, but that doesn’t tend to be explicitly taught in machine learning courses.  Imagine that we have a set of $N$ values, $\{x_n\}^N_{n=1}$ and we want to compute the quantity


z = \log \sum_{n=1}^N \exp\{x_n\}.


This comes up all the time when you want to parameterize a multinomial distribution using a softmax, e.g., when doing logistic regression and you have more than two unordered categories.  If you want to compute the log likelihood, you’ll find such an expression due to the normalization constant.  Computing this naively can be a recipe for disaster, due to underflow or overflow, depending on the scale of the $x_n$.  Consider a simple example, with the vector $[0\;1\;0]$.  This seems pretty straightforward, and we get about $1.55$.  Now what about $[1000\;1001\;1000]$.  This seems like it should also be straightforward, but instead our computer gives us back $\inf$.  If we try $[-1000\;-999\;-1000]$ we get $-\inf$.  What’s happening here?  Well, in your typical 64-bit double, $\exp\{1000\}=\inf$ and $\exp\{-1000\}=0$ due to overflow and underflow, respectively.  Even though the log would make the numbers reasonably scaled again with infinite precision, it doesn’t work on a real computer with typical floating point operations.  What to do? Continue reading “Computing Log-Sum-Exp”

The Poisson Estimator

[latexpage]Much of what we do when we analyze data and invent algorithms is think about estimators for unknown quantities, even when we don’t directly phrase things this way.  One type of estimator that we commonly encounter is the Monte Carlo estimator, which approximates expectations via the sample mean.  That is, many problems in which we are interested involve a distribution $\pi$ on a space $\mathcal{X}$, where we wish to calculate the expectation of a function $f(x)$:


\hat{f}_{\pi} &= \int_{\mathcal{X}} \pi(x)\,f(x)\,\mathrm{d}x\\

&\approx \frac{1}{N}\sum_{n=1}^N f(x_n) \text{\qquad where } x_n \sim \pi.


This is very nice because it gives you an unbiased estimator of $\hat{f}_\pi$.  That is, the expectation of this estimator is the desired quantity.  However, one issue that comes up very often is that we want to find an unbiased estimator of a quantity that is a function of an expectation.  Of course, we know that the expectation of a function is not in general a function of the expectation, so we can’t do the easy thing that we’d like to do and wind up with an unbiased estimator. Continue reading “The Poisson Estimator”

New Blog

I’m excited to announce a new collaborative blog, written by members of the Harvard Intelligent Probabilistic Systems group.  Broadly, our group studies machine learning, statistics, and computational neuroscience, but we’re interested in lots of things outside these areas as well.  The idea is to use this as a venue to discuss interesting ideas and results — new and old — about probabilistic modeling, inference, artificial intelligence, theoretical neuroscience, or anything else research-related that strikes our fancy.  There will be posts from folks at both Harvard and MIT, in computer science, mathematics, biophysics, and BCS departments, so expect a wide variety of interests.


— Ryan Adams