[latexpage]

When implementing algorithms for inference and learning with probabilistic models, it commonly comes up that one needs to sample from a discrete distribution. That is, from a multinomial distribution with parameter $\boldsymbol{\pi}\in\mathbb{R}^K$, such that $\pi_k\geq 0$ and $\sum_{k}\pi_k=1$. A somewhat more common occurrence is that we have a $\boldsymbol{\phi}\in\mathbb{R}^K$ where $\phi_k\geq 0$, but we don’t know the normalization constant. That is, our $\boldsymbol{\phi}$ is only proportional to the multinomial parameter $\boldsymbol{\pi}$. We want to rapidly generate a variate according to $\boldsymbol{\pi}$, given $\boldsymbol{\pi}$, something easily done with (Matlab) code such as this (paraphrased from Tom Minka‘s Lightspeed Toolbox):

cdf = cumsum(phi);

samp_k = sum(cdf < rand()*cdf(end)) + 1;

This is nice and simple, but you'll notice that it has $\mathcal{O}(K)$ time complexity for setup (computing the CDF) and $\mathcal{O}(K)$ time complexity per sample. The per-sample complexity could probably be reduced to $\mathcal{O}(\log K)$ with a better data structure for finding the threshold. It turns out, however, that we can do better and get $\mathcal{O}(1)$ for the sampling, while still being $\mathcal{O}(K)$.

One such method is due to Kronmal and Peterson (1979) and is called the *alias* method. It is also described in the wonderful book by Devroye (1986). George Dahl, Hugo Larochelle and I used this method in our ICML paper on learning undirected n-gram models with large vocabularies.

The basic idea is to observe that any discrete distribution over $K$ outcomes can be turned into a *uniform* distribution over (possibly degenerate) binary outcomes. Since sampling from a uniform distribution can be done in constant time, it is easy to sample once you've computed an appropriate mixture. The setup cost is linear in $K$. You can convince yourself that such a mixture exists using induction. First, if $K=1$, it is clearly easy. For $K>1$, find $k_{\sf min} = \arg\min_k \pi_k$ and $k_{\sf max} = \arg\max_k \pi_k$. We know that $\pi_{k_{\sf min}}\leq 1/K$, so use these two to create a binary mixture between outcomes $k_{\sf min}$ and $k_{\sf max}$ where this component now owns all of the probability mass for $k_{\sf min}$ but only $1/K - \pi_{k_{\sf min}}$ of the mass for $k_{\sf max}$. Having done this, we now have a new discrete distribution with $K-1$ outcomes, which we can iterate until there is only one outcome.

In practice, this can be turned into a linear time algorithm in the following way (from Devroye (1986)), implemented in Python:

import numpy as np import numpy.random as npr def alias_setup(probs): K = len(probs) q = np.zeros(K) J = np.zeros(K, dtype=np.int) # Sort the data into the outcomes with probabilities # that are larger and smaller than 1/K. smaller = [] larger = [] for kk, prob in enumerate(probs): q[kk] = K*prob if q[kk] < 1.0: smaller.append(kk) else: larger.append(kk) # Loop though and create little binary mixtures that # appropriately allocate the larger outcomes over the # overall uniform mixture. while len(smaller) > 0 and len(larger) > 0: small = smaller.pop() large = larger.pop() J[small] = large q[large] = q[large] - (1.0 - q[small]) if q[large] < 1.0: smaller.append(large) else: larger.append(large) return J, q def alias_draw(J, q): K = len(J) # Draw from the overall uniform mixture. kk = int(np.floor(npr.rand()*K)) # Draw from the binary mixture, either keeping the # small one, or choosing the associated larger one. if npr.rand() < q[kk]: return kk else: return J[kk] K = 5 N = 1000 # Get a random probability vector. probs = npr.dirichlet(np.ones(K), 1).ravel() # Construct the table. J, q = alias_setup(probs) # Generate variates. X = np.zeros(N) for nn in xrange(N): X[nn] = alias_draw(J, q)

- R. A. Kronmal and A. V. Peterson. On the alias method for generating random variables from a discrete distribution. The American Statistician, 33(4):214-218, 1979.
- L. Devroye. Non-Uniform Random Random Variate Generation. Springer-Verlag, New York, 1986. (freely available!)

This is a really nice clever method.

Just one observation about the python implementation above: when updating the probability mass (line 29), wouldn’t it be better to rearrange it to:

q[large] = (q[large] + q[small]) – 1.0 or q[large] = (q[large] – 1.0) + q[small]

To minimise the rounding error?

(see http://www.keithschwarz.com/darts-dice-coins/ section: “A Practical Version of Vose’s Algorithm”)