# The Alias Method: Efficient Sampling with Many Discrete Outcomes

[latexpage] When implementing algorithms for inference and learning with probabilistic models, it commonly comes up that one needs to sample from a discrete distribution. That is, from a multinomial distribution with parameter $\boldsymbol{\pi}\in\mathbb{R}^K$, such that $\pi_k\geq 0$ and $\sum_{k}\pi_k=1$. A somewhat more common occurrence is that we have a $\boldsymbol{\phi}\in\mathbb{R}^K$ where $\phi_k\geq 0$, but we don’t know the normalization constant. That is, our $\boldsymbol{\phi}$ is only proportional to the multinomial parameter $\boldsymbol{\pi}$. We want to rapidly generate a variate according to $\boldsymbol{\pi}$, given $\boldsymbol{\pi}$, something easily done with (Matlab) code such as this (paraphrased from Tom Minka‘s Lightspeed Toolbox):
 cdf = cumsum(phi); samp_k = sum(cdf < rand()*cdf(end)) + 1; 
This is nice and simple, but you'll notice that it has $\mathcal{O}(K)$ time complexity for setup (computing the CDF) and $\mathcal{O}(K)$ time complexity per sample. The per-sample complexity could probably be reduced to $\mathcal{O}(\log K)$ with a better data structure for finding the threshold. It turns out, however, that we can do better and get $\mathcal{O}(1)$ for the sampling, while still being $\mathcal{O}(K)$.

One such method is due to Kronmal and Peterson (1979) and is called the alias method. It is also described in the wonderful book by Devroye (1986). George Dahl, Hugo Larochelle and I used this method in our ICML paper on learning undirected n-gram models with large vocabularies.

The basic idea is to observe that any discrete distribution over $K$ outcomes can be turned into a uniform distribution over (possibly degenerate) binary outcomes. Since sampling from a uniform distribution can be done in constant time, it is easy to sample once you've computed an appropriate mixture. The setup cost is linear in $K$. You can convince yourself that such a mixture exists using induction. First, if $K=1$, it is clearly easy. For $K>1$, find $k_{\sf min} = \arg\min_k \pi_k$ and $k_{\sf max} = \arg\max_k \pi_k$. We know that $\pi_{k_{\sf min}}\leq 1/K$, so use these two to create a binary mixture between outcomes $k_{\sf min}$ and $k_{\sf max}$ where this component now owns all of the probability mass for $k_{\sf min}$ but only $1/K - \pi_{k_{\sf min}}$ of the mass for $k_{\sf max}$. Having done this, we now have a new discrete distribution with $K-1$ outcomes, which we can iterate until there is only one outcome.

In practice, this can be turned into a linear time algorithm in the following way (from Devroye (1986)), implemented in Python:

import numpy        as np
import numpy.random as npr

def alias_setup(probs):
K       = len(probs)
q       = np.zeros(K)
J       = np.zeros(K, dtype=np.int)

# Sort the data into the outcomes with probabilities
# that are larger and smaller than 1/K.
smaller = []
larger  = []
for kk, prob in enumerate(probs):
q[kk] = K*prob
if q[kk] < 1.0:
smaller.append(kk)
else:
larger.append(kk)

# Loop though and create little binary mixtures that
# appropriately allocate the larger outcomes over the
# overall uniform mixture.
while len(smaller) > 0 and len(larger) > 0:
small = smaller.pop()
large = larger.pop()

J[small] = large
q[large] = q[large] - (1.0 - q[small])

if q[large] < 1.0:
smaller.append(large)
else:
larger.append(large)

return J, q

def alias_draw(J, q):
K  = len(J)

# Draw from the overall uniform mixture.
kk = int(np.floor(npr.rand()*K))

# Draw from the binary mixture, either keeping the
# small one, or choosing the associated larger one.
if npr.rand() < q[kk]:
return kk
else:
return J[kk]

K = 5
N = 1000

# Get a random probability vector.
probs = npr.dirichlet(np.ones(K), 1).ravel()

# Construct the table.
J, q = alias_setup(probs)

# Generate variates.
X = np.zeros(N)
for nn in xrange(N):
X[nn] = alias_draw(J, q)

• R. A. Kronmal and A. V. Peterson. On the alias method for generating random variables from a discrete distribution. The American Statistician, 33(4):214-218, 1979.
• L. Devroye. Non-Uniform Random Random Variate Generation. Springer-Verlag, New York, 1986. (freely available!)