Bayesian nonparametrics in the real world

Nick Foti | Machine Learning, Statistics


Bayesian nonparametrics allow the construction of statistical models whose complexity is determined by the observed data. This is accomplished by specifying priors over infinite-dimensional distributions. The most widely used Bayesian nonparametric priors in machine learning are the Dirichlet process and the beta process, along with their corresponding marginal processes, the Chinese restaurant process and the Indian buffet process. The Dirichlet process provides a prior on the mixing measure of an infinite mixture model, while the beta process can serve as a prior on feature popularity in a latent feature model. The hierarchical Dirichlet process (HDP) also appears frequently in machine learning, underlying topic models with an infinite number of topics. A main selling point of Bayesian nonparametrics has been that model complexity can grow as more data are observed. While true in theory, current inference algorithms for Bayesian nonparametric models fail to scale to the massive data sets now available, making the models impractical. In this post I review traditional inference algorithms for Bayesian nonparametric models and discuss a new approach to handling massive data sets using stochastic variational inference.

Inference in Bayesian nonparametric models is often performed with Markov chain Monte Carlo (MCMC). Many different MCMC methods have been developed to sample from the posterior of interest; examples include Gibbs samplers, slice sampling, and retrospective sampling. In fully conditionally conjugate Bayesian nonparametric models these MCMC algorithms can be quite efficient for data sets of moderate size, but they do not scale to massive data sets. There is a growing body of work on parallelizing these algorithms to handle larger data sets, which may be a promising direction. Alternatively, sequential Monte Carlo (SMC) algorithms have been developed as scalable inference algorithms.

Variational inference is an optimization-based alternative to MCMC. The posterior distribution of interest is replaced with a simpler distribution, called the variational distribution. Most variational algorithms in machine learning make the so-called mean-field assumption, positing a fully factorized approximation to the posterior. Variational algorithms have been developed for many common Bayesian nonparametric models, including Dirichlet process mixtures, beta process latent feature models, and HDP topic models. However, with few exceptions these algorithms use a truncated representation, losing the benefit of being nonparametric. Traditionally, variational algorithms are run in batch fashion, assuming all data have been observed. The variational parameters are optimized using coordinate ascent to make the variational distribution close (in KL divergence) to the true posterior distribution. In this setting, updating the global parameters usually involves all of the data, which scales poorly with the number of observations.
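For intuition about mean-field coordinate ascent, here is a toy example (not from any of the papers discussed): approximating a correlated 2-D Gaussian "posterior" with a fully factorized Gaussian. The coordinate-ascent updates below are the standard textbook result for this model; all variable names are mine.

```python
import numpy as np

# Mean-field CAVI sketch for a toy 2-D Gaussian posterior N(mu, Sigma):
# approximate it with q(x1) q(x2), each a univariate Gaussian. With L the
# precision matrix, the standard coordinate-ascent updates are
#   m1 <- mu1 - (L12 / L11) * (m2 - mu2),  var1 <- 1 / L11
# and symmetrically for m2. Each sweep decreases KL(q || p).
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
L = np.linalg.inv(Sigma)  # precision matrix

m = np.zeros(2)  # variational means, initialized arbitrarily
for _ in range(50):  # coordinate-ascent sweeps
    m[0] = mu[0] - (L[0, 1] / L[0, 0]) * (m[1] - mu[1])
    m[1] = mu[1] - (L[1, 0] / L[1, 1]) * (m[0] - mu[0])
# The variational means converge to the true means, while the variational
# variances 1/L[i, i] underestimate the true marginal variances -- a
# well-known consequence of the mean-field assumption.
```

In a real model each sweep would also touch observation-specific parameters for every data point, which is exactly the batch cost that stochastic variational inference avoids.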

Stochastic variational inference has been introduced as a method to perform variational inference on massive or streaming data sets [1]. The algorithm updates observation-specific parameters using coordinate ascent, as in standard variational inference. When updating global parameters, however, stochastic gradient ascent is used: a noisy estimate of the gradient is followed to update the current value of the global parameter. As long as the noisy gradient equals the true gradient in expectation and we take successively smaller steps in its direction, the algorithm converges. In [1] a stochastic variational inference algorithm was derived for a truncated HDP topic model, extending the work in [2], which used stochastic variational inference to apply latent Dirichlet allocation to massive text corpora.
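The global update can be sketched as follows. The step-size schedule $\rho_t = (t + \tau)^{-\kappa}$ is the form used in [1] and [2]; the toy objective, noise model, and names below are my own illustration, not the HDP updates themselves.

```python
import numpy as np

def svi_step(global_param, noisy_gradient, t, tau=1.0, kappa=0.6):
    """One stochastic gradient ascent step on a global variational parameter.

    The step sizes rho_t = (t + tau)^(-kappa) satisfy the Robbins-Monro
    conditions (sum rho_t = inf, sum rho_t^2 < inf) for 0.5 < kappa <= 1,
    which is what guarantees convergence when the gradient estimate is
    unbiased.
    """
    rho = (t + tau) ** (-kappa)
    return global_param + rho * noisy_gradient

# Toy illustration: climb f(x) = -(x - 3)^2 using gradients corrupted by
# noise, standing in for gradients computed on random minibatches.
rng = np.random.default_rng(0)
x = 0.0
for t in range(1, 5001):
    noisy_grad = -2.0 * (x - 3.0) + rng.normal(scale=0.5)  # unbiased estimate
    x = svi_step(x, noisy_grad, t)
# x ends up near the maximizer x = 3 despite never seeing an exact gradient
```

In the actual algorithm of [1], the noisy gradient is a natural gradient computed from a single document (or minibatch) scaled up by the corpus size, so each update costs a constant amount of work regardless of the number of observations.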

Recently, stochastic variational inference has been extended to an infinite HDP topic model in [3]. The key insight of the paper was to use an infinite variational distribution while only allowing a finite number of topics to be used. Specifically, fix a number of topics $K$ and let $q(z_{nj} = k)$ be the probability under the variational distribution that word $j$ in document $n$ is assigned to topic $k$. Assume that $q(z_{nj} = k) = 0$ for $k > K$, so that all infinite-dimensional measures in the model (global topic proportions and per-document topic proportions) can be represented as $(K+1)$-dimensional vectors, where the extra dimension contains the probability of a new topic. Under this variational distribution a model with $K$ topics is nested in a model with $K+1$ topics, which does not occur in truncated models, where the leftover probability is lumped onto the final component. Because the models are nested, we can compare the variational lower bound across models with different numbers of topics and can safely add or remove topics within the variational framework. In [3] the authors proposed transdimensional split-merge moves to break a topic into two or combine two topics into one. They found that their algorithm converged to better optima of the variational lower bound than previous variational algorithms and Gibbs sampling. An alternative infinite online variational algorithm was presented in [4] and may be discussed in a future post.
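A minimal sketch of that $(K+1)$-dimensional representation and its nesting property. The function name and the particular 50/50 split below are my own illustration; [3] instead proposes splits via its split-merge moves and scores them with the variational lower bound.

```python
import numpy as np

# A point in the infinite-dimensional topic-proportion simplex is stored as
# K explicit used-topic probabilities plus one catch-all mass covering all
# as-yet-unused topics.
def add_topic(beta):
    """Grow a (K+1)-vector to a (K+2)-vector by carving a new explicit
    topic out of the catch-all mass (illustrative 50/50 split)."""
    new_topic = 0.5 * beta[-1]
    return np.concatenate([beta[:-1], [new_topic, beta[-1] - new_topic]])

beta_K = np.array([0.5, 0.3, 0.2])  # K = 2 topics + catch-all mass 0.2
beta_K1 = add_topic(beta_K)         # K = 3 topics + catch-all mass
# Nesting: the probabilities of the existing topics are unchanged and the
# vector still sums to one, so lower bounds for different K are comparable.
```

Contrast this with a hard truncation, which assigns the entire leftover mass to the last component; enlarging a truncated model changes the existing probabilities, so bounds at different truncation levels are not directly comparable.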

The infinite online variational algorithm for the HDP topic model discussed above is a step toward using Bayesian nonparametrics for massive or streaming data. However, the current algorithms have drawbacks; in particular, the number of observations (documents, for a topic model) must be known in advance. This makes the online variational algorithms of [1] and [3] unable to handle truly infinite data streams. I think this is an exciting area of research with a lot of interesting work to be done, and perhaps soon we will see Bayesian nonparametrics being used in the real world.

[1] Hoffman, M., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. arXiv:1206.7051.
[2] Hoffman, M., Blei, D. M., and Bach, F. Online learning for latent Dirichlet allocation. In NIPS 2010.
[3] Bryant, M., and Sudderth, E. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In NIPS 2012.
[4] Wang, C., and Blei, D. M. Truncation-free stochastic variational inference for Bayesian nonparametric models. In NIPS 2012.
