Jeff Miller and Matthew Harrison at Brown (go Bears!) have recently explored the posterior consistency of Dirichlet process mixture (DPM) models, emphasizing one particular drawback.
For setup, say you have some observed data $x_1, ..., x_n$ from a mixture of two normals, such as
$$p(x|\theta) = \frac{1}{2}\mathcal{N}(x|0,1) + \frac{1}{2}\mathcal{N}(x|6,1)$$
In this case, the number of clusters, $s$, is two, and one would imagine that as $n$ grows, the posterior distribution of $s$ would concentrate on 2, i.e. $p(s=2|x_{1:n}) \rightarrow 1$. However, this is not true if you model the data with a DPM (or, more generally, model the mixing measure as a Dirichlet process, $Q \sim DP$).
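To make the setup concrete, here is a minimal sketch (assuming NumPy; the sample size and seed are arbitrary) of drawing such data; the question is whether the posterior $p(s|x_{1:n})$ concentrates on $s=2$ as $n$ grows.

```python
# Draw x_1, ..., x_n from the two-component mixture 0.5*N(0,1) + 0.5*N(6,1).
import numpy as np

rng = np.random.default_rng(0)
n = 1000
z = rng.integers(0, 2, size=n)           # latent component labels, each component with prob 1/2
means = np.array([0.0, 6.0])
x = rng.normal(loc=means[z], scale=1.0)  # x_i ~ N(means[z_i], 1)
```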
In their note, the authors show that the DPM is inconsistent for the number of clusters even when the data come from a single Gaussian. In fact, in this case the inconsistency is severe: $p(s=1 | x_{1:n}) \stackrel{P}{\rightarrow} 0$ as $n \rightarrow \infty$.
So if $Q \sim DP(\alpha, H)$ doesn't lead to convergence on the number of clusters, then what does? Miller and Harrison draw attention to a viable alternative called the Mixture of Finite Mixtures (MFM). The mixing measure is generated by the following process (a small simulation sketch follows the list):
- $S \sim p(s)$, a pmf on $\{1,2,...\}$
- $\pi | S=s \sim Dirichlet(\alpha_{s1}, ..., \alpha_{ss})$
- $\theta_1, ..., \theta_s | S=s \stackrel{iid}{\sim} H$
- $Q = \sum_{i=1}^{s} \pi_i \delta_{\theta_i}$.
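Here is a minimal sketch (assuming NumPy) of drawing one mixing measure $Q$ from this process. The specific choices are illustrative assumptions, not taken from the paper: $p(s)$ is a Poisson(1) shifted onto $\{1,2,...\}$, the Dirichlet is symmetric with parameter $\alpha$, and the base measure $H$ is $\mathcal{N}(0, 10^2)$ over component means.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_mfm_mixing_measure(alpha=1.0):
    s = 1 + rng.poisson(1.0)               # S ~ p(s), a pmf on {1, 2, ...} (illustrative choice)
    pi = rng.dirichlet(np.full(s, alpha))  # pi | S=s ~ Dirichlet(alpha, ..., alpha)
    theta = rng.normal(0.0, 10.0, size=s)  # theta_1, ..., theta_s | S=s ~iid H (illustrative H)
    return pi, theta                       # Q = sum_i pi_i * delta_{theta_i}

pi, theta = draw_mfm_mixing_measure()
# Data would then be generated by picking a component with probability pi_i
# and drawing from the corresponding kernel, e.g. x ~ N(theta_i, 1).
```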
This alternative construction of the mixing measure yields a consistent posterior on the number of clusters.
At the NIPS Workshop on Modern Nonparametric Methods in Machine Learning, Jeff presented his poster, which nicely outlines some of the advantages of the MFM model, as well as representations of the MFM analogous to those of the DPM. For example, there is a Chinese restaurant-like process for MFMs comparable to the Chinese restaurant process of the DPM. Furthermore, the stick-breaking construction for MFMs boils down to a Poisson process on the unit interval. The poster also provides some visual intuition for why the DP tends to overestimate the number of clusters.
So if what's important to you is the number of latent clusters, you might be better off using the MFM for posterior inference. It would be interesting to see how reversible jump MCMC or other trans-dimensional model selection methods compare to MFM posterior inference for recovering the number of clusters.