# Method of moments

[latexpage]

The method of moments is a simple idea for estimating the parameters of a model. Suppose the observed data are sampled iid from a distribution $p(x|\theta)$, our goal is to estimate the parameter $\theta$. If we have enough data, we can get very close to the true mean of the distribution, $E[x]$, which is a  function of $\theta$: $E[x]=f_1(\theta)$. We know the form of $f_1$ since we know the form of $p(x|\theta)$.

For simple distributions, knowing just the mean is enough to invert $f_1$ to obtain $\theta$. In general, we need to calculate the higher moments, which are also known functions of $\theta$: $E[x^2] = f_2(\theta), …, E[x^n]=f_n(\theta)$. We then try to invert the systems of equations $f_1, …, f_n$ to obtain $\theta$. In practice, we typically only have enough data to accurately estimate the low order ($\leq 3$) moments.

This idea has been used in a series of papers Anima Anandkumar, Daniel Hsu, Sham Kakade, and colleagues to estimate parameters in latent variable models. A very nice introduction of the method is here. Its claim to fame is that if the modeling assumptions are correct and we have enough data, then the method is guaranteed to find the global optimal parameters. This contrasts with popular methods like EM which is often stuck at local optimum.

A particularly cool application is that we can learn all the topics of a Latent Dirichlet Allocation (LDA) model using only the first three moments.  This sounds too good to be true, but a back of the envelope calculation shows that it’s actually reasonable. In LDA, each topic is a multinomial distribution over the vocabulary. So say the vocabulary is ~10,000 words and we want to estimate 100 topics, then we have ~ 1 million parameters. The third moments contains all probabilities of the form p(word1, word2, word3), where word1,2,3 can be any words in the vocabulary. Assuming we can estimate the third moments, then we have $~10,000^3=10^{12}$ data points to estimate $10^6$ parameters. Certainly plausible!

The method is quite intuitive. Suppose for simplicity we have an LDA model where each document has exactly one topic. Let $\mu_k$ denote the (true) distribution for topic $k$. The second moment $M_2$ is just the $NxN$ matrix (N=vocab size), where the  $(i,j)$ entry is p(word_i, word_j). This can be written as a sum of outer products, $M_2 = \sum_{k=1}^K \mu_k \mu_k^{T}$. Similarly the third moment is a third order tensor, which generalizes the outer product, $M_3=\sum_{k=1}^{K} \mu_k \bigotimes \mu_k \bigotimes \mu_k$. The key point to notice is that $M_3$ is a rank K tensor, whose eigenvectors are simply $\mu_k$! Once we observe this, the rest of the algorithm comes down to using power iteration methods to recover the tensor eigenvectors, which give us the topic distributions. Very cool.