I will dedicate the next few posts to variational inference methods as a way to organize my own understanding – this first one will be pretty basic. The goal of variational inference is to approximate an intractable probability distribution, $p$, with a tractable one, $q$, in a way that makes them as ‘close’ as possible. Let’s unpack that statement a bit.
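To make ‘close’ concrete, here is a minimal sketch that assumes closeness is measured by the KL divergence $KL(q \| p)$, the usual choice in variational inference (the excerpt above does not name a measure, so treat that as my assumption). The three-state distributions are made-up toy values, and `kl_divergence` is just an illustrative helper name.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions given as equal-length lists.

    Terms with q_i == 0 contribute nothing, by the usual convention.
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Toy stand-in for the target distribution (a real intractable p would
# not be available as a simple list like this).
p = [0.5, 0.3, 0.2]

# Two hypothetical candidate approximations.
q_good = [0.45, 0.35, 0.20]
q_bad = [0.10, 0.10, 0.80]

print("KL(q_good || p) =", kl_divergence(q_good, p))
print("KL(q_bad  || p) =", kl_divergence(q_bad, p))
```

The candidate with the smaller divergence is the ‘closer’ one; variational inference turns the search over $q$ into an optimization of this kind of objective.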

## Variograms, Covariance functions and Stationarity

I just started a course on spatial statistics, so I’ve got covariance functions and variograms on the mind. This post is mostly for me to work through their intuition and relationship. Say you have some spatio-temporal process observed at specific locations $s_1, s_2, \dots$, where the values of the process at those points are $z(s_1), z(s_2), \dots$. For concreteness, these locations could be latitude and longitude and the field could be the outdoor temperature. Or maybe the locations are the space-time coordinates $(x,y,t)$ of a player on a basketball court and the field is her shot percentage or scoring efficiency from that point.
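As a rough illustration of what a variogram measures, here is a sketch of the empirical semivariogram, assuming the standard definition $\gamma(h) = \frac{1}{2} E[(z(s+h) - z(s))^2]$ (the excerpt above does not state it). I use one-dimensional locations and a simulated autocorrelated series to keep things small; the function name, the tolerance argument, and the toy data are all my own choices.

```python
import random

def empirical_semivariogram(locations, values, lag, tol):
    """Half the mean squared difference over all pairs of points whose
    separation is within `tol` of `lag`. Returns NaN if no pairs qualify."""
    sq_diffs = []
    n = len(locations)
    for i in range(n):
        for j in range(i + 1, n):
            separation = abs(locations[i] - locations[j])
            if abs(separation - lag) <= tol:
                sq_diffs.append((values[i] - values[j]) ** 2)
    if not sq_diffs:
        return float("nan")
    return 0.5 * sum(sq_diffs) / len(sq_diffs)

# Toy field: an autoregressive series on a regular grid, so nearby
# locations have similar values and distant ones are less related.
rng = random.Random(0)
locations = list(range(300))
values = [0.0]
for _ in range(len(locations) - 1):
    values.append(0.9 * values[-1] + rng.gauss(0, 1))

for lag in (1, 5, 20):
    print(lag, empirical_semivariogram(locations, values, lag, tol=0.0))
```

For a process like this one, the estimate should generally grow with the lag and then level off, which is the shape the variogram is designed to expose.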

## DPMs and Consistency

Jeff Miller and Matthew Harrison at Brown (go Bears!) have recently explored the posterior consistency of Dirichlet process mixture (DPM) models, emphasizing one particular drawback. For setup, say you have some observed data $x_1, \dots, x_n$ from a mixture of two normals, such as $$p(x|\theta) = \frac{1}{2}\mathcal{N}(x|0,1) + \frac{1}{2}\mathcal{N}(x|6,1)$$ In this case, the number of clusters, $s$, is two, and one would imagine that as $n$ grows, the posterior distribution of $s$ would concentrate on 2, i.e. $p(s=2|x_{1:n}) \rightarrow 1$. However, this is not true if you model the data with a DPM (or, more generally, if you model the mixing measure as a Dirichlet process, $Q \sim DP$).
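For anyone who wants to play with this setup, here is a small sketch that draws data from the two-component mixture above. It only generates the data; it does not fit a DPM or reproduce the inconsistency result. The function name and seed handling are my own.

```python
import random

def sample_mixture(n, seed=0):
    """Draw n points from 0.5 * N(0, 1) + 0.5 * N(6, 1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Pick a component with probability 1/2 each, then draw from it.
        mean = 0.0 if rng.random() < 0.5 else 6.0
        data.append(rng.gauss(mean, 1.0))
    return data

x = sample_mixture(5000)
print("sample mean:", sum(x) / len(x))
print("fraction below 3:", sum(xi < 3 for xi in x) / len(x))
```

With the component means six standard deviations apart, the two clusters are well separated, which is what makes the failure to recover $s = 2$ surprising.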

## Nonparanormal Activity

Say you have a set of $n$ $p$-dimensional iid samples $\{ \mathbf{x}_i \}_{i=1}^n$ drawn from some unknown continuous distribution that you want to estimate with an undirected graphical model. You can sometimes get away with assuming the $\mathbf{x}_i$’s are drawn from a multivariate normal (MVN), and from there you can use a host of methods for estimating the covariance matrix $\Sigma$, and thus the precision matrix $\Omega = \Sigma^{-1}$, whose zero pattern gives the graph structure (perhaps imposing sparsity constraints for inferring structure in high-dimensional data, $n \ll p$). In other cases the Gaussian assumption is too restrictive (e.g. when marginals exhibit multimodal behavior). One way to augment the expressivity of the MVN while maintaining some of the desirable properties is to assume that some …
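To illustrate the Gaussian baseline described above, here is a sketch on a toy three-variable chain $x_1 \to x_2 \to x_3$ of my own invention. It estimates $\Sigma$ from samples, inverts it, and shows that the $(1,3)$ entry of $\Omega$ is near zero even though $x_1$ and $x_3$ are marginally correlated. It assumes plenty of samples relative to the dimension; when $n < p$ the sample covariance is singular and this plain inversion does not work.

```python
import random

def sample_covariance(rows):
    """Unbiased sample covariance matrix of a list of equal-length rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting, for small dense matrices."""
    p = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]

# Toy Gaussian chain: x3 depends on x1 only through x2.
rng = random.Random(0)
rows = []
for _ in range(5000):
    x1 = rng.gauss(0, 1)
    x2 = x1 + rng.gauss(0, 1)
    x3 = x2 + rng.gauss(0, 1)
    rows.append([x1, x2, x3])

sigma_hat = sample_covariance(rows)
omega_hat = invert(sigma_hat)
print("covariance(x1, x3):", sigma_hat[0][2])
print("precision(x1, x3): ", omega_hat[0][2])
```

The near-zero precision entry corresponds to a missing edge between $x_1$ and $x_3$ in the graph, which is the sense in which $\Omega$ encodes the structure.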