The Central Limit Theorem

[latexpage]The proof and intuition presented here come from this excellent writeup by Yuval Filmus, which in turn draws upon ideas in this book by Fumio Hiai and Denes Petz. Suppose that we have a sequence of real-valued random variables

\begin{equation}

X_1, X_2, \ldots .

\end{equation}

Define the random variable

\begin{equation}

A_N = \frac{X_1 + \cdots + X_N}{\sqrt{N}}

\end{equation}

to be a scaled sum of the first $N$ variables in the sequence. Now, we would like to make interesting statements about the sequence

\begin{equation}

A_1, A_2, \ldots .

\end{equation} Continue reading “The Central Limit Theorem”

Optimal Spatial Prediction with Kriging

[latexpage]Suppose we are modeling a spatial process (for instance, the amount of rainfall around the world, the distribution of natural resources, or the population density of an endangered species). We’ve measured the latent function $Z$ at some locations ${\bf s}_1, \ldots, {\bf s}_N$, and we’d like to predict the function’s value at some new location ${\bf s}_0$. Kriging is a technique for extrapolating our measurements to arbitrary locations. For an in-depth discussion, see Cressie and Wikle (2011). Here I derive Kriging in a simplified case.

I will assume that $Z$ is an intrinsically stationary process. In other words, there exists some semivariogram $\gamma({\bf h})$ such that

\[ \text{var}[Z({\bf s}+{\bf h}) – Z({\bf s})] = 2\gamma({\bf h}) . \]

Furthermore, I will assume that the process is isotropic, (i.e. that $\gamma({\bf h})$ is a function only of $||h||$). As Andy described here, the existence of a covariance function implies intrinsic stationarity. In addition, I will assume that the process has a constant mean, $\mathbb E[Z({\bf s})] = \mu$. We would like to estimate $Z({\bf s})$ with a linear combination of our current observations. Our estimator will be Continue reading “Optimal Spatial Prediction with Kriging”

Pseudo-marginal MCMC

[latexpage]This post gives a brief introduction to the pseudo-marginal approach to MCMC. A very nice explanation, with examples, is available here. Frequently, we are given a density function $\pi({\bf x})$, with ${\bf x} \in \mathcal X$, and we use Markov chain Monte Carlo (MCMC) to generate samples from the corresponding probability distribution. For simplicity, suppose we are performing Metropolis-Hastings with a spherical proposal distribution. Then, we move from the current state ${\bf x}$ to a proposed state ${\bf x}’$ with probability $\min(1, \pi({\bf x}’)/\pi({\bf x}))$ .

But what if we cannot evaluate $\pi({\bf x})$ exactly? Such a situation might arise if we are given a joint density function $\pi({\bf x}, {\bf z})$, with ${\bf z} \in \mathcal Z$, and we must marginalize out ${\bf z}$ in order to compute $\pi({\bf x})$. In this situation, we may only be able to approximate

\[ \pi({\bf x}) = \int \pi({\bf x},{\bf z}) \, \mathrm d{\bf z} ,\] Continue reading “Pseudo-marginal MCMC”

An Auxiliary Variable Trick for MCMC

[latexpage]I recently uploaded the paper “Parallel MCMC with Generalized Elliptical Slice Sampling” to the arXiv. I’d like to highlight one trick that we used, but first I’ll give some background. Markov chain Monte Carlo (MCMC) is a class of algorithms for generating samples from a specified probability distribution $\pi({\bf x})$ (in the continuous setting, the distribution is generally specified by its density function). Elliptical slice sampling is an MCMC algorithm that can be used to sample distributions of the form

\begin{equation}

\pi({\bf x}) \propto \mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma) L({\bf x}),

\end{equation}

where $\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)$ is a multivariate Gaussian prior with mean $\boldsymbol\mu$ and covariance matrix $\boldsymbol\Sigma$, and $L({\bf x})$ is a likelihood function. Suppose we want to generalize this algorithm to sample from arbitrary continuous probability distributions. We could simply factor the distribution $\pi({\bf x})$ as

\begin{equation}

\pi({\bf x}) = \mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma) \cdot \frac{\pi({\bf x})}{\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)},

\end{equation} Continue reading “An Auxiliary Variable Trick for MCMC”

Exponential Families and Maximum Entropy

[latexpage]An exponential family parametrized by $\boldsymbol\theta \in \mathbb R^d$ is the set of probability distributions that can be expressed as

\[ p({\bf x} \,|\, \boldsymbol\theta) =\frac{1}{Z(\boldsymbol\theta)} h({\bf x}) \exp\left( \boldsymbol\theta^{\mathsf T}\boldsymbol\phi({\bf x}) \right) ,\]

for given functions $Z(\boldsymbol\theta)}$ (the partition function), $h({\bf x})$, and $\boldsymbol\phi({\bf x})$ (the vector of sufficient statistics). Exponential families can be discrete or continuous, and examples include Gaussian distributions, Poisson distributions, and gamma distributions. Exponential families have a number of desirable properties. For instance, they have conjugate priors and they can summarize arbitrary amounts of data using a fixed-size vector of sufficient statistics. But in addition to their convenience, their use is theoretically justified. Continue reading “Exponential Families and Maximum Entropy”

Dealing with Reliability when Crowdsourcing

[latexpage]I recently read the paper “Variational Inference for Crowdsourcing,” by Qiang Liu, Jian Peng, and Alexander Ihler. They present an approach using belief propagation to deal with reliability when using crowdsourcing to collect labeled data. This post is based on their exposition. Crowdsourcing (via services such as Amazon Mechanical Turk) has been used as a cheap way to amass large quantities of labeled data. However, the labels are likely to be noisy.

To deal with this, a common strategy is to employ redundancy: each task is labeled by multiple workers. For simplicity, suppose there are $N$ tasks and $M$ workers, and assume that the possible labels are $\{\pm 1\}$. Define the $N \times M$ matrix $L$ so that $L_{ij}$ is the label given to task $i$ by worker $j$ (or $0$ if no label was provided). Let $z = (z_1, \ldots, z_N)$ be the true labels of the tasks. Given $L$, we wish to come up with an estimator $\hat{z}$ of $z$ so as to minimize the average error

\begin{align}

\frac{1}{N} \sum_{i=1}^N \text{prob}[\hat{z}_i \ne z_i] .

\end{align} Continue reading “Dealing with Reliability when Crowdsourcing”

A Continuous Approach to Discrete MCMC

[latexpage]Continuous problems are often simpler to solve than discrete problems. This is true in many optimization problems (for instance, linear programming versus integer linear programming). In the case of Markov chain Monte Carlo (MCMC), sampling continuous distributions has some advantages over sampling discrete distributions due to the availability of gradient information in the continuous case. The paper “Continuous Relaxations for Discrete Hamiltonian Monte Carlo” by Yichuan Zhang, Charles Sutton, Amos Storkey, and Zoubin Ghahramani explores the idea of performing inference in the discrete setting by deriving and sampling a related continuous distribution. Here I describe the approach taken in this paper. Continue reading “A Continuous Approach to Discrete MCMC”

Learning Image Features from Video

While at NIPS, I came across the paper Deep Learning of Invariant Features via Simulated Fixations in Video by Will Zou, Shenghuo Zhu, Andrew Ng,  and Kai Yu. It proposes a particularly appealing unsupervised method for using videos to learn image features. Their method appears to be somewhat inspired by the human visual system. For instance, people have access to video data, not static images. They also attempt to mimic the human tendency to fixate on particular objects. They track objects through successive frames in order to provide more coherent data to the learning algorithm.

The authors use a stacked architecture, where each layer is trained by optimizing an embedding into a feature space. As usual, the optimization problem involves a reconstruction penalty and a sparsity penalty. In addition, however, it includes a temporal slowness penalty, which seeks to minimize the \(L_1\) norm between the feature representations of consecutive frames. This enforces the intuition that good representations of images should change slowly as the images deform. Using this approach, the authors achieve improved performance on various classification tasks. Continue reading “Learning Image Features from Video”