## The Central Limit Theorem

[latexpage]The proof and intuition presented here come from this excellent writeup by Yuval Filmus, which in turn draws upon ideas in this book by Fumio Hiai and Denes Petz. Suppose that we have a sequence of real-valued random variables

X_1, X_2, \ldots .

Define the random variable

A_N = \frac{X_1 + \cdots + X_N}{\sqrt{N}}

to be a scaled sum of the first $N$ variables in the sequence. Now, we would like to make interesting statements about the sequence

A_1, A_2, \ldots .

## The Gumbel-Max Trick for Discrete Distributions

[latexpage]It often comes up in neural networks, generalized linear models, topic models and many other probabilistic models that one wishes to parameterize a discrete distribution in terms of an unconstrained vector of numbers, i.e., a vector that is not confined to the simplex, might be negative, etc. A very common way to address this is to use the “softmax” transformation:
\begin{align*}
\pi_k &= \frac{\exp\{x_k\}}{\sum_{k’=1}^K\exp\{x_{k’}\}}
\end{align*}
where the $x_k$ are unconstrained in $\mathbb{R}$, but the $\pi_k$ live on the simplex, i.e., $\pi_k \geq 0$ and $\sum_{k}\pi_k=1$. The $x_k$ parameterize a discrete distribution (not uniquely) and we can generate data by performing the softmax transformation and then doing the usual thing to draw from a discrete distribution. Interestingly, it turns out that there is an alternative way to arrive at such discrete samples, that doesn’t actually require constructing the discrete distribution.

## Pseudo-marginal MCMC

[latexpage]This post gives a brief introduction to the pseudo-marginal approach to MCMC. A very nice explanation, with examples, is available here. Frequently, we are given a density function $\pi({\bf x})$, with ${\bf x} \in \mathcal X$, and we use Markov chain Monte Carlo (MCMC) to generate samples from the corresponding probability distribution. For simplicity, suppose we are performing Metropolis-Hastings with a spherical proposal distribution. Then, we move from the current state ${\bf x}$ to a proposed state ${\bf x}’$ with probability $\min(1, \pi({\bf x}’)/\pi({\bf x}))$ .

But what if we cannot evaluate $\pi({\bf x})$ exactly? Such a situation might arise if we are given a joint density function $\pi({\bf x}, {\bf z})$, with ${\bf z} \in \mathcal Z$, and we must marginalize out ${\bf z}$ in order to compute $\pi({\bf x})$. In this situation, we may only be able to approximate

$\pi({\bf x}) = \int \pi({\bf x},{\bf z}) \, \mathrm d{\bf z} ,$ Continue reading “Pseudo-marginal MCMC”

## Chernoff’s bound

[latexpage]

If you have some randomness in your life, chances are that you want to try Chernoff’s bound. The most common way to understand randomness is a 2-step combo: find the average behavior, and show that the reality is unlikely to differ too much from the expectation (via Chernoff’s bound or its cousins).

My favorite form of Chernoff’s bound is: for $X_1, …, X_n$ independent binary random variables, and $X=\sum_{i=1}^n X_i$, $\mu = E[X]$ and $\delta > 2e-1$, then

$P(X > (1+\delta)\mu) < 2^{-\delta \mu}.$

Note that $X_i’s$ are not necessarily identically distributed, they just have to be independent. In practice, we often care about significant deviation from the mean, so $\delta$ is typically larger than $2e-1$.

In the standard applications, the stochastic system has size $N$ and an event of interest, $X$, has expectation $\mu=O(\log N)$. The probability that any one event deviates significantly is inverse polynomial in the size of the whole system

$P(X>(1+\delta)\mu) < \frac{1}{N^{\delta}}.$

This is important since the total number of events is polynomial in $N$ for many settings. By simple Union Bound,

$P(\mbox{any } X>(1+\delta)\mu) = O(\frac{1}{N}).$

So the chance of any deviation in any event is small!

## Variational Inference (part 1)

[latexpage]

I will dedicate the next few posts to variational inference methods as a way to organize my own understanding – this first one will be pretty basic.

The goal of variational inference is to approximate an intractable probability distribution, $p$, with a tractable one, $q$, in a way that makes them as ‘close’ as possible. Let’s unpack that statement a bit.

## An Auxiliary Variable Trick for MCMC

[latexpage]I recently uploaded the paper “Parallel MCMC with Generalized Elliptical Slice Sampling” to the arXiv. I’d like to highlight one trick that we used, but first I’ll give some background. Markov chain Monte Carlo (MCMC) is a class of algorithms for generating samples from a specified probability distribution $\pi({\bf x})$ (in the continuous setting, the distribution is generally specified by its density function). Elliptical slice sampling is an MCMC algorithm that can be used to sample distributions of the form

\pi({\bf x}) \propto \mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma) L({\bf x}),

where $\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)$ is a multivariate Gaussian prior with mean $\boldsymbol\mu$ and covariance matrix $\boldsymbol\Sigma$, and $L({\bf x})$ is a likelihood function. Suppose we want to generalize this algorithm to sample from arbitrary continuous probability distributions. We could simply factor the distribution $\pi({\bf x})$ as

\pi({\bf x}) = \mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma) \cdot \frac{\pi({\bf x})}{\mathcal N({\bf x};\boldsymbol\mu,\boldsymbol\Sigma)},

## A Geometric Intuition for Markov’s Inequality

[latexpage]

As the title of the post suggests, this week I will discuss a geometric intuition for Markov’s inequality, which for a nonnegative random variable, $X$, states
$$P(X \geq a) \leq E[X]/a.$$
This is a simple result in basic probability that still felt surprising every time I used it… until very recently. (Warning: Basic measure theoretic probability lies ahead. These notes look like they provide sufficient background if this post is confusing and you are sufficiently motivated!)
Continue reading “A Geometric Intuition for Markov’s Inequality”

## High-Dimensional Probability Estimation with Deep Density Models

[latexpage]

Ryan Adams and I just uploaded to the arXiv our paper “High-Dimensional Probability Estimation with Deep Density Models”. In this work, we introduce the deep density model (DDM), a new approach for density estimation. Continue reading “High-Dimensional Probability Estimation with Deep Density Models”

## Exponential Families and Maximum Entropy

[latexpage]An exponential family parametrized by $\boldsymbol\theta \in \mathbb R^d$ is the set of probability distributions that can be expressed as

$p({\bf x} \,|\, \boldsymbol\theta) =\frac{1}{Z(\boldsymbol\theta)} h({\bf x}) \exp\left( \boldsymbol\theta^{\mathsf T}\boldsymbol\phi({\bf x}) \right) ,$

for given functions $Z(\boldsymbol\theta)}$ (the partition function), $h({\bf x})$, and $\boldsymbol\phi({\bf x})$ (the vector of sufficient statistics). Exponential families can be discrete or continuous, and examples include Gaussian distributions, Poisson distributions, and gamma distributions. Exponential families have a number of desirable properties. For instance, they have conjugate priors and they can summarize arbitrary amounts of data using a fixed-size vector of sufficient statistics. But in addition to their convenience, their use is theoretically justified. Continue reading “Exponential Families and Maximum Entropy”

## The Fundamental Matrix of a Finite Markov Chain

[latexpage]
The purpose of this post is to present the very basics of potential theory for finite Markov chains. This post is by no means a complete presentation but rather aims to show that there are intuitive finite analogs of the potential kernels that arise when studying Markov chains on general state spaces. By presenting a piece of potential theory for Markov chains without the complications of measure theory I hope the reader will be able to appreciate the big picture of the general theory.
Continue reading “The Fundamental Matrix of a Finite Markov Chain”