# The Central Limit Theorem

[latexpage]The proof and intuition presented here come from this excellent writeup by Yuval Filmus, which in turn draws upon ideas in this book by Fumio Hiai and Denes Petz. Suppose that we have a sequence of real-valued random variables $X_1, X_2, \ldots$. Define the random variable

A_N = \frac{X_1 + \cdots + X_N}{\sqrt{N}}

to be a scaled sum of the first $N$ variables in the sequence. Now, we would like to make interesting statements about the sequence

A_1, A_2, \ldots .

The central limit theorem is quite general. To simplify this exposition, I will make a number of assumptions. First, I will assume that the $X_i$ are independent and identically distributed. Second, I will assume that each $X_i$ has mean $\mathbb E[X_n] = 0$ and variance $\textnormal{var}(X_n) = \sigma^2$. Finally, I will assume that the moment generating function (to be defined below) converges (this condition requires all moments of the distribution to exist).

Under these conditions, the central limit theorem tells us that

A_1, A_2, \ldots \to \mathcal N(0,\sigma^2) ,

where $N(\mu, \sigma^2)$ is the normal distribution with density function

\mathcal N(x ; \mu, \sigma^2) = \frac{1}{ \sigma \sqrt{2 \pi}} \exp\left(-\frac{ (x-\mu)^2}{2\sigma^2}\right) ,

and where $P_1 \to P_2$ means that $P_1([a,b]) \to P_2([a,b])$ for all intervals $[a,b] \subset \mathbb R$. It is not immediately obvious that the sequence $\{A_n\}$ should converge, but if it does converge, the normal distribution is the natural candidate. Suppose it converges to some distribution $D$. Now, consider the random variable

B_N = \frac{1}{\sqrt{2}} \left(\frac{X_1 + \cdots + X_N}{\sqrt{N}} + \frac{X_{N+1} + \cdots X_{2N}}{\sqrt{N}} \right) .

Of course, $B_N$ is just $A_{2N}$, so $B_N \to D$. But the first term on the RHS is $A_N$, and the second term on the RHS has the same distribution as $A_N$, so we have

B_N \to \frac{D + D}{\sqrt{2}} ,

where the two $D$'s are independent random variables with the same distribution. So $D$ and $(D + D)/\sqrt{2}$ must have the same distribution. By grouping terms in different proportions, we can derive similar properties of $D$. Since the normal distribution satisfies

\begin{eqnarray}

\mathcal N(\mu_1, \sigma_1^2) + \mathcal N(\mu_2, \sigma_2^2) & = & \mathcal N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2) \\

c \mathcal N(\mu, \sigma^2) & = & \mathcal N(c\mu, c^2\sigma^2) ,

\end{eqnarray}

it is a natural candidate for the distribution $D$.

To prove the central limit theorem, we will make use of the moment generating function

M_{X}(t) = \mathbb E [\exp (tX) ] = 1 + \mathbb E[X] t + \frac{1}{2} \mathbb  E[X^2] t^2 + \cdots

and the cumulant generating function

K_{X}(t) = \log \mathbb E[\exp (tX) ] = \mathbb E[X] t + \frac{\mathbb E[X^2] - \mathbb E[X]^2}{2} t^2 + \cdots .

The coefficients of the moment generating function and the cumulant generating function (divided by $n!$) are referred to as "moments" and "cumulants" respectively. The cumulants and the moments are closely related, and the values of one determine the values of the other. Incidentally, the moment generating function and the cumulant generating function of the normal distribution $\mathcal N(\mu,\sigma^2)$ are given by

\begin{eqnarray}

M(t) & = & \exp\left(\mu t + \frac{\sigma^2}{2} t^2 \right) \\

K(t) & = & \mu t + \frac{\sigma^2}{2} t^2 .

\end{eqnarray}

Note that the moment generating function satisfies

\begin{eqnarray}

M_{X+Y}(t) & = & M_{X}(t) M_{Y}(t) \\

M_{cX}(t) & = & M_{X}(ct) .

\end{eqnarray}

It follows that the cumulant generating function satisfies

\begin{eqnarray}

K_{X+Y}(t) & = & K_{X}(t) + K_{Y}(t) \\

K_{cX}(t) & = & K_{X}(ct) .

\end{eqnarray}

Now, we are going to use all of these tools. Let's inspect the cumulant generating function of $A_N$. We have

K_{A_N}(t) = K_{X_1}\left(\frac{t}{\sqrt{N}}\right) + \cdots + K_{X_N}\left(\frac{t}{\sqrt{N}}\right) .

Let $k^m_{X}$ be the $m$th coefficient of $K_{X}(t)$. Equating powers of $t$ above, we get

k^m_{A_N} = \left(k^m_{X_1} + \cdots + k^m_{X_N} \right) N^{-\frac{m}{2}} .

From the case $m=1$, we see that

\mathbb E[A_N] = \frac{\mathbb E[X_1] + \cdots \mathbb E[X_N]}{\sqrt{N}} = 0,

as expected. From the case $m=2$, we see that

\textnormal{var}(A_N) = \frac{\textnormal{var}(X_1) + \cdots \textnormal{var}(X_N)}{N} = \sigma^2,

also as expected. But what about higher cumulants? For $k > 2$, as $N \to \infty$, we have

k^m_{A_N} = k^m_{X_1}  N^{1-\frac{N}{2}} \to 0 .

Therefore, the higher cumulants all vanish. It follows that the cumulants of the sequence $\{A_N\}$ converge to the cumulants of $\mathcal N(0,\sigma^2)$. Therefore, the moments of the sequence $\{A_N\}$ converge to the moments of $\mathcal N(0,\sigma^2)$. It follows from Levy's continuity theorem, that $A_N \to \mathcal N(0,\sigma^2)$, as desired.

# JIT compilation in MATLAB

A few years ago MATLAB introduced a Just-In-Time (JIT) accelerator under the hood. Because the JIT acceleration runs behind the scenes, it is easy to miss (in fact, MathWorks seems to intentionally hide it so that users do not change their coding style, probably because the JIT accelerator is changed regularly). I just wanted to briefly mention what a JIT accelerator is and what it does in MATLAB. JIT compilers compile chunks of code at runtime, so instead of interpreting the MATLAB code line-by-line (the default for an interpreted language), it compiles certain chunks of code and then runs the compiled chunks as a block. The JIT acceleration in Matlab works quite well. I wrote the following code to demonstrate MATLAB's JIT in action:

N = 1e7;


# Introspection in AI

I've recently come across a fascinating blog post by Cambridge mathematician Tim Gowers. He and computational linguist Mohan Ganesalingam built a sort of automated mathematician which does the kind of "routine" mathematical proofs that mathematicians can do without backtracking. Their system was based on a formal theory of the semantics of mathematical language, together with introspection into how they solved problems. In other words, they worked through lots of simple examples and checked that their AI could solve the problems in a way that was cognitively plausible. The goal wasn't to build a useful system (standard theorem provers are way more powerful), but to provide insight into our problem solving process. This post reminded me that, while our field has long moved away from this style of research, I think there's still a lot to be gained from it.

Introspection has a long and distinguished history in the early days of AI:

• Newell and Simon's General Problem Solver [1] was explicitly based on intuitions about how we solve problems, and they developed it by working through lots of examples by hand
• Lenat's Automated Mathematician [2] was supposed to make mathematical conjectures based on heuristics that human mathematicians followed
• Hofstadter's book Fluid Concepts and Creative Analogies [3] details AI systems which serve as cognitive models for a variety of toy problems. Some nice examples are Seek-Whence (for explaining number sequences) and Copycat (for letter string analogies).

Like Gowers's work, these classic systems were intended primarily as cognitive models. Simon was inspired to work on AI when he did consulting work on group decision making for the air force and found that he didn't have an adequate formalism for explaining their decision making process [4]. Lenat was at least partially motivated by formalizing Polya's notion of a heuristic, since he thought informal lists of heuristics probably oversimplified the process of mathematical discovery and suffered from hindsight bias [5]. The idea was that translating our own heuristics into mathematically precise algorithms gives a more rigorous way of testing which strategies are capable of solving which problems.

But introspection has largely fallen out of favor in AI, at least the parts I'm familiar with (machine learning, vision, probabilistic inference). The field has a Behaviorist-like aversion to thinking about algorithms in terms of plausible mental representations and cognitive strategies. I can think of several plausible arguments against introspection:

• We already have rough outlines of a solution for most of the problems we're interested in, and introspection won't help us fill in any of the details.
• Computers are a different enough computational architecture from the brain that we can't expect any insights to carry over. I.e., computers can perform billions of sequential operations per second (which often makes brute force a more viable strategy), whereas the brain has more effective associative memory and high-level representations.
• Introspection isn't scientific, since different people will disagree. The explanations are likely just someone's idealized story about what's actually a much messier process. (E.g., cognitive scientist Jeff Shrager spent three years training to be a molecular biologist, and during this time, he couldn't find any examples of the cross-domain analogies that cognitive scientists had touted as so fundamental to scientific discovery.)
• It's hard to predict what range of problems an introspection-based strategy will generalize to.

As far as the first point, I think there are still plenty of hard problems for which we don't even have the outlines of a solution. The remaining points seem hard to argue with, but they basically just show that introspection is a source of intuition rather than a rigorous way of evaluating algorithms. I'd say we have three major sources of intuition in our field:

1. trying out models and algorithms in a variety of settings, both real-world applications and toy problems
2. peer-reviewed research in cognitive science and neuroscience
3. introspection

Neuroscience and cog sci both had a large influence on both machine learning and computer vision in the early days, but as the field as matured, algorithms have made up a larger and larger part of our intuitions. Introspection is rarely mentioned.

In my own work on structure learning, my co-authors and I made several design decisions based on how human researchers would approach a problem. In my matrix decompositions work [6]:

• we often come up with structured probabilistic models by first fitting a simpler one to the data, seeing if there are dependencies in the learned representation which aren't captured by the model, and adding those explicitly to the model
• if we're trying to do inference in a complex probabilistic model, it often works to initialize it from a simpler one

And for Gaussian process structure search [7]:

• a good strategy for building up structured GP models is to try to model the residuals of the current model, and add the resulting kernel to the model
• when we use the above strategy, the previously learned kernel hyperparameters make a good initialization for the larger model

Of course, for the aspects of our algorithms that have already been well-explored (such as inference in particular models), we went with the standard approaches. There's no point in thinking, "How would I try to factorize this matrix..." But since hardly anyone has explored how to structure model spaces in a way that's searchable, these intuitions provided useful guidance in the design of our spaces of models.

Like any source of intuition, introspection is just a starting point. It's not a rigorous scientific argument, and it ultimately needs to be backed up with math and experiments. What we really want is an explanation for the heuristics people use. In other words,

1. figure out what regularities our heuristics are exploiting
2. see if existing algorithms already exploit those same regularities
3. if not, come up with one that does
4. justify it theoretically and empirically

You can think of this as a sort of "inverse Bayesian optimization." I'm currently working on taking the heuristics from my structure search papers and explaining mathematically why they work, so that we know in what set of situations they are appropriate. Hopefully this will result in a more general and more robust way to exploit the same structure.

[1] A. Newell, J. C. Shaw, and H. A. Simon. Report on a general problem-solving program. Tech report, 1958

[2] D. B. Lenat and J. S. Brown. Why AM and Eurisko appear to work. Artificial Intellgience, 1984.

[3] D. R. Hofstadter. Fluid Concepts and Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought. Basic Books, 1996.

[4] H. A. Simon. Models of My Life. MIT Press, 1996.

[5] D. B. Lenat. The Nature of Heuristics. Tech report, 1980.

[6] R. B. Grosse, R. Salakhutdinov, W. T. Freeman, and J. B. Tenenbaum. Exploiting compositionality to explore a large space of model structures. UAI, 2012.

[7] D. Duvenaud, J. R. Lloyd, R. B. Grosse, J. B. Tenenbaum, Z. Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. ICML, 2013.

# Machine Learning Glossary

I often have a hard time understanding the terminology in machine learning, even after almost three years in the field. For example, what is a Deep Belief Network? I attended a whole summer school on Deep Learning, but I'm still not quite sure. I decided to take a leap of faith and assume this is not just because the Deep Belief Networks in my brain are not functioning properly (although I am sure this is a factor). So, I created a Machine Learning Glossary to try to define some of these terms. The glossary can be found here. I have tried to write in an unpretentious style, defining things systematically and leaving no "exercises to the reader". I also have a form for readers to request new definitions.

Because I did not bother learning how to render equations on my web site, I have had to write without equations. While annoying at times, I think this may actually help with the clarity of the definitions because I cannot hide behind equations and pretend that they explain everything. The lack of equations also keeps things at a high-level picture, which is the goal of the glossary.

At the moment the glossary is not a Wiki, but perhaps I will move to that model in the future. For now, it is an experiment and I hope it is helpful to at least a few people. If you have any thoughts on how I could improve the glossary, I would be interested to hear them!

# Optimal Spatial Prediction with Kriging

Suppose we are modeling a spatial process (for instance, the amount of rainfall around the world, the distribution of natural resources, or the population density of an endangered species). We've measured the latent function $Z$ at some locations $\mathbf{s}_1, \ldots, \mathbf{s}_N$, and we'd like to predict the function's value at some new location $\mathbf{s}_0$. Kriging is a technique for extrapolating our measurements to arbitrary locations. For an in-depth discussion, see Cressie and Wikle (2011). Here I derive Kriging in a simplified case.

I will assume that $Z$ is an intrinsically stationary process. In other words, there exists some semivariogram $\gamma({\bf h})$ such that

$\text{var}[Z(\mathbf{s}+{\bf h}) - Z(\mathbf{s})] = 2\gamma({\bf h}) .$

Furthermore, I will assume that the process is isotropic, (i.e. that $\gamma({\bf h})$ is a function only of $||h||$). As Andy described here, the existence of a covariance function implies intrinsic stationarity. In addition, I will assume that the process has a constant mean, $\mathbb E[Z(\mathbf{s})] = \mu$. We would like to estimate $Z(\mathbf{s})$ with a linear combination of our current observations. Our estimator will be

$\hat{Z} = \sum_{n=1}^N \lambda_n Z(\mathbf{s}_n) ,$

where the weights $\lambda_n$ can be positive or negative. We further require that $\sum_n \lambda_n = 1$ so that our estimate is unbiased. We would like to choose the weights $\lambda_n$ so as to minimize the mean-squared predictive error

$\text{MSPE}(\lambda_1, \ldots, \lambda_N) = \mathbb E[(\hat{Z} - Z(\mathbf{s}) )^2] .$

Let $\gamma_{nm}$ denote $\gamma(\mathbf{s}_n - \mathbf{s}_m)$. Expanding the expression for the mean-squared predictive error, we get

$\sum_{n,m} \lambda_n \lambda_m \mathbb E[Z(\mathbf{s}_n)Z(\mathbf{s}_m)] + \mathbb E[Z(\mathbf{s}_0)^2] - 2\sum_n \lambda_n \mathbb E[Z(\mathbf{s}_n) Z(\mathbf{s}_0)] .$

Adding and subtracting $\sum_n \lambda_n \mathbb E[Z(\mathbf{s}_n)^2]$, this expression breaks into $A+B$, where

\begin{eqnarray*}

A & = & -\sum_n \lambda_n \mathbb E[Z(\mathbf{s}_n)^2] + \sum_{n,m} \lambda_n\lambda_m \mathbb E[Z(\mathbf{s}_n)Z(\mathbf{s}_m)] \\

& = & -\frac12 \sum_{n,m} \lambda_n\lambda_m \mathbb E[(Z(\mathbf{s}_n) - Z(\mathbf{s}_m))^2] \\

& = & -\sum_{n,m} \lambda_n\lambda_m \gamma_{nm} ,

\end{eqnarray*}

and

\begin{eqnarray*}

B & = & \mathbb E[Z(\mathbf{s}_0)^2] - 2\sum_n \lambda_n \mathbb E[Z(\mathbf{s}_n)Z(\mathbf{s}_0)] + \sum_n \lambda_n \mathbb E[Z(\mathbf{s}_n)^2] \\

& = & \sum_n \lambda_n \mathbb E[(Z(\mathbf{s}_0) - Z(\mathbf{s}_n))^2] \\

& = & 2 \sum_n \lambda_n \gamma_{0n}.

\end{eqnarray*}

We minimize the quantity $A+B$ subject to the constraint $\sum_n \lambda_n = 1$ using Lagrange multipliers. To simplify the notation, define the matrix $\Gamma$ by $\Gamma_{nm} = \gamma_{nm}$, the vector $\boldsymbol\lambda = (\lambda_1, \ldots, \lambda_N)^{\mathsf T}$, the vector $\boldsymbol\gamma_0 = (\gamma_{01}, \ldots, \gamma_{0N})^{\mathsf T}$, and the vector $\mathbf{1} = (1, \ldots, 1)^{\mathsf T}$. Then the mean-squared predictive error is given by

$-\boldsymbol\lambda^{\mathsf T} \Gamma \boldsymbol\lambda + 2\boldsymbol\lambda^{\mathsf T} \boldsymbol\gamma_0 .$

Incorporating the Lagrange multiplier constraint, we have the quantity

$\Phi(\boldsymbol\lambda, \alpha) = -\boldsymbol\lambda^{\mathsf T} \Gamma \boldsymbol\lambda + 2\boldsymbol\lambda^{\mathsf T} \boldsymbol\gamma_0 - \alpha(\mathbf{1}^{\mathsf T}\boldsymbol\lambda - 1) .$

Differentiating $\Phi$ with respect to $\alpha$ gives back our constraint. Differentiating with respect to each $\lambda_n$ and concatenating the resulting equations into matrix form gives

\begin{eqnarray*}

& & -2\Gamma\boldsymbol\lambda + 2\boldsymbol\gamma_0 = \alpha \mathbf{1} \\

& \Leftrightarrow & \boldsymbol\lambda = \Gamma^{-1}(\boldsymbol\gamma_0 - \alpha \mathbf{1}/2) .

\end{eqnarray*}

Incorporating the constraint gives

\begin{eqnarray*}

1 & = & \mathbf{1}^{\mathsf T} \boldsymbol\lambda \\

& = & \mathbf{1}^{\mathsf T} \Gamma^{-1}(\boldsymbol\gamma_0 - \alpha \mathbf{1}/2) .

\end{eqnarray*}

Solving for $\alpha$ and plugging this back into our formula for $\boldsymbol\lambda$, we find that

$\boldsymbol\lambda = \Gamma^{-1}\left( \boldsymbol\gamma_0 - \frac{\mathbf{1}^{\mathsf T}\Gamma^{-1}\boldsymbol\gamma_0 - 1}{\mathbf{1}^{\mathsf T}\Gamma^{-1}\mathbf{1}}\mathbf{1} \right) .$

This gives us our optimal Kriging predictor.

# Fisher information

[latexpage]

I first heard about Fisher information in a statistics class, where it was given in terms of the following formulas, which I still find a bit mysterious and hard to reason about:

\begin{align*}
{\bf F}_\theta &= {\mathbb E}_x[\nabla_\theta \log p(x;\theta) (\nabla_\theta \log p(x;\theta))^T] \\
&= {\rm Cov}_x[ \nabla_\theta \log p(x;\theta) ] \\
&= {\mathbb E}_x[ -\nabla^2_\theta \log p(x; \theta) ].
\end{align*}

It was motivated in terms of computing confidence intervals for your maximum likelihood estimates. But this sounds a bit limited, especially in machine learning, where we're trying to make predictions, not present someone with a set of parameters. It doesn't really explain why Fisher information seems so ubiquitous in our field: natural gradient, Fisher kernels, Jeffreys priors, and so on.

This is how Fisher information is generally presented in machine learning textbooks. But I would choose a different starting point: Fisher information is the second derivative of KL divergence.

\begin{align*}
{\bf F}_\theta &= \nabla^2_{\theta^\prime} {\rm D}(\theta^\prime \| \theta)|_{\theta^\prime=\theta} \\
&= \nabla^2_{\theta^\prime} {\rm D}(\theta \| \theta^\prime)|_{\theta^\prime=\theta} \\
\end{align*}
(Yes, you read that correctly -- both directions of KL divergence have the same second derivative at the point where the distributions match, so locally, KL divergence is approximately symmetric.) In other words, by taking the second-order Taylor expansion, we can approximate the KL divergence between two nearby distributions with parameters $\theta$ and $\theta^\prime$ in terms of Fisher information:
${\rm D}(\theta^\prime \| \theta) \approx \frac{1}{2} (\theta^\prime - \theta)^T {\bf F}_\theta (\theta^\prime - \theta).$

Since KL divergence is roughly analogous to a distance measure between distributions, this means Fisher information serves as a local distance metric between distributions. It tells you, in effect, how much you change the distribution if you move the parameters a little bit in a given direction.

This gives us a way of visualizing Fisher information. In the following figures, each of the ovals represents the set of distributions which are distance 0.1 from the center under the Fisher metric, i.e. those distributions which have KL divergence of approximately 0.01 from the center distribution. I'll refer to these as "Fisher balls." Here is the visualization for a univariate Gaussian, parameterized in terms of mean $\mu$ and standard deviation $\sigma$:

This visualization shows that when $\sigma$ is large, changing the parameters has less effect on the distribution than when $\sigma$ is small. We can also repeat this visualization for the information form representation,

$p(x ; h, \lambda) \propto \exp \left( -\frac{\lambda}{2} x^2 + hx \right).$

Why do the ovals fan out?  Intuitively, if you rescale $h$ and $\lambda$ by the same amount, you're holding the mean fixed, so the distribution changes less than it would if you varied them independently.

Wouldn't it be great if we could find some parameterization where all the Fisher balls are unit circles?  Unfortunately, there's generally no way to enforce this globally. However, we can enforce it locally by applying an affine transformation to the parameters:

\begin{align}
\eta - \eta_0 = {\bf F}_{\theta_0}^{1/2} (\theta - \theta_0). \label{eqn:trans}
\end{align}

This stretches out the parameter space in the directions of large Fisher information and shrinks it in the directions of small Fisher information. Then KL divergence looks like squared Euclidean distance near $\eta_0$:

${\rm D}(\eta \| \eta_0) \approx {\rm D}(\eta_0 \| \eta) \approx \frac{1}{2} \|\eta - \eta_0\|^2.$

What's nice about this representation is that the local properties no longer depend on the parameterization. I.e., no matter what parameterization $\theta$ you started with, the transformed space looks roughly the same near $\eta_0$, up to a rigid transformation. This gives a way of constructing mathematical objects that are invariant to the parameterization. Roughly speaking, if an algorithm is defined in terms of local properties of the model (such as gradients), you can apply the same algorithm in the transformed space, and it won't depend on the parameterization.

When we think about Fisher information in this way, it gives some useful intuitions for why it appears in so many places:

1. As I mentioned above, Fisher information is most commonly motivated in terms of the asymptotic variance of a maximum likelihood estimator. I.e.,
$\hat{\theta} \sim {\cal N}(\theta, \frac{1}{N} {\bf F}_\theta^{-1}),$
where $N$ is the number of data points. But what this is really saying is that if you transform the space according to (\ref{eqn:trans}), the maximum likelihood estimate is distributed as a spherical Gaussian with standard deviation $1/\sqrt{N}$.
2. Natural gradient [1] is a variant of stochastic gradient descent which accounts for curvature information. It basically works by stretching the space according to (\ref{eqn:trans}), and computing the gradient in the transformed space. This is analogous to Newton's method, which essentially computes the gradient in a space that's stretched according to the Hessian. Riemannian manifold HMC [2] is an MCMC sampler based on the same idea.
3. The Jeffreys prior is an "uninformative" prior over the parameters of a probability distribution, defined as:
$p(\theta) \propto \sqrt{\det {\bf F}_\theta}.$
Since the volume of a Fisher ball is proportional to $1/\sqrt{\det {\bf F}_\theta}$, this distribution corresponds to allocating equal mass to each of the Fisher balls. The virtue is that the prior doesn't depend on how you parameterized the distribution.
4. Minimum message length [3] is a framework for model selection, based on compression. From a model you construct a two-part code for a dataset: first you encode the model parameters (using some coding scheme), and then you encode the data given the parameters. If you don't want to assume anything about the process generating the data, you might choose a coding scheme which minimizes the regret: the number of extra bits you had to spend, relative to if you were given the optimal model parameters in advance. If the parameter space is compact, an approximate regret-minimizing scheme basically involves tiling the space with K Fisher balls (for some K), using log K bits to identify one of the balls, and coding the data using the parameters at the center of that ball. In the worst case, the data distribution lies at the boundary of the ball. Interestingly, this scheme has a very similar form to the Jeffreys prior, but comes from a very different motivation.
5. Information geometry [4] is a branch of mathematics that uses differential geometry to study probabilistic models. The goal is to analyze spaces of probability distributions in terms of their intrinsic geometry, rather than by referring to some arbitrary parameterization. Defining a Riemannian manifold requires choosing a metric, and for a manifold of probability distributions, that metric is generally Fisher information.

[1] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 1998.

[2] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society Series B, 2011.

[3] C. S. Wallace. Statistical and Inductive Inference by Minimum Message Length. Springer, 2005.

[4] S. Amari and H. Nagaoka. Methods of Information Geometry. American Mathematical Society, 2007.

# The Gumbel-Max Trick for Discrete Distributions

[latexpage]It often comes up in neural networks, generalized linear models, topic models and many other probabilistic models that one wishes to parameterize a discrete distribution in terms of an unconstrained vector of numbers, i.e., a vector that is not confined to the simplex, might be negative, etc. A very common way to address this is to use the "softmax" transformation:
\begin{align*}
\pi_k &= \frac{\exp\{x_k\}}{\sum_{k'=1}^K\exp\{x_{k'}\}}
\end{align*}
where the $x_k$ are unconstrained in $\mathbb{R}$, but the $\pi_k$ live on the simplex, i.e., $\pi_k \geq 0$ and $\sum_{k}\pi_k=1$. The $x_k$ parameterize a discrete distribution (not uniquely) and we can generate data by performing the softmax transformation and then doing the usual thing to draw from a discrete distribution. Interestingly, it turns out that there is an alternative way to arrive at such discrete samples, that doesn't actually require constructing the discrete distribution.

It turns out that the following trick is equivalent to the softmax-discrete procedure: add Gumbel noise to each $x_k$ and then take the argmax. That is, add independent noise to each one and then do a max. This doesn't change the asymptotic complexity of the algorithm, but opens the door to some interesting implementation possibilities. How does this work? The Gumbel distribution with unit scale and location parameter $\mu$ has the following PDF:
\begin{align*}
f(z\,;\,\mu) &= \exp\{-(z-\mu) - \exp\{-(z-\mu)\}\}.
\end{align*}
The CDF of the Gumbel is
\begin{align*}
F(z\,;\,\mu) &= \exp\{-\exp\{-(z-\mu)\}\}
\end{align*}
Now, imagine that the $k$th of our Gumbels, with location $x_k$, resulted in an outcome $z_k$. The probability that all of the other $z_{k'\neq k}$ are less than this is
\begin{align*}
\Pr(z_k \text{ is largest}\,|\, z_k, \{x_{k'}\}^K_{k'=1}) &= \prod_{k'\neq k}\exp\{-\exp\{-(z_k-x_{k'})\}\}
\end{align*}
We know the marginal distribution over $z_k$ and we need to integrate it out to find the overall probability:
\begin{align*}
\Pr(\text{$k$ is largest}\,|\,\{x_{k'}\}) =
\int \exp\{-(z_k-x_k)-\exp\{-(z_k-x_k)\}\}\\
\end{align*}
With a little bit of algebra, we get:
\begin{align*}
\Pr(\text{$k$ is largest}\,|\,\{x_{k'}\}) =&
\int \exp\{-z_k + x_k \\
\end{align*}
It turns out that this integral has closed form:
\begin{align*}
\Pr(\text{$k$ is largest}\,|\,\{x_{k'}\}) = \frac{\exp\{x_k\}}{\sum_{k'=1}^K\exp\{x_{k'}\}}
\end{align*}
We can see that this is exactly the softmax probability!

# Pseudo-marginal MCMC

This post gives a brief introduction to the pseudo-marginal approach to MCMC. A very nice explanation, with examples, is available here. Frequently, we are given a density function $\pi(\mathbf{x})$, with $\mathbf{x} \in \mathcal X$, and we use Markov chain Monte Carlo (MCMC) to generate samples from the corresponding probability distribution. For simplicity, suppose we are performing Metropolis-Hastings with a spherical proposal distribution. Then, we move from the current state $\mathbf{x}$ to a proposed state $\mathbf{x}'$ with probability $\min(1, \pi(\mathbf{x}')/\pi(\mathbf{x}))$ .

But what if we cannot evaluate $\pi(\mathbf{x})$ exactly? Such a situation might arise if we are given a joint density function $\pi(\mathbf{x}, \mathbf{z})$, with $\mathbf{z} \in \mathcal Z$, and we must marginalize out $\mathbf{z}$ in order to compute $\pi(\mathbf{x})$. In this situation, we may only be able to approximate

$$\pi(\mathbf{x}) = \int \pi(\mathbf{x},\mathbf{z}) \, \mathrm d\mathbf{z} ,$$

for instance with importance sampling. If we draw i.i.d. variables $\{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$ from the distribution the density function $q(\mathbf{z})$, then our importance sampling estimate will be

$$\tilde{\pi}(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{\pi(\mathbf{x}, \mathbf{z}_n)}{q(\mathbf{z}_n)} .$$

What happens if we go ahead and run Metropolis-Hastings on the estimated density function $\tilde{\pi}(\mathbf{x})$? If we generate new variables $\{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$ once for each proposal, then we can view this as performing a random walk through $\mathcal X \times \mathcal Z^N$. To be clear, if we are at the current state $(\mathbf{x},\mathbf{z}) = (\mathbf{x}, \mathbf{z}_1, \ldots, \mathbf{z}_N)$, we propose the state $(\mathbf{x}',\mathbf{z}') = ({\bf x'}, \mathbf{z}_1', \ldots, \mathbf{z}_N')$ by drawing ${\bf x'}$ from the original spherical proposal distribution centered on $\mathbf{x}$ and by drawing each $\mathbf{z}_n'$ from $q(\mathbf{z})$. We then accept the proposed state with probability

\begin{eqnarray*}

\frac{\sum_n \frac{\pi(\mathbf{x}', \mathbf{z}_n')}{q(\mathbf{z}_n')}}{\sum_n \frac{\pi(\mathbf{x}, \mathbf{z}_n)}{q(\mathbf{z}_n)}} & = & \frac{\left(\sum_n \frac{\pi(\mathbf{x}', \mathbf{z}_n') }{q(\mathbf{z}_n')}\right)\prod_n q(\mathbf{z}_n')}{\left(\sum_n \frac{\pi(\mathbf{x}, \mathbf{z}_n)}{q(\mathbf{z}_n)}\right)\prod_n q(\mathbf{z}_n)} \cdot \frac{\prod_n q(\mathbf{z}_n)}{\prod_n q(\mathbf{z}_n')}

\end{eqnarray*}

Since the ratio of transition probabilities is given by

$$\frac{P[(\mathbf{x}',\mathbf{z}') \to (\mathbf{x},\mathbf{z})]}{P[(\mathbf{x},\mathbf{z}) \to (\mathbf{x}',\mathbf{z}')]} = \frac{\prod_n q(\mathbf{z}_n)}{\prod_n q(\mathbf{z}_n')} ,$$

we can view this approximate Metropolis-Hastings algorithm on $\mathcal X$ as an exact Metropolis-Hastings algorithm on $\mathcal X \times \mathcal Z^N$ with stationary distribution given by

$$p(\mathbf{x}, \mathbf{z}) \propto \left( \sum_n \frac{\pi(\mathbf{x}, \mathbf{z}_n) }{q(\mathbf{z}_n)}\right) \prod_n q(\mathbf{z}_n) .$$

For each $x \in \mathcal X$, define the function $w_x \colon \mathcal Z^N \to \mathbb R$ by

$$w_{\mathbf{x}}(\mathbf{z}_1, \ldots, \mathbf{z}_N) = \frac{\frac{1}{N}\sum_n \frac{\pi(\mathbf{x}, \mathbf{z}_n)}{q(\mathbf{z}_n)}}{\pi(\mathbf{x})} .$$

This is essentially the random variable $\tilde{\pi}(\mathbf{x})/\pi(\mathbf{x})$. It follows that

$$p(\mathbf{x}, \mathbf{z}) \propto \pi(\mathbf{x}) \, w_{\mathbf{x}}(\mathbf{z}_1, \ldots, \mathbf{z}_N) \, \prod_n q(\mathbf{z}_n) .$$

But

\int p(\mathbf{x}, \mathbf{z}) \, \mathrm d\mathbf{z} \propto \pi(\mathbf{x}) \, \mathbb E[w_{\mathbf{x}}]  = \pi(\mathbf{x}) .

Indeed,

\begin{eqnarray*}

\mathbb E[w_{\mathbf{x}}] & = & \frac{1}{N} \sum_{n = 1}^N \int \frac{\pi(\mathbf{x}, \mathbf{z}_n)}{\pi(\mathbf{x}) \, q(\mathbf{z}_n)} \prod_{m=1}^N q(\mathbf{z}_m) \, \mathrm d\mathbf{z} \\

& = & \frac{1}{N} \sum_{n = 1}^N \int \pi(\mathbf{z}_n \,|\, \mathbf{x})  \, \mathrm d\mathbf{z}_n \\

& = & 1.

\end{eqnarray*}

The important part is that $\mathbb E[w_{\mathbf{x}}]$ is independent of $\mathbf{x}$. Therefore, equation (1) tells us that the stationary distribution of our Markov chain has the desired marginal distribution. In other words, we can run Metropolis-Hastings on the approximation $\tilde{\pi}(\mathbf{x})$ of the true density function and still get the correct outcome.

# Chernoff's bound

[latexpage]

If you have some randomness in your life, chances are that you want to try Chernoff's bound. The most common way to understand randomness is a 2-step combo: find the average behavior, and show that the reality is unlikely to differ too much from the expectation (via Chernoff's bound or its cousins).

My favorite form of Chernoff's bound is: for $X_1, ..., X_n$ independent binary random variables, and $X=\sum_{i=1}^n X_i$, $\mu = E[X]$ and $\delta > 2e-1$, then

$P(X > (1+\delta)\mu) < 2^{-\delta \mu}.$

Note that $X_i's$ are not necessarily identically distributed, they just have to be independent. In practice, we often care about significant deviation from the mean, so $\delta$ is typically larger than $2e-1$.

In the standard applications, the stochastic system has size $N$ and an event of interest, $X$, has expectation $\mu=O(\log N)$. The probability that any one event deviates significantly is inverse polynomial in the size of the whole system

$P(X>(1+\delta)\mu) < \frac{1}{N^{\delta}}.$

This is important since the total number of events is polynomial in $N$ for many settings. By simple Union Bound,

$P(\mbox{any } X>(1+\delta)\mu) = O(\frac{1}{N}).$

So the chance of any deviation in any event is small!