Correlation and Mutual Information

Mutual information quantifies the dependence between random variables. It is sometimes contrasted with linear correlation, since mutual information also captures nonlinear dependence. In this short note I will discuss the relationship between these quantities in the case of a bivariate Gaussian distribution, and I will explore two implications of that relationship.

As shown below, in the case of $X$ and $Y$ having a bivariate Normal distribution, mutual information is a monotonic transformation of correlation,
$$
I(X,Y) = -\frac{1}{2} \ln
\left(1 -
Corr(X,Y)^2
\right).
$$
This relationship has a couple of implications that I would like to highlight. First, it proves that lack of correlation in the bivariate Normal distribution implies independence: when the correlation between $X$ and $Y$ is zero, the mutual information is also zero. More interestingly, if you are willing to assume that the marginal distributions of $X$ and $Y$ are Normal but you are not willing to assume joint normality, this result provides a lower bound on the mutual information between $X$ and $Y$. This lower bound follows from the maximum entropy property of the Normal distribution.

Formally, the entropy of a univariate Gaussian random variable $X$ with variance $\sigma_x^2$ is
$$H(X) = \frac{1}{2} \ln (2\pi e \sigma_x^2).$$

For a bivariate Gaussian random variable, $(X,Y)$,
$$H(X,Y) = \frac{1}{2} \ln ((2\pi e)^2 (\sigma_x^2 \sigma_y^2 - \sigma_{xy}^2)),$$
where $\sigma_{xy}$ is the covariance of $X$ and $Y$.

Then the mutual information,
\begin{align*}
I(X,Y) &= H(X) + H(Y) - H(X,Y)
\\
&= \frac{1}{2} \ln
\left(
\frac{\sigma_x^2\sigma_y^2}{\sigma_x^2 \sigma_y^2 - \sigma_{xy}^2}
\right)
\\
&= \frac{1}{2} \ln
\left(
\frac{\sigma_y^2}{\sigma_y^2 - \sigma_{xy}^2/\sigma_x^2 }
\right)
\\
&= -\frac{1}{2} \ln
\left(
\frac{\sigma_y^2 - \sigma_{xy}^2/\sigma_x^2 }{\sigma_y^2}
\right)
\\
&= -\frac{1}{2} \ln
\left(1 -
\left(\frac{\sigma_{xy}}{\sigma_x \sigma_y}\right)^2
\right)
\\
&= -\frac{1}{2} \ln
\left(1 -
Corr(X,Y)^2
\right).
\end{align*}
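
To make this concrete, here is a quick numerical check in Python (the covariance values below are arbitrary): compute the three Gaussian entropies from a covariance matrix and confirm that they reproduce $-\frac{1}{2} \ln(1 - Corr(X,Y)^2)$.

```python
import numpy as np

# Arbitrary example covariance matrix for (X, Y); any valid values work.
var_x, var_y, cov_xy = 2.0, 3.0, 1.5
Sigma = np.array([[var_x, cov_xy],
                  [cov_xy, var_y]])

# Closed-form Gaussian entropies.
H_x = 0.5 * np.log(2 * np.pi * np.e * var_x)
H_y = 0.5 * np.log(2 * np.pi * np.e * var_y)
H_xy = 0.5 * np.log((2 * np.pi * np.e) ** 2 * np.linalg.det(Sigma))

# Mutual information two ways: via the entropies and via the correlation.
I_entropy = H_x + H_y - H_xy
rho = cov_xy / np.sqrt(var_x * var_y)
I_corr = -0.5 * np.log(1 - rho ** 2)

print(I_entropy, I_corr)  # the two values agree
```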

The argument for the lower bound is as follows. Consider two other random variables, $X'$ and $Y'$, that have the same covariance matrix as $X$ and $Y$ but are jointly normally distributed. Since we have assumed that the marginal distributions of $X$ and $Y$ are Normal, we have $H(X) = H(X')$ and $H(Y) = H(Y')$, and by the maximum entropy property of the Normal distribution, we have $H(X,Y) \leq H(X',Y')$. The result is then straightforward:
\begin{align*}
I(X,Y) &= H(X) + H(Y) - H(X,Y)
\\
&\geq
H(X') + H(Y') - H(X',Y')
\\
&= -\frac{1}{2} \ln
\left(1 -
Corr(X',Y')^2
\right)
\\
&= -\frac{1}{2} \ln
\left(1 -
Corr(X,Y)^2
\right).
\end{align*}

Thanks are due to Vikash Mansinghka for suggesting the first part of this exercise.

Read More

Getting above the fray with lifted inference

Hi, I’m Jon. In my series of posts, I’ll be writing about how we can use the modern Bayesian toolkit to efficiently make decisions, solve problems, and formulate plans (the province of AI), rather than restrict ourselves to approximating posteriors (the province of statistics and much of machine learning).

Here’s a simple example of how AI can help out machine learning. What was the first graphical model you were exposed to? There’s a good chance it was Pearl’s famous “Sprinkler, Rain, Wet grass” graphical model[1]. To remind you, we imagine it’s either raining or not and your sprinkler is either on or not. The sprinkler usually turns off when it rains, but not always. The lawn will usually get wet if it’s either raining or the sprinkler is on, and will almost certainly get wet if both happen. There are three nodes in our graphical model, and seven numbers suffice to describe all the conditional probabilities: one for rain, two for the sprinkler given rain, and four for the wet grass given the sprinkler and rain. It’s not too hard to learn those parameters from a small dataset of observations, or to perform inference on any subset of the variables given the others.
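
Here is a minimal sketch of that model in Python. The seven conditional probabilities below are made-up numbers in the spirit of the story (rain makes the sprinkler unlikely, either cause tends to wet the grass), not Pearl's actual values, and the query P(rain | wet grass) is answered by brute-force enumeration over the eight joint states.

```python
from itertools import product

# Made-up CPT values: P(rain), P(sprinkler | rain), P(wet | sprinkler, rain).
p_rain = 0.2
p_sprinkler = {True: 0.01, False: 0.4}             # keyed by rain
p_wet = {(True, True): 0.99, (True, False): 0.9,   # keyed by (sprinkler, rain)
         (False, True): 0.8, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) under the factorization of the graphical model."""
    p = p_rain if rain else 1 - p_rain
    p *= p_sprinkler[rain] if sprinkler else 1 - p_sprinkler[rain]
    p *= p_wet[(sprinkler, rain)] if wet else 1 - p_wet[(sprinkler, rain)]
    return p

# P(rain | wet grass) by enumeration: sum the joint over the unobserved sprinkler.
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(num / den)
```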

 

Read More

Variograms, Covariance functions and Stationarity

I just started a course on spatial statistics, so I've got covariance functions and variograms on the mind. This post is mostly for me to work through their intuition and relationship.

Say you have some spatio-temporal process, with specific locations denoted $s_1, s_2, \dots$, and the value of the process at those points denoted $z(s_1), z(s_2), \dots$. For concreteness, the locations could be latitude and longitude and the field could be the outdoor temperature. Or maybe the locations are the space-time coordinates $(x,y,t)$ of a player on a basketball court and the field is her shot percentage or scoring efficiency from that point.

Either way, one basic intuition is that values at two nearby points in a field will be more 'similar' than values at two distant points. This is a loaded assumption. Even temperature values are influenced by weather patterns, and other exogenous variables can create unexplained variance. But we'll go with it for now (and there are ways of accounting for these latent influences).

One way to encode this assumption is a covariance function, which simply takes in two points $s_1, s_2$ and spits out a covariance value. Given a covariance function $K(s, s')$, the random variables $Z(s_1), Z(s_2), \dots, Z(s_n)$ will have the covariance structure given by
\begin{align*}
C &=
\begin{pmatrix}
K(s_1, s_1) & \dots & K(s_1, s_n) \\
\vdots & \ddots & \vdots \\
K(s_n, s_1) & \dots & K(s_n, s_n)
\end{pmatrix}
\end{align*}
Note that a valid covariance function is restricted to producing only positive semidefinite matrices. One commonly used covariance function is the exponential covariance,
\begin{align*}
K(s_1, s_2) = \sigma^2 \exp(-|s_1 - s_2|)
\end{align*}
and another common choice, the squared-exponential (Gaussian) covariance, squares the distance instead: $K(s_1, s_2) = \sigma^2 \exp(-|s_1 - s_2|^2)$.
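
As a sketch (Python, with made-up one-dimensional locations), here is the covariance matrix that the exponential covariance function induces on a handful of points, along with a check that its eigenvalues are nonnegative, i.e. that it is positive semidefinite.

```python
import numpy as np

sigma2 = 2.0                        # marginal variance, arbitrary
s = np.array([0.0, 0.5, 1.0, 3.0])  # made-up 1-d locations

def K(s1, s2, sigma2=sigma2):
    """Exponential covariance function K(s1, s2) = sigma^2 exp(-|s1 - s2|)."""
    return sigma2 * np.exp(-np.abs(s1 - s2))

# Build C with C[i, j] = K(s_i, s_j).
C = K(s[:, None], s[None, :])

# Positive semidefinite: all eigenvalues are (numerically) nonnegative.
print(np.linalg.eigvalsh(C) >= -1e-10)
```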

One thing to notice is that this particular covariance function relies only on the separation between $s_1$ and $s_2$ (denoted by $h$), meaning it takes the form
\begin{align*}
K(h) = K(s+h, s)
\end{align*}
When the covariance structure of a field relies only on this separation between points, and we assume that there is some constant mean value $E(Z(s)) = \mu$, then the process is second order (or weakly) stationary. In this particular case the covariance relies only on the magnitude of the separation, not the direction, which also makes the process isotropic.

Intuitively, this covariance matrix encodes that we expect the covariance in temperature between Boston and New York to be much higher than the covariance between Seattle and Miami.

But you can also consider the variation of the difference in temperature between a pair of cities. This is the intuition behind a variogram. The variogram function describes how we expect the value of the field to vary between two positions. It is formally defined by
\begin{align*}
2 \gamma(s_1, s_2) = var( Z(s_1) - Z(s_2) )
\end{align*}
Also note that this function must be nonnegative for all locations $s_i, s_j$, since it is a variance.

When this variance of the difference relies only on the separation $h$ (and again, we have a constant mean value), then the process is said to be intrinsically stationary:
\begin{align*}
2 \gamma(h) = var( Z(s+h) - Z(s) )
\end{align*}

And it turns out that the class of second order stationary processes is a subclass of the broader class of intrinsically stationary processes.

We can also relate second order stationary covariance functions to intrinsically stationary variograms:
\begin{align*}
2 \gamma(h) &= var( Z(s+h) - Z(s) ) \\
&= E[ Z(s+h)^2 ] + E[ Z(s)^2 ] - 2 E[ Z(s+h) Z(s) ] \\
&= var(Z(s+h)) + \mu^2 + var(Z(s)) + \mu^2 - 2 (cov(Z(s+h), Z(s)) + \mu^2) \\
&= 2K(0) - 2K(h) \\
\implies \gamma(h) &= K(0) - K(h)
\end{align*}
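
Here is a quick simulation sketch of that last identity (Python, exponential covariance, zero mean, arbitrary separation $h$): draw many pairs $(Z(s), Z(s+h))$ from the implied bivariate Gaussian and compare the empirical $\frac{1}{2} var(Z(s+h) - Z(s))$ with $K(0) - K(h)$.

```python
import numpy as np

sigma2, h = 2.0, 0.7                       # arbitrary variance and separation
K = lambda d: sigma2 * np.exp(-np.abs(d))  # exponential covariance

# Bivariate Gaussian for (Z(s), Z(s+h)) under a zero-mean, stationary model.
cov = np.array([[K(0.0), K(h)],
                [K(h),  K(0.0)]])
rng = np.random.default_rng(0)
z = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)

gamma_empirical = 0.5 * np.var(z[:, 1] - z[:, 0])
gamma_theory = K(0.0) - K(h)
print(gamma_empirical, gamma_theory)  # agree up to Monte Carlo error
```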

So there you have it - two ways of modeling spatial statistical dependency and how they relate.

[1] Noel Cressie and Christopher K. Wikle. Statistics for Spatio-Temporal Data (Chapter 4). Wiley, 2011.
[2] Luke Bornn, Gavin Shaddick, and James V. Zidek. Modeling Nonstationary Processes Through Dimension Expansion. Journal of the American Statistical Association, March 2012, Vol. 107, No. 497, Theory and Methods.

Read More

What the hell is representation? *

Roger Grosse’s post on the need for a “solid theoretical framework” for “representation learning” is very intriguing. The term representation is ubiquitous in machine learning (for instance, it showed up in at least eight previous posts in this blog) and computational neuroscience (how objects and concepts are represented within the brain).

My personal fascination with the topic started after watching David Krakauer’s talk on the evolution of intelligence on earth, where he listed representation, in addition to inference, strategy, and competition, as one of the tenets of intelligence, suggesting that our representations are tightly connected to the goals we aim to accomplish, how we infer hidden causes, what strategy we take on, and what competitive forces we have to deal with.

Professor Krakauer goes on to reason that what enabled the invention of algebra was an “efficient” representation of numbers via the Arabic numeral system (think about 3998 in Arabic numerals versus Roman numerals: MMMCMXCVIII), which allowed for easier manipulation of numbers (for example, using Arabic numerals: 42 x 133 = 5586, versus using Roman numerals: XLII x CXXXIII = MMMMMDLXXXVI). The Arabic numeral system is not only more compressed, but also lends itself to easier compositional rules (it is not even clear how the second multiplication works!).

So, why is representation important? The short answer is that a “good” representation makes life so much easier, presumably by reducing the computational burden of doing inference/classification/prediction. How can we systematically arrive at a good representation? In general, good representations seem to keep only those features of the data that co-vary the most with respect to outcomes of interest. If so, then there are no “good” representations in the abstract, only good representations with respect to a set of objectives, given all the computational and resource constraints.

In conclusion, I suspect that combining unsupervised learning and supervised learning into one coherent learning framework can be a good starting point (i.e., extracting latent features from the data that are maximally predictive of outcomes of interest; see Outcome Discriminative Learning and deep learning).

 

There are these two young fish swimming along, and they happen to meet an older fish swimming the other way, who nods at them and says, "Morning, boys, how's the water?" And the two young fish swim on for a bit, and then eventually one of them looks over at the other and goes, "What the hell is water?" -- David Foster Wallace

Read More

Predictive learning vs. representation learning

When you take a machine learning class, there's a good chance it's divided into a unit on supervised learning and a unit on unsupervised learning. We certainly care about this distinction for a practical reason: often there's orders of magnitude more data available if we don't need to collect ground-truth labels. But we also tend to think it matters for more fundamental reasons. In particular, the following are some common intuitions:

  • In supervised learning, the particular algorithm is usually less important than engineering and tuning it really well. In unsupervised learning, we'd think carefully about the structure of the data and build a model which reflects that structure.
  • In supervised learning, except in small-data settings, we throw whatever features we can think of at the problem. In unsupervised learning, we carefully pick the features we think best represent the aspects of the data we care about.
  • Supervised learning seems to have many algorithms with strong theoretical guarantees, and unsupervised learning very few.
  • Off-the-shelf algorithms perform very well on a wide variety of supervised tasks, but unsupervised learning requires more care and expertise to come up with an appropriate model.

I'd argue that this is deceptive. I think the real division in machine learning isn't between supervised and unsupervised, but between what I'll term predictive learning and representation learning. I haven't heard it described in precisely this way before, but I think this distinction reflects a lot of our intuitions about how to approach a given machine learning problem.

In predictive learning, we observe data drawn from some distribution, and we are interested in predicting some aspect of this distribution. In textbook supervised learning, for instance, we observe a bunch of pairs $(x_1, y_1), \ldots, (x_N, y_N)$, and given some new example $x$, we're interested in predicting something about the corresponding $y$. In density modeling (a form of unsupervised learning), we observe unlabeled data $x_1, \ldots, x_N$, and we are interested in modeling the distribution the data comes from, perhaps so we can perform inference in that distribution. In each of these cases, there is a well-defined predictive task where we try to predict some aspect of the observable values possibly given some other aspect.

In representation learning, our goal isn't to predict observables, but to learn something about the underlying structure. In cognitive science and AI, a representation is a formal system which maps to some domain of interest in systematic ways. A good representation allows us to answer queries about the domain by manipulating that system. In machine learning, representations often take the form of vectors, either real- or binary-valued, and we can manipulate these representations with operations like Euclidean distance and matrix multiplication. For instance, PCA learns representations of data points as vectors. We can ask how similar two data points are by checking the Euclidean distance between them.
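
To make the PCA example concrete, here is a small sketch in Python with synthetic data: learn a two-dimensional representation via PCA and answer a similarity query with Euclidean distance in that representation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # synthetic data: 100 points in 10 dimensions
X = X - X.mean(axis=0)           # center before PCA

# PCA via SVD: rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:2].T                 # 2-d representation of each data point

# Similarity query: how close are points 0 and 1 in the learned representation?
print(np.linalg.norm(Z[0] - Z[1]))
```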

In representation learning, the goal isn't to make predictions about observables, but to learn a representation which would later help us to answer various queries. Sometimes the representations are meant for people, such as when we visualize data as a two-dimensional embedding. Sometimes they're meant for machines, such as when the binary vector representations learned by deep Boltzmann machines are fed into a supervised classifier. In either case, what's important is that mathematical operations map to the underlying relationships in the data in systematic ways.

Whether your goal is prediction or representation learning influences the sorts of techniques you'll use to solve the problem. If you're doing predictive learning, you'll probably try to engineer a system which exploits as much information as possible about the data, carefully using a validation set to tune parameters and monitor overfitting. If you're doing representation learning, there's no good quantitative criterion, so you'll more likely build a model based on your intuitions about the domain, and then keep staring at the learned representations to see if they make intuitive sense.

In other words, it parallels the differences I listed above between supervised and unsupervised learning. This shouldn't be surprising, because the two dimensions are strongly correlated: most supervised learning is predictive learning, and most unsupervised learning is representation learning. So to see which of these dimensions is really the crux of the issue, let's look at cases where the two differ.

Language modeling is a perfect example of an application which is unsupervised but predictive. The goal is to take a large corpus of unlabeled text (such as Wikipedia) and learn a distribution over English sentences. The problem is motivated by Bayesian models for speech recognition: a distribution over sentences can be used as a prior for what a person is likely to say. The goal, then, is to model the distribution, and any additional structure is unnecessary. Log-linear models, such as that of Mnih et al. [1], are very good at this, and recurrent neural nets [2] are even better. These are the sorts of approaches we'd normally apply in a supervised setting: very good at making predictions, but often hard to interpret. One state-of-the-art algorithm for density modeling of text is PAQ [3], which is a heavily engineered ensemble of sequential predictors, somewhat reminiscent of the winning entries of the Netflix competition.

On the flip side, supervised neural nets are often used to learn representations. One example is Collobert-Weston networks [4], which attempt to solve a number of supervised NLP tasks by learning representations which are shared between them. Some of the tasks are fairly simple and have a large amount of labeled data, such as predicting which of two words should be used to fill in the blank. Others are harder and have less data available, such as semantic role labeling. The simpler tasks are artificial, and they are there to help learn a representation of words and phrases as vectors, where similar words and phrases map to nearby vectors; this representation should then help performance on the harder tasks. We don't care about the performance on those tasks per se; we care whether the learned embeddings reflect the underlying structure. To debug and tune the algorithm, we'd focus on whether the representations make intuitive sense, rather than on the quantitative performance. There are no theoretical guarantees that such an approach would work -- it all depends on our intuitions of how the different tasks are related.

Based on these two examples, it seems like it's the predictive/representation dimension which determines how we should approach the problem, rather than supervised/unsupervised.

In machine learning, we tend to think there's no solid theoretical framework for unsupervised learning. But really, the problem is that we haven't begun to formally characterize the problem of representation learning. If you just want to build a density modeler, that's about as well understood as the supervised case. But if the goal is to learn representations which capture the underlying structure, that's much harder to formalize. In my next post, I'll try to take a stab at characterizing what representation learning is actually about.

[1] Mnih, A., and Hinton, G. E. Three new graphical models for statistical language modeling. NIPS 2009

[2] Sutskever, I., Martens, J., and Hinton, G. E. Generating text with recurrent neural networks. ICML 2011

[3] Mahoney, M. Adaptive weighting of context models for lossless data compression. Florida Institute of Technology Tech report, 2005

[4] Collobert, R., and Weston, J. A unified architecture for natural language processing: deep neural networks with multitask learning. ICML 2008

Read More

What is the Computational Capacity of the Brain?

One big recent piece of news from across the Atlantic is that the European Commission is funding a brain simulation project to the tune of a billion Euros. Roughly, the objective is to simulate the entire human brain using a supercomputer. Needless to say, many people are skeptical and there are lots of reasons that one might think this project is unlikely to yield useful results. One criticism centers on whether even a supercomputer can simulate the complexity of the brain. A first step towards simulating a brain is thinking about how many FLOP/s (floating point operations per second) would be necessary to implement similar functionality in conventional computer hardware. Here I will discuss two back-of-the-envelope estimates of this computational capacity. (Don't take these too seriously, they're a little crazy and just shooting for orders of magnitude.)

Take One: Count the Integrating Neurons
People who know about such things seem to think that the human brain has about 1e9 neurons. On average, let's assume that a neuron has incoming connections from 1e3 other neurons. Furthermore, let's assume that these neurons are firing at about 100Hz and that the post-synaptic neuron does the equivalent of 10 floating point operations each time it receives a spike. I base this sketchy guess on what would be necessary to perform a cumulative sum for an integrate-and-fire model. Put these numbers together and you need about 1e13 floating point operations one hundred times per second, or one petaFLOP/s -- 1e15 FLOP/s.

Take Two: Multiply Up the Retina
This computation is due to Hans Moravec, discussed in this article. The basic idea is to take the retina, which is a piece of the brain that we understand pretty well, and assume that the rest of the brain is similar. The starting point is to imagine that the retina is processing a 1e6 "pixel" image about ten times a second. If you figure that each of these images takes about 1e8 floating point operations to process, then you're looking at a gigaFLOP/s to match the processing power of the retina. The brain is about 1e5 times larger than the retina, which gets you to 1e14 FLOP/s. I think this is kind of fun, because it's a different way to think about the problem, but isn't impossibly far away from the previous estimate.
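
For the record, here is the arithmetic behind both estimates as a tiny Python sketch, using only the numbers quoted above.

```python
# Take one: integrate-and-fire bookkeeping with the numbers quoted above.
neurons = 1e9          # neurons in the brain
inputs = 1e3           # incoming connections per neuron
rate_hz = 100          # firing rate of those inputs
flops_per_spike = 10   # work done per incoming spike
take_one = neurons * inputs * rate_hz * flops_per_spike
print(f"take one: {take_one:.0e} FLOP/s")   # ~1e15, a petaFLOP/s

# Take two: scale up the retina.
retina_flops = 1e8 * 10    # 1e8 FLOPs per "image", ten images per second
brain_vs_retina = 1e5      # brain is ~1e5 times larger than the retina
take_two = retina_flops * brain_vs_retina
print(f"take two: {take_two:.0e} FLOP/s")   # ~1e14
```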

It's also interesting, because a petaFLOP/s is well within our current computational capabilities. An off-the-shelf gaming GPU such as the NVIDIA GeForce GTX 680 can do 3 teraFLOP/s. A warehouse full of these is a lot, but not impossibly many. Indeed, Oak Ridge National Laboratory has a machine that has hit a benchmark of 18 petaFLOP/s.

(Edit: This post originally incorrectly said that the GTX 680 had 3 gigaFLOP/s.)

Read More

Is AI scary?

In today's New York Times, Huw Price, professor of philosophy at Cambridge, writes about the need for considering the potential dangers associated with a possible "singularity." The singularity is the idea, I guess, that if people create machines that are smarter than people then those machines would be smart enough to create machines smarter than themselves, etc., and that there would be an exponential explosion in artificial intelligence. Price suggests that whether or not the singularity is likely enough to warrant study in its own right, it is the possible danger associated with it that makes it important.

I'm not remotely worried about this. As someone who has been toiling away for many months at creating an artificial intelligence algorithm that has something evolutionary about it, I feel that my pessimism (or, optimism, as Price calls it) is informed. But rather than try to explain why I'm pessimistic, I thought I would present and react to only one point that Price makes. He writes:

biology got us onto this exalted peak in the landscape, the tricks are all there for our inspection: most of it is done with the glop inside our skulls. Understand that, and you understand how to do it artificially, at least in principle. Sure, it could turn out that there’s then no way to improve things – that biology, despite all the constraints, really has hit some sort of fundamental maximum. Or it could turn out that the task of figuring out how biology did it is just beyond us, at least for the foreseeable future (even the remotely foreseeable future). But again, are you going to bet your grandchildren on that possibility?

Read More

Dealing with Reliability when Crowdsourcing

I recently read the paper "Variational Inference for Crowdsourcing," by Qiang Liu, Jian Peng, and Alexander Ihler. They present an approach using belief propagation to deal with reliability when using crowdsourcing to collect labeled data. This post is based on their exposition. Crowdsourcing (via services such as Amazon Mechanical Turk) has been used as a cheap way to amass large quantities of labeled data. However, the labels are likely to be noisy.

To deal with this, a common strategy is to employ redundancy: each task is labeled by multiple workers. For simplicity, suppose there are $N$ tasks and $M$ workers, and assume that the possible labels are $\{\pm 1\}$. Define the $N \times M$ matrix $L$ so that $L_{ij}$ is the label given to task $i$ by worker $j$ (or $0$ if no label was provided). Let $z = (z_1, \ldots, z_N)$ be the true labels of the tasks. Given $L$, we wish to come up with an estimator $\hat{z}$ of $z$ so as to minimize the average error
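
The excerpt cuts off before the belief propagation approach, but for reference here is the naive baseline that redundancy suggests: a majority vote over the workers who labeled each task. This is just a sketch with a made-up label matrix, not the method from the paper.

```python
import numpy as np

# Made-up label matrix: L[i, j] is worker j's label for task i (+1/-1), 0 if unlabeled.
L = np.array([[ 1,  1, -1,  0],
              [-1,  0, -1,  1],
              [ 0,  1,  1,  1]])

# Majority vote per task; ties (row sums of 0) are broken arbitrarily toward +1.
z_hat = np.where(L.sum(axis=1) >= 0, 1, -1)
print(z_hat)
```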

Read More

The Natural Gradient

A common activity in statistics and machine learning is optimization. For instance, finding maximum likelihood and maximum a posteriori estimates requires maximizing the likelihood function and the posterior distribution, respectively. Another example, and the motivating example for this post, is using variational inference to approximate a posterior distribution. Suppose we are interested in a posterior distribution, $p$, that we cannot compute analytically. We will approximate $p$ with the variational distribution $q(\phi)$ that is parameterized by the variational parameters $\phi$. Variational inference then proceeds to minimize the KL divergence from $q$ to $p$, $KL(q||p)$. The dominant assumption in machine learning for the form of $q$ is a product distribution, that is $q = \prod_k q_k(\phi_k)$ (where we assume there are $K$ variational parameters). It can be shown that minimizing $KL(q||p)$ is equivalent to maximizing the evidence lower bound [2], denoted $L(\phi)$.

In this post we will consider optimizing the parameters of a probability distribution and will see that using the gradient as in basic calculus can follow suboptimal directions. This can cause slow convergence or convergence to inferior local modes in non-convex problems. Along the way we will introduce some concepts from differential geometry and derive more efficient gradient directions to follow.

For the rest of this post we will assume the setting of variational inference so that the function to optimize is $L(\phi)$ and the components of $\phi$ are the variational parameters.

Basic calculus tells us that to optimize a continuous function (the objective function) we set the derivative of the function to zero and solve. In all but the simplest cases an analytic solution will not exist and we will need to resort to numerical methods. The simplest numerical optimization algorithm to maximize a function is gradient ascent (descent if one is minimizing a function) which updates the current value of the parameters, $\phi^{(j)}$, to a new value, $\phi^{(j+1)}$, such that $L(\phi^{(j)}) \leq L(\phi^{(j+1)})$ by following the gradient. The update rule is of the form
\begin{align*}
\phi^{(j+1)} = \phi^{(j)} + \gamma \nabla L(\phi)
\end{align*}
where $\nabla L(\phi) = (\frac{\partial L}{\partial \phi_1},\ldots,\frac{\partial L}{\partial \phi_K})$ is the gradient vector of $L(\phi)$ and $\gamma$ determines how far to move along the gradient. The gradient vector points in the direction that the function increases most quickly, where changes in the function are measured with respect to Euclidean distance. In other words, the gradient indicates the direction such that the smallest change in parameter values results in the largest change in function value, where changes in the parameters and function values are measured by the Euclidean distance. If the Euclidean distance between the variables being optimized (the variational parameters in our running example) is not a good measure of variation in the objective function, then gradient ascent will move suboptimally through the parameter values.

The Euclidean distance between parameter values of distributions turns out not to be a good measure of the variation between the two distributions. As a simple example [2], consider the two distributions $N(0,10000)$ and $N(10,10000)$ as well as the two distributions $N(0,0.01)$ and $N(0.1,0.01)$. The first pair of distributions are essentially the same, whereas the second pair are clearly different distributions. Yet, the Euclidean distance between the parameters in the first pair is $10$, while in the second the distance is only $0.1$. The reason for this is that probability distributions do not naturally live in Euclidean space but rather on a manifold (a statistical manifold to be precise). There are better ways of defining the distance between distributions, one of the simplest being the symmetrized Kullback-Leibler divergence
\begin{align*}
KL_{sym}(p_1,p_2) = \frac{1}{2}(KL(p_1||p_2) + KL(p_2||p_1))
\end{align*}
which will arise later.
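
Here is a quick sketch in Python of the example above, using the closed-form KL divergence between univariate Gaussians to compute the symmetrized divergence for both pairs.

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) for univariate Gaussians."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def kl_sym(mu1, var1, mu2, var2):
    return 0.5 * (kl_gauss(mu1, var1, mu2, var2) + kl_gauss(mu2, var2, mu1, var1))

# Pair 1: N(0, 10000) vs N(10, 10000); Euclidean distance between parameters is 10.
print(kl_sym(0.0, 10000.0, 10.0, 10000.0))  # small: nearly the same distribution
# Pair 2: N(0, 0.01) vs N(0.1, 0.01); Euclidean distance between parameters is 0.1.
print(kl_sym(0.0, 0.01, 0.1, 0.01))         # much larger, despite the tiny parameter distance
```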

To understand how to improve the gradient ascent algorithm when optimizing the parameters of probability distributions we must understand a bit of differential geometry. Assume that $\phi \in \Phi \subset R^n$ where $\Phi$ is a parameter space of interest. When $\Phi$ is Euclidean space with an orthonormal basis the squared distance between two points $\phi$ and $\phi + d\phi$ is given by the usual Euclidean inner product
\begin{align*}
||d\phi||^2 = \left < d\phi, d\phi \right >
\end{align*}
When $\Phi$ is a manifold the previous distance is given by the bilinear form
\begin{align*}
||d\phi||^2 = \left < d\phi, G(\phi)d\phi \right > = \sum_{ij} g_{ij}(\phi)d\phi_i d\phi_j
\end{align*}
The matrix $G(\phi) = [g_{ij} (\phi)]$ is called the Riemannian metric tensor and is an inner product on the tangent space of the manifold $\Phi$ at the point $\phi$. This means that $G$ allows us to define a norm, and thus distances on the manifold (the tangent space at a point can be thought of as the space of directional derivatives at the point). Note that $G$ depends on the location $\phi$, a consequence of the fact that $G$ is defined with respect to the tangent space at $\phi$ and so is location dependent.

As a concrete example, imagine two people standing on two different mountain tops. If one of the people is Superman (or anybody else who can fly), then the distance they would fly directly to the other person is the Euclidean distance (in $R^3$). If both people are ordinary and need to walk along the surface of the Earth, the Riemannian metric tensor tells us how to turn the Euclidean description of a small step into the distance actually traveled on the surface. Don't take this illustration too seriously, since technically all of this needs to take place in a differential patch (a small rectangle whose side lengths go to $0$). Intuitively, the Riemannian metric tensor describes how the geometry of a manifold affects a differential patch, $d\phi$, at the point $\phi$. The length of a line between two points on $d\phi$ is the distance between them. The Riemannian metric tensor either stretches or shrinks that line, and the resulting length is the distance between the two points on the manifold.

In Euclidean space with an orthonormal basis $G(\phi)$ is simply the identity matrix. When $\Phi$ is a space of parameters of probability distributions and the symmetrized KL divergence is used to measure the distance between distributions then $G(\phi)$ turns out to be the Fisher information matrix. This arises from placing an inner product on the statistical manifold of log-probability densities [3].

As stated above, the gradient vector of a function $L(\phi)$ points in the direction of steepest ascent. We can formally write this as maximize $L(\phi + d\phi)$ such that $||d\phi||^2 < \epsilon^2$ for $\epsilon$ small. To solve this optimization problem we set $d\phi = \epsilon v$ for some vector $v$. We can then rewrite the optimization problem as maximizing
\begin{align*}
L(\phi + d\phi) = L(\phi) + \epsilon \nabla L(\phi)^T v
\end{align*}
under the constraint that $||v||^2 = <v, G(\phi)v> = 1$, that is, $v$ must be unit length under the inner product defined by $G(\phi)$, and so we are searching for the direction that causes the function to increase the most, where distances arise from the inner product defined by $G$. We can rewrite this constrained optimization problem as an unconstrained problem using Lagrange multipliers
\begin{align*}
L(\phi) + \nabla L(\phi)^T v - \lambda v^T G(\phi)v
\end{align*}
and after taking the derivative with respect to $v$ and setting it to zero we obtain $\nabla L(\phi) = 2 \lambda G(\phi) v$ and so $v = \frac{1}{2\lambda}G^{-1}\nabla L(\phi)$. We can then solve for $\lambda$ using the unit-vector constraint on $v$. We see that when $\Phi$ is Euclidean space and $G$ is the identity matrix then the usual gradient vector of $L$ is the solution. When $G$ is non-trivial the vector $G(\phi)^{-1}\nabla L(\phi)$ is called the natural gradient and as we have just seen is the direction that $L$ increases most quickly when the parameter space is a manifold with Riemannian metric $G$. This tells us that a more efficient variational inference algorithm is to follow the natural gradient of the variational parameters, where the Riemannian metric tensor is just the Fisher information matrix of the variational distribution.
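
As a small illustration (a toy sketch, not taken from [1] or [2]): for a univariate Gaussian with parameters $\phi = (\mu, \sigma)$, the Fisher information matrix is $G(\phi) = diag(1/\sigma^2, 2/\sigma^2)$, so the natural gradient just rescales each coordinate of the ordinary gradient. Here it is used to fit $\mu$ and $\sigma$ by maximum likelihood, which is simpler than the full variational setting but shows the same computation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=500)   # synthetic data; targets are mu = 3, sigma = 2

def grad_avg_loglik(mu, sigma):
    """Ordinary gradient of the average Gaussian log-likelihood w.r.t. (mu, sigma)."""
    d_mu = np.mean(x - mu) / sigma**2
    d_sigma = np.mean((x - mu) ** 2) / sigma**3 - 1.0 / sigma
    return np.array([d_mu, d_sigma])

def natural_grad(mu, sigma):
    """Premultiply by the inverse Fisher information G(phi) = diag(1/sigma^2, 2/sigma^2)."""
    G_inv = np.diag([sigma**2, sigma**2 / 2.0])
    return G_inv @ grad_avg_loglik(mu, sigma)

mu, sigma = 0.0, 1.0
for _ in range(200):
    step = 0.1 * natural_grad(mu, sigma)
    mu, sigma = mu + step[0], sigma + step[1]
print(mu, sigma)   # approaches the sample mean and standard deviation
```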

The derivations in this post followed those in [1], which goes on to show what the natural gradient looks like for neural networks and other models and proves that gradient descent with the natural gradient is efficient in a certain sense. The application of differential geometry to statistics is a deep subject. See the book "Differential Geometry and Statistics" [3] for a nice introduction. Perhaps in a future post we can delve into why the Fisher information matrix appears as the Riemannian metric tensor for statistical manifolds.

References:
[1] Amari, S., Natural Gradient Works Efficiently in Learning. In Neural Computation, Vol. 10, No. 2, 1998.
[2] Hoffman, M., Blei, D. M., Wang, C., Paisley, J., Stochastic Variational Inference.  arXiv: 1206.7051.
[3] Murray, M.K., Rice, J.W., Differential Geometry and Statistics. Monographs on Statistics and Applied Probability, No. 48, 1993.

Read More

Complexity of Inference in Bayesian Networks

Developing efficient (i.e. polynomial time) algorithms with guaranteed performance is a central goal in computer science (perhaps the central goal). In machine learning, inference algorithms meeting these requirements are much rarer than we would like: often, an algorithm is either efficient but doesn't perform optimally or vice versa. A number of results from the 1990's demonstrate the challenges of, but also the potential for, efficient Bayesian inference. These results were carried out in the context of Bayesian networks.

Briefly, recall that a Bayesian network consists of a directed acyclic graph with a random variable $X_i$ at each vertex. Let $\pi_i$ be the parents of $X_i$. Then the Bayes net defines a distribution over $X = (X_1,\dots,X_n)$ of the form

\begin{equation} \Pr[X] = \prod_{i=1}^n \Pr[X_i | \pi_i]. \end{equation}

Inference in a Bayes net corresponds to calculating the conditional probability $\Pr[Y | Z = z]$, where $Y,Z \subset \{ X_1,\dots,X_n \}$ are sets of latent and observed variables, respectively. Cooper [1] showed that exact inference in Bayes nets is NP-hard. (Here and in other results mentioned, the size of the problem is given by the total size of the probability tables needed to represent the Bayes net.) The reason the Cooper result holds is essentially that Bayes nets can be used to encode boolean satisfiability (SAT) problems, so solving the generic Bayes net inference problem lets you solve any SAT problem. But SAT is NP-complete, so Bayes net inference must be NP-hard.

A number of stronger and even more interesting results followed. First, Dagum and Luby [2] showed that even approximate inference is NP-hard. Specifically, let $\epsilon \in [0,1]$, define $Y$ and $Z$ as before, and define $p := \Pr[Y | Z = z]$. Then an absolute approximation $0 \le V \le 1$ satisfies

\begin{equation} p - \epsilon \le V \le p + \epsilon \end{equation}

while  $0 \le V \le 1$ is a relative approximation if

\begin{equation} p/(1+\epsilon) \le V \le p(1 + \epsilon). \end{equation}

Dagum and Luby showed that finding either an absolute or relative approximation for $p$ deterministically is NP-hard. Furthermore, finding either type of approximation with high probability using a randomized, polynomial time algorithm is impossible unless NP $\subseteq$ RP (RP is a randomized version of P, where on negative inputs the algorithm must always be correct but can be correct on positive inputs with only probability $\ge 1/2$). There is good reason to believe this is not true because complexity theorists believe P = RP and P $\ne$ NP. Hence, we should not expect to be able to approximate conditional probabilities in arbitrary Bayes nets.

Roth [4] strengthened portions of Dagum and Luby's result by showing that relative approximation of Bayes nets is #P-complete. Recall that #P is the counting version of NP (instead of, "Is there a solution?", the question is, "How many solutions are there?"). #P is believed to be a much harder class than NP. Indeed, a polynomial time algorithm with access to an oracle for a #P-complete problem can efficiently solve any problem in the polynomial hierarchy, which includes NP, co-NP, and much more.

While these hardness of approximation results may appear disheartening, Dagum and Luby [3] were able to give a randomized, polynomial time algorithm which gives a relative approximation for $\Pr[Y | Z = z]$ under one important, but very natural assumption. The proviso is that the conditional probabilities in the Bayes net are not extreme. Specifically, let

\begin{equation} l(X_i = x_i) = \min_y \Pr[X_i = x_i | \pi_i = y] \end{equation}

and

\begin{equation} u(X_i = x_i) = \max_y \Pr[X_i = x_i | \pi_i = y]. \end{equation}

Define the local variance bound (LVB) to be the maximum value of $u(X_i = x_i)/l(X_i = x_i)$ over all possible $X_i, x_i$. The algorithm requires that the LVB be bounded by a polynomial of the size of the network. A nice property of the LVB requirement is that if each $[l(X_i = x_i), u(X_i = x_i)] \subseteq [p_i, cp_i]$ for some constant $c$ and any set of $p_i$'s, the LVB is still a constant, even if some of the $p_i$ are very small! The intuition for the LVB condition is that if you do have extreme probabilities, then you can have values "close" to 0 and 1, which would allow you to embed a SAT problem in the Bayes net. But if the ratio between the minimum and maximum conditional probabilities isn't too large, then this won't be possible.
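
To make the definition concrete, here is a sketch in Python with a made-up two-variable network (a root A and a child B); it computes $l$, $u$, and the LVB directly from the conditional probability tables.

```python
# Made-up conditional probability tables: for each variable, a dict mapping a
# parent configuration y to the distribution over that variable's values.
cpts = {
    "A": {(): {0: 0.6, 1: 0.4}},                            # A has no parents
    "B": {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},  # B's parent is A
}

def lvb(cpts):
    """Local variance bound: max over (X_i, x_i) of u(X_i = x_i) / l(X_i = x_i)."""
    worst = 1.0
    for var, table in cpts.items():
        values = next(iter(table.values())).keys()   # possible values of this variable
        for value in values:
            probs = [dist[value] for dist in table.values()]
            l, u = min(probs), max(probs)            # min/max over parent configurations y
            worst = max(worst, u / l)                # near-zero entries blow this ratio up
    return worst

print(lvb(cpts))   # 0.7 / 0.2 = 3.5 for this toy example
```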

There are a number of takeaways and issues these results raise, particularly if we think of Bayes nets as a prototype for Bayesian generative models in general. The positive results by Dagum & Luby based on excluding extreme probabilities suggest that one way to ensure tractability is to place restrictions on the distributions one considers instead of trying to work at the level of structural constraints (though these may be necessary as well). Furthermore, their results point to determinism being the enemy! As long as the $X_i$ are really random, inference is tractable. It's only when the $X_i$ are close to being deterministic that problems arise. Although I didn't mention it above, both D&L's and Roth's results point toward the fact that it matters what questions you ask (i.e. which conditional probabilities you are interested in): some may be tractable to calculate while others may not be. And finally, perhaps we are limiting ourselves by asking for probabilities! Instead of calculating conditional probabilities, we could sample from the posterior (via, say, MCMC). Indeed, this is what D&L's algorithm does, suggesting that sampling and calculating probabilities may be roughly equivalent in Bayes net world. However, this is not true in general. Indeed, under reasonable complexity assumptions, there are distributions that can be sampled in polynomial time but whose probabilities cannot be computed efficiently [5]. Thus, there are potentially interesting models with posteriors which can be sampled efficiently but not calculated efficiently.

References:

[1] Cooper, G. F., 1990. The computational complexity of probabilistic inference using Bayesian belief networks. Artif. Intell. 42, 393-405.
[2] Dagum, P. & Luby, M., 1993. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artif. Intell. 60, 141–153.
[3] Dagum, P. & Luby, M., 1997. An optimal approximation algorithm for Bayesian inference. Artif. Intell. 93, 1-27.
[4] Roth, D., 1996. On the Hardness of Approximate Reasoning. Artif. Intell. 82, 273-302.
[5] Yamakami, T., 1999. Polynomial time samplable distributions. J. of Complexity 15(4), 557-574.

Read More