The Natural Gradient

[latexpage]

A common activity in statistics and machine learning is optimization. For instance, finding maximum likelihood and maximum a posteriori estimates require maximizing the likilihood function and posterior distribution respectively. Another example, and the motivating example for this post, is using variational inference to approximate a posterior distribution. Suppose we are interested in a posterior distribution, $p$, that we cannot compute analytically. We will approximate $p$ with the variational distribution $q(\phi)$ that is parameterized by the variational parameters $\phi$. Variational inference then proceeds to minimize the KL divergence from $q$ to $p$, $KL(q||p)$. The dominant assumption in machine learning for the form of $q$ is a product distribution, that is $q = \prod_k q_k(\phi_k)$ (where we assume there are $K$ variational parameters). It can be shown that minimizing $KL(q||p)$ is equivalent to maximizing the evidence lower bound [2], denoted $L(\phi)$.

In this post we will consider optimizing the parameters of a probability distribution and will see that using the gradient as in basic calculus can follow suboptimal directions. This can cause slow convergence or convergence to inferior local modes in non-convex problems. Along the way we will introduce some concepts from differential geometry and derive more efficient gradient directions to follow. Continue reading “The Natural Gradient”

Complexity of Inference in Bayesian Networks

[latexpage]

Developing efficient (i.e. polynomial time) algorithms with guaranteed performance is a central goal in computer science (perhaps the central goal). In machine learning, inference algorithms meeting these requirements are much rarer than we would like: often, an algorithm is either efficient but doesn’t perform optimally or vice versa. A number of results from the 1990’s demonstrate the challenges of, but also the potential for, efficient Bayesian inference. These results were carried out in the context of Bayesian networks.

Continue reading “Complexity of Inference in Bayesian Networks”

Markov chain centenary

I just attended a fun event, Celebrating 100 Years of Markov Chains, at the Institute for Applied Computational Science. There were three talks and they were taped, so hopefully you will be able to find the videos through the IACS website in the near future. Below, I will review some highlights of the first two talks by Brian Hayes and Ryan Adams; I’m skipping the last one because it was more of a review of concepts building up to and surrounding Markov chain Monte Carlo (MCMC).

The first talk was intriguingly called “First Links in the Markov Chain: Poetry and Probability” Continue reading “Markov chain centenary”

DPMs and Consistency

[latexpage]

Jeff Miller and Matthew Harrison at Brown (go Bears!) have recently explored the posterior consistency of Dirichlet process mixture (DPM) models, emphasizing one particular drawback.

For setup, say you have some observed data $x_1, …, x_n$ from a mixture of two normals, such as

$$p(x|\theta) = \frac{1}{2}\mathcal{N}(x|0,1) + \frac{1}{2}\mathcal{N}(x|6,1)$$

In this case, the number of clusters, $s$, is two, and one would imagine that as $n$ grows, the posterior distribution of $s$ would converge to 2, i.e. $p(s=2|x_{1:n}) \rightarrow 1$. However, this is not true if you model the data with a DPM (or more generally, modeling the mixing measure as a Dirichlet process, $Q \sim DP$).

Continue reading “DPMs and Consistency”

Unbiased estimators of partition functions are basically lower bounds

In machine learning, we often want to evaluate how well a model describes a dataset. In an unsupervised setting, we might use one of two criteria:

  1. marginal likelihood, or Bayes factor: the probability of the data, with all parameters and latent variables integrated out
  2. held-out likelihood, or the probability of held-out test data under the parameters learned on the training set

Both of these criteria can require computing difficult high-dimensional sums or integrals, which I’ll refer to here as the partition function. In most cases, it’s infeasible to solve these integrals exactly, so we rely on approximation techniques, such as variational inference or sampling.

Often you’ll hear claims that such-and-such an algorithm is an unbiased estimator of the partition function. What does this mean in practice, and why do we care?  Our intuitions from other sorts of unbiased estimators can be very misleading here. Continue reading “Unbiased estimators of partition functions are basically lower bounds”

Priors for Functional and Effective Connectivity

In my previous post I suggested that models of neural computation can be expressed as prior distributions over functional and effective connectivity, and with this common specification we can compare models by their posterior probability given neural recordings. I would like to explore this idea in more detail by first describing functional and effective connectivity and then considering how various models could be expressed in this framework.

Functional and effective connectivity are concepts originating in neuroimaging and spike train analysis. Functional connectivity measures the correlation between neurophysiological events (e.g. spikes on neurons or BOLD signal in fMRI voxels), whereas effective connectivity is a statement about the causal nature of a system. Effective connectivity captures the influence one neurophysiological event has upon another, either directly via a synapse, or indirectly via a polysynaptic pathway or a parallel connection. In my usage, effective connectivity may include deterministic as well as stochastic relationships. Both concepts are in contrast to structural connectivity which captures the physical synapses or fiber tracts within the brain. Of course these concepts are interrelated: functional and effective connectivity are ultimately mediated by structural connectivity, and causal effective connections imply correlational functional connections.
Continue reading “Priors for Functional and Effective Connectivity”

Machine Learning that Doesn’t Matter

At ICML last year, Kiri Wagstaff (KW) delivered a plenary talk and accompanying paper entitled “Machine Learning that Matters.” KW, a researcher at the NASA Jet Propulsion Laboratory (JPL), draws attention to a number of very serious issues in the field but draws conclusions that differ from my own.

KW criticizes existing benchmark data sets such as the UCI data sets or the MNIST handwritten digit data set for being irrelevant or obsolete. I certainly agree that being state-of-the-art on MNIST is not necessarily important (see my last post for more discussion on the need carefully craft competitions based on benchmark data sets).

Continue reading “Machine Learning that Doesn’t Matter”

A Continuous Approach to Discrete MCMC

[latexpage]Continuous problems are often simpler to solve than discrete problems. This is true in many optimization problems (for instance, linear programming versus integer linear programming). In the case of Markov chain Monte Carlo (MCMC), sampling continuous distributions has some advantages over sampling discrete distributions due to the availability of gradient information in the continuous case. The paper “Continuous Relaxations for Discrete Hamiltonian Monte Carlo” by Yichuan Zhang, Charles Sutton, Amos Storkey, and Zoubin Ghahramani explores the idea of performing inference in the discrete setting by deriving and sampling a related continuous distribution. Here I describe the approach taken in this paper. Continue reading “A Continuous Approach to Discrete MCMC”

Bayesian nonparametrics in the real world

[latexpage]

Bayesian nonparametrics allow the contruction of statistical models whose complexity is determined by the observed data. This is accomplished by specifying priors over infinite dimensional distributions. The most widely used Bayesian nonparametric priors in machine learning are the Dirichlet process, the beta process and their corresponding marginal processes the Chinese restaurant process and the Indian buffet process respectively. The Dirichlet process provides a prior for the mixing measure of an infinite mixture model and the beta process can be used as a prior for feature popularity in a latent feature model. The hierarchical Dirichlet process (HDP) also appears frequently in machine learning underlying topic models with an infinite number of topics. A main selling point of Bayesian nonparametrics has been that the complexity of the model can grow as more data is observed. While true in theory, current inference algorithms for Bayesian nonparametric models fail to scale to the massive data sets available making the model impractical. In this post I review traditional inference algorithms for Bayesian nonparametric models and discuss a new approach for handling massive data sets using stochastic variational inference. Continue reading “Bayesian nonparametrics in the real world”

Hashing, streaming and sketching

One of the questions in the air at NIPS 2012 was, how do we make machine learning algorithms scale to large datasets? There are two main approaches: (1) developing parallelizable ML algorithms and integrating them with large parallel systems and (2) developing more efficient algorithms. More often than not, the latter approach requires some sort of relaxation of an underlying task. Hashing, streaming algorithms and sketching are increasingly employed to achieve efficient approximate algorithms that arise in ML tasks. Below, I highlight a few examples, mostly from NIPS 2012, with several coming from the Big Learning workshop.

Nearest neighbor search (or similarity search) appears in many “meta” ML tasks such as information retrieval and near-duplicate detection. Many approximate approches are based on locality-sensitive hashing (LSH). The basic idea with LSH is Continue reading “Hashing, streaming and sketching”