## What is the Computational Capacity of the Brain?

One big recent piece of news from across the Atlantic is that the European Commission is funding a brain simulation project to the tune of a billion Euros. Roughly, the objective is to simulate the entire human brain using a supercomputer. Needless to say, many people are skeptical and there are lots of reasons that one might think this project is unlikely to yield useful results. One criticism centers on whether even a supercomputer can simulate the complexity of the brain. A first step towards simulating a brain is thinking about how many FLOP/s (floating point operations per second) would be necessary to implement similar functionality in conventional computer hardware. Here I will discuss two back-of-the-envelope estimates of this computational capacity. (Don’t take these too seriously, they’re a little crazy and just shooting for orders of magnitude.)
Continue reading “What is the Computational Capacity of the Brain?”

## Is AI scary?

In today’s New York Times, Huw Price, professor of philosophy at Cambridge, writes about the need for considering the potential dangers associated with a possible “singularity.” The singularity is the idea, I guess, that if people create machines that are smarter than people then those machines would be smart enough to create machines smarter than themselves, etc., and that there would be an exponential explosion in artificial intelligence. Price suggests that whether or not the singularity is likely enough to warrant study in its own right, it is the possible danger associated with it that makes it important.

I’m not remotely worried about this. As someone who has been toiling away for many months at creating an artificial intelligence algorithm that has something evolutionary about it, I feel that my pessimism (or, optimism, as Price calls it) is informed. But rather than try to explain why I’m pessimistic, I thought I would present and react to only one point that Price makes. He writes:

biology got us onto this exalted peak in the landscape, the tricks are all there for our inspection: most of it is done with the glop inside our skulls. Understand that, and you understand how to do it artificially, at least in principle. Sure, it could turn out that there’s then no way to improve things – that biology, despite all the constraints, really has hit some sort of fundamental maximum. Or it could turn out that the task of figuring out how biology did it is just beyond us, at least for the foreseeable future (even the remotely foreseeable future). But again, are you going to bet your grandchildren on that possibility?

## Dealing with Reliability when Crowdsourcing

[latexpage]I recently read the paper “Variational Inference for Crowdsourcing,” by Qiang Liu, Jian Peng, and Alexander Ihler. They present an approach using belief propagation to deal with reliability when using crowdsourcing to collect labeled data. This post is based on their exposition. Crowdsourcing (via services such as Amazon Mechanical Turk) has been used as a cheap way to amass large quantities of labeled data. However, the labels are likely to be noisy.

To deal with this, a common strategy is to employ redundancy: each task is labeled by multiple workers. For simplicity, suppose there are $N$ tasks and $M$ workers, and assume that the possible labels are $\{\pm 1\}$. Define the $N \times M$ matrix $L$ so that $L_{ij}$ is the label given to task $i$ by worker $j$ (or $0$ if no label was provided). Let $z = (z_1, \ldots, z_N)$ be the true labels of the tasks. Given $L$, we wish to come up with an estimator $\hat{z}$ of $z$ so as to minimize the average error

\begin{align}

\frac{1}{N} \sum_{i=1}^N \text{prob}[\hat{z}_i \ne z_i] .

[latexpage]

A common activity in statistics and machine learning is optimization. For instance, finding maximum likelihood and maximum a posteriori estimates require maximizing the likilihood function and posterior distribution respectively. Another example, and the motivating example for this post, is using variational inference to approximate a posterior distribution. Suppose we are interested in a posterior distribution, $p$, that we cannot compute analytically. We will approximate $p$ with the variational distribution $q(\phi)$ that is parameterized by the variational parameters $\phi$. Variational inference then proceeds to minimize the KL divergence from $q$ to $p$, $KL(q||p)$. The dominant assumption in machine learning for the form of $q$ is a product distribution, that is $q = \prod_k q_k(\phi_k)$ (where we assume there are $K$ variational parameters). It can be shown that minimizing $KL(q||p)$ is equivalent to maximizing the evidence lower bound [2], denoted $L(\phi)$.

In this post we will consider optimizing the parameters of a probability distribution and will see that using the gradient as in basic calculus can follow suboptimal directions. This can cause slow convergence or convergence to inferior local modes in non-convex problems. Along the way we will introduce some concepts from differential geometry and derive more efficient gradient directions to follow. Continue reading “The Natural Gradient”

## Complexity of Inference in Bayesian Networks

[latexpage]

Developing efficient (i.e. polynomial time) algorithms with guaranteed performance is a central goal in computer science (perhaps the central goal). In machine learning, inference algorithms meeting these requirements are much rarer than we would like: often, an algorithm is either efficient but doesn’t perform optimally or vice versa. A number of results from the 1990’s demonstrate the challenges of, but also the potential for, efficient Bayesian inference. These results were carried out in the context of Bayesian networks.

## Markov chain centenary

I just attended a fun event, Celebrating 100 Years of Markov Chains, at the Institute for Applied Computational Science. There were three talks and they were taped, so hopefully you will be able to find the videos through the IACS website in the near future. Below, I will review some highlights of the first two talks by Brian Hayes and Ryan Adams; I’m skipping the last one because it was more of a review of concepts building up to and surrounding Markov chain Monte Carlo (MCMC).

The first talk was intriguingly called “First Links in the Markov Chain: Poetry and Probability” Continue reading “Markov chain centenary”

## Aversion of Inversion

[latexpage]

In the spirit of Ryan’s most recent post, I will discuss a fundamental snippet from numerical linear algebra that facilitates computation for the same price of not facilitating it. In our everyday lives, we often come across theoretical expressions that involve matrix inverses stapled to vectors, such as $\Omega^{-1}\mathbf{x}$ with $\Omega\in\mathbb{R}^{n\times n}, \mathbf{x}\in\mathbb{R}^n$. When we proceed to code this up, it is very tempting to first compute $\Omega^{-1}$. Resist doing this! There are several points for why there is no point to actually find an explicit, tangible inverse. Continue reading “Aversion of Inversion”

## Introductory post, and the invariance problem

[latexpage]

There are several topics I would like to talk about in the future posts, and they generally fall under the category of theoretical (or systems) neuroscience, and sometimes more broadly biological physics. The topics to be discussed include: the problem of invariance in theoretical neuroscience, Schrodinger’s take on physics of living matter and other modern thoughts on fundamental principles underlying biology (optimality principle, role of noise, etc), dynamics and computation: are they mutually exclusive concepts?, correlations (correlations in statistical physics, and the role of correlation in an ensemble of neurons), reinforcement learning, and more. I will start with the post on invariance.

## DPMs and Consistency

[latexpage]

Jeff Miller and Matthew Harrison at Brown (go Bears!) have recently explored the posterior consistency of Dirichlet process mixture (DPM) models, emphasizing one particular drawback.

For setup, say you have some observed data $x_1, …, x_n$ from a mixture of two normals, such as

$$p(x|\theta) = \frac{1}{2}\mathcal{N}(x|0,1) + \frac{1}{2}\mathcal{N}(x|6,1)$$

In this case, the number of clusters, $s$, is two, and one would imagine that as $n$ grows, the posterior distribution of $s$ would converge to 2, i.e. $p(s=2|x_{1:n}) \rightarrow 1$. However, this is not true if you model the data with a DPM (or more generally, modeling the mixing measure as a Dirichlet process, $Q \sim DP$).