Markov chain centenary

Elaine Angelino · January 23, 2013

Machine Learning Meta

I just attended a fun event, Celebrating 100 Years of Markov Chains, at the Institute for Applied Computational Science. There were three talks and they were taped, so hopefully you will be able to find the videos through the IACS website in the near future. Below, I will review some highlights of the first two talks by Brian Hayes and Ryan Adams; I'm skipping the last one because it was more of a review of concepts building up to and surrounding Markov chain Monte Carlo (MCMC).

The first talk was intriguingly called "First Links in the Markov Chain: Poetry and Probability" and was given by Brian Hayes, who also organized today's event and is a Senior Writer at American Scientist magazine. This was a fascinating lecture on the history behind A. A. Markov's work, "An Example of Statistical Investigation of the Text 'Eugene Onegin' Concerning the Connection of Samples in Chains" (January 23, 1913). Hayes began by using the game Monopoly as well as a toy model of weather to introduce Markov chains, (powers of) transition matrices and convergence to stationarity. I especially recommend this part of his talk if you would like to see an excellent example of how to clearly, concisely and concretely present technical material to a general audience. Before diving into some history, Hayes summarized the importance of MCMC in science today by invoking a quote from Persi Diaconis (2009): "To someone working in my part of the world, asking about applications of [MCMC] is a little like asking about applications of the quadratic formula. The results are really used in every aspect of scientific inquiry."

Hayes's history lesson began with Peter the Great's founding of the Academy of Sciences at St. Petersburg, where the likes of Euler and a couple of Bernoulli brothers worked during the 18th century. These foreigners gave way to more Russians in the 19th century, including Chebyshev, and later his student Markov (1856-1922). Incredibly, Markov's work during the early 20th century on convergence of transition matrices was apparently sparked by an argument for free will by Nekrasov, a theologian-turned-mathematician.

The connection between Markov chains and poetry comes through Markov's analysis of Alexander Pushkin's "Eugene Onegin", described by Hayes as "the Russian anthem poem that schoolchildren memorize" and highly recommended by him. Markov took the poem's first 70 stanzas (20,000 characters), removed punctuation and white spaces, counted (!) pairs of characters and calculated their moments. This analysis quantitatively demonstrated a strong tendency to alternate vowels and consonants, which makes sense when you think about pronunciation.

I've left out a lot of historical, political and dramatic details from Hayes's talk -- including the connection to Shannon -- so I hope you have a chance to enjoy a video of his lecture!

Next, Ryan Adams gave a broad overview of how graphical models arise from Markovian concepts, entitled, "From Markov to Pearl: Conditional Independence as a Driving Principle for Probabilistic Modeling". His talk includes lots of recent examples and references; here, I'd like to summarize the topics covered because I think a lot of people will enjoy this talk as a useful introduction to graphical models.

Adams begins with joint probability distributions and their connection to conditional probability distributions through Bayes's Theorem -- he uses a great little computer vision example concerning a photograph of cows and the question, What is in the foreground? Next, he considers two extreme models of (word) sequences to motivate Markov chains: the full joint distribution is really complicated, while the fully factored distribution loses a lot of information. Assuming that each word depends only on the preceding word -- as Markov did with sequences of letters -- yields a model that is only quadratic in the size of the vocabulary.

This Markov property, sometimes referred to as "memorylessness", is a particular case of conditional independence that we can represent through probabilistic graphical models. In these graphs, vertices represent random variables and edges represent conditional distributions; importantly, they aren't restricted to chains but can be any directed acyclic graph. (The lecture slides have lots of useful pictures!) Adams goes over different kinds of information flow in these graphs and covers undirected graphical models (also called Markov random fields) and factor graph models.

A major point that Adams makes is that in addition to the conceptual benefits of graphical models, they also lead to great computational advantages. In particular, the graph -- which recall specifies information flow -- enables inference via efficient message passing algorithms composed of local computations that can be asynchronously updated in parallel. Adams provides a clear example of how to use message passing to compute a marginal distribution.
Efficient online changepoint detection -- important in many applications such as manufacturing and robotics -- exemplifies the power of combining graphical models and belief propagation (Adams and MacKay, 2007) and can be implemented in "a few lines of MATLAB". Adams also mentions loopy belief propagation and makes connections to error correcting codes and data clustering.

Finally, Adams turned to MCMC for graphical models, noting that it can be made efficient through local updates -- basically, you look at your neighbors to do a Gibbs sampling update. Depending on graph's structure, you can try to find colorings that lead to efficient parallel updates; one example is a Restricted Boltzmann machine (RBM), which has bipartite structure.

Share: Twitter, Facebook