Hi, I’m Jon. In my series of posts, I’ll be writing about how we can use the modern Bayesian toolkit to efficiently make decisions, solve problems, and formulate plans (the providence of AI), rather than restrict ourselves to approximating posteriors (the providence of statistics and much of machine learning).
Here’s a simple example of how AI can help out machine learning. What was the first graphical model you were exposed to? There’s a good chance it was Pearl’s famous “Sprinkler, Rain, Wet grass” graphical model[1]. To remind you, we imagine it’s either raining or not and your sprinkler is either on or not. The sprinkler usually turns off when it rains, but not always. The lawn will usually get wet if it’s either raining or the sprinkler is on, and will almost definitely get wet if both happen. There are three nodes in our graphical model and seven numbers suffice to describe all the conditional probabilities. It’s not too hard to learn those parameters from a small dataset of observations, or to perform inference on any subset of the variables, given the others.
Let’s expand this model a little bit. Imagine there are many people, and the government is trying to figure out who's been illegally using their sprinklers during a drought, given knowledge of whose lawn is wet. Each person has a variable amount of neighbors, and people live scattered across different neighborhoods. If it’s raining in one neighborhood, it’s probably raining in an adjacent neighborhood. If your sprinkler is on, it might get your neighbor’s lawn wet. What would the graphical model look like to model this situation with two people? You would need one node to indicate if they are neighbors, another node to indicate if they live in the same neighborhood or adjacent neighborhoods or unadjacent neighborhoods, and then nodes to represent the wetness of their respective lawns and sprinklers. What about with three people? You would need a node to indicate if the first and second person were neighbors, if the first and third were neighbors, if the second and third were neighbors, etc. What if wanted to model Houston (a city which loves its lawns)? With a population of about 6 million people, we are talking something like $6^{12}[/latex] nodes. Considering the NP-hardness of general approximate inference, that is not good if we hope to use this model for reasoning about sprinklers, given knowledge of which lawns are wet. If we were to draw this network, we might use plate notation to help visualize it, but that is only a visualization technique which does not prevent the underlying combinatoric explosion from making inference intractable in this system.
Let’s use a knowledge representation known as Horn logic[2] (a subset of first-order logic) to express this model more succinctly, as opposed to the essentially propositional system used above. I need to introduce a few relations: Neighbor(i,j) is true iff person ‘i’ is a neighbor of person ‘j’. Wet(i) indicates if person ‘i's lawn is wet, and Sprinkler(i) indicates if person ‘i’s sprinkler is on. Adjacent(x, y) indicates if neighborhood x and y are adjacent, and Rain(x) indicates if it is raining in neighborhood ‘x’. Neighberhood(i, x) is true if person ‘i’ lives in neighborhood ‘x’. Then we have the following rules (which are only true with some probability):
If Neighbor(i,j) and Sprinkler(i) or Sprinkler(j), then Wet(i) and Wet(j)
If Rain(x) and Adjacent(x,y), then Rain(y)
If Rain(x) and Neighbood(i,x), then Wet(i)
Those three rules sure beats a trillion nodes. The improvement is more than stylistic: so-called lifted inference algorithms are able to take advantage of this more abstract representation to make inference tractable in situations where the propositional representation fails. In essence, these algorithms work by first grounding the abstract rules into a standard graphical model when given a'domain of discourse' (in our example, the number of people and neighborhoods under consideration). The trick is they only instantiate the part of the graphical model necessary to answer a given query and they take advantage of the parameter-tying between the various conditional probabilities implied by the logical rules. There is no consensus in the probabilistic logic community on how exactly formalize the semantics of assigning probabilities to rules, but one system that has gained popularity is the Markov Logic framework of Pedro Domingos[3]. His group has a high-quality and widely-used graphical toolkit you can check out to use lifted inference in your own models called Alchemy[4].
A fun thing to think about is if we can get even more abstract than first-order logic and thereby have a kind of lifted lifted inference. You may notice these relations look a lot like functions in a programming language. What if we took all the abstraction offered by a programming language to compress graphical models? This relates to probablistic second-order logical systems (where we consider relationships of relationships) and to universal inference engines like Church[5], but that is a topic for another blog post.
In my next post, I’ll be talking about the duality between planning problems and Bayesian inference. Today’s high-performance MCMC samplers may be tomorrow’s first Go grandmasters.
[1] Pearl, Judea. Causality: Models, Reasoning, and Inference.
[2] Horn Logic is a logical system where each rule is of the form (If x is true and y is true and .. is true, then z is true). No negations are allowed. Our intuitive notion of 'rules' fits well with Horn clauses.
[3] Richardson, Matthew and Domingos, Pedro. Machine Learning. Feb 2006.
[4] Alchemy website
[5] Church MIT wiki