Discriminative (supervised) Learning

Shamim Nemati · December 26, 2012

Often the goal of inference and learning is to use the inferred marginal distributions for prediction or classification. In such scenarios, finding the correct "model structure" or the true "model parameters", via maximum-likelihood (ML) estimation or (generalized) expectation-maximization (EM), is secondary to the final objective of minimizing a prediction or classification cost function. Recently, I came across a few interesting papers on learning and inference in graphical models by direct optimization of a cost function of the inferred marginal distributions (or normalized beliefs) [1, 2, 3, 4]:

$$ e = C(\text{outcomes}, f(b); \Theta), $$

where $f$ is a differentiable function that maps the beliefs $b$ to the outcomes/labels of interest, $\Theta$ is the set of model parameters, and $C$ is a differentiable cost function that penalizes incorrect classifications or predictions.
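
For concreteness, one simple instantiation (my own illustrative example, not one drawn from the cited papers) takes $f$ to be a softmax read-out of the belief at a designated output node and $C$ to be the cross-entropy against the observed label $y$:

$$ e = -\log f_{y}(b), \qquad f(b) = \mathrm{softmax}(W b), $$

where $W$ is a read-out matrix that can be lumped into $\Theta$ and learned alongside the graphical-model parameters.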

The main idea of discriminative learning is to use "back-propagation" to calculate the gradient of the error $e$ with respect to the marginals and the model parameters. In direct analogy to learning in neural networks, one has to keep track of every step of inference (the forward pass) and then reverse the operations (reverse-mode automatic differentiation) to obtain the error gradient, which is then used to minimize the classification/prediction cost.
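
To make the mechanics concrete, here is a minimal sketch (my own, not code from the cited papers) using a modern autodiff library: a fixed number of mean-field sweeps on a toy chain MRF plays the role of the forward inference pass, the belief at one node is mapped to a label, and reverse-mode automatic differentiation supplies the error gradient with respect to the coupling parameters. All names and sizes (pairwise, unaries, N_SWEEPS, the 4-node chain) are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

N, K = 4, 2        # toy chain with 4 nodes, 2 states each (assumed sizes)
N_SWEEPS = 10      # number of unrolled inference sweeps (assumed)

def beliefs(params, unaries):
    """Unrolled mean-field inference on a chain MRF; returns per-node beliefs."""
    pairwise = params["pairwise"]                # (K, K) shared coupling, part of Theta
    b = jax.nn.softmax(unaries, axis=-1)         # initial beliefs from unary potentials
    for _ in range(N_SWEEPS):                    # forward pass, kept explicit
        left = jnp.vstack([jnp.zeros(K), b[:-1] @ pairwise])     # message from left neighbor
        right = jnp.vstack([b[1:] @ pairwise.T, jnp.zeros(K)])   # message from right neighbor
        b = jax.nn.softmax(unaries + left + right, axis=-1)
    return b                                     # (N, K) normalized beliefs b

def error(params, unaries, label):
    """e = C(outcome, f(b); Theta): cross-entropy of the belief at node 0."""
    b = beliefs(params, unaries)
    return -jnp.log(b[0, label] + 1e-9)          # f reads node 0's belief; C is the log loss

params = {"pairwise": 0.5 * jnp.eye(K)}          # toy attractive coupling
unaries = jnp.array([[0.2, -0.2], [0.0, 0.0], [0.1, -0.1], [0.0, 0.0]])

# Reverse-mode differentiation through every inference step:
grad_params = jax.grad(error)(params, unaries, 0)
print(grad_params["pairwise"])
```

Because the inference sweeps are unrolled as ordinary differentiable operations, the same reverse pass that trains a neural network delivers the gradient of the classification cost with respect to the potentials, without ever forming a likelihood.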

I saw one of the first references to this idea in Memisevic (2006) [1], under the name of structured discriminative learning. In Eaton and Ghahramani (2009) [2], error back-propagation through the belief-propagation operations was used within a cutset conditioning algorithm (turning a complex graphical model into a simpler one by conditioning on a set of variables) to decide which variables to condition on. Domke [3] and Stoyanov et al. [4], in the context of learning in Markov random fields, showed that discriminative/supervised learning using error back-propagation can yield better results than likelihood-maximization techniques.

In summary, the approach discussed above allows for applying optimization techniques from the neural-network literature to learning in some classes of graphical models and, conversely, for using ML- and EM-based techniques to initialize equivalent neural-network representations of graphical models for supervised training. The latter may resolve some of the traditional issues with initialization and generalization performance of neural networks, and is related to recent developments in deep learning.

[1] R. Memisevic, "An introduction to structured discriminative learning," Citeseer, Tech. Rep., 2006.
[2] F. Eaton and Z. Ghahramani, "Choosing a variable to clamp: approximate inference using conditioned belief propagation," In Proceedings of AISTATS, vol. 5, pp. 145–152, 2009.
[3] J. Domke, "Beating the likelihood: Marginalization-based parameter learning in graphical models."
[4] V. Stoyanov, A. Ropson, and J. Eisner, "Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure," In Proceedings of AISTATS, 2011.
