Asymptotic Equipartition of Markov Chains

Peter KrafftStatisticsLeave a Comment

The Asymptotic Equipartition Property/Principle (AEP) is a well-known result that is likely covered in any introductory information theory class. Nevertheless, when I first learned about it in such a course, I did not appreciate the implications of its general form.  In this post I will review this beautiful, classic result and offer the mental picture I have of its implications. I will frame my discussion in terms of Markov chains with discrete state spaces, but note that the AEP holds even more generally. My treatment will be relatively informal, and I will assume basic familiarity with Markov chains. See the references for more details.

The intuition for the AEP is most clear in the special case of i.i.d. samples. Consider a chain of N samples from a categorical distribution over the states 1,\ldots,S described by the probability vector \vec{p}. In this case each symbol will appear about p_s N times with high probability, and in particular the joint probability of each chain will be about p = p_1^{p_1 N}p_2^{p_2 N}\cdots p_S^{p_S N}. Interestingly, \log p = N\sum_s p_s \log p_s = -NH, where H is the entropy of the sampling distribution. Thus the probability of most trajectories sampled will be close to 2^{-NH}.

Somewhat surprisingly, the same results holds for general (ergodic, stationary) Markov chains.  In this case the AEP is stated (as in Shannon [1]): The set of possible length N trajectories of a Markov chain X_1 = x_1, \ldots, X_N = x_N can be partitioned into two disjoint groups, one set whose total probability is arbitrarily small as N increases and another whose members each have a probability p = P(X_1 = x_1, \ldots, X_N = x_N) with \frac{\log p^{-1}}{N} \approx H. See Cover and Thomas [2] or the AEP wikipedia page for a proof. This result is called the “equipartition” principle because it says that most of the trajectories in the “typical” or likely set will have approximately equal probability, though I should note that MacKay [3] warns about taking this interpretation too literally.

The reason this is interesting is partly because you might think that as the length of a Markov chain grows there would be S^N trajectories each with approximately equal probability. A moment’s thought shows this cannot be the case since trajectories consisting mainly of unlikely states will become far less likely. The AEP formalizes this intuition and gives a precise characterization of the number of likely trajectories. At one extreme a chain with zero entropy, which always samples the same state, will have only one trajectory with nonzero probability. At the other extreme a chain that always samples from the uniform distribution over all states will indeed have what one might naively suspect, S^N equally likely trajectories. The AEP nicely characterizes the rest of the spectrum.

The way I currently visualize the AEP is then as kind of a row property of Markov chains.  Imagine laying out all possible trajectories of a length N Markov chain, X_1, \ldots, X_N, in an S^N \times N matrix, where each row i is a trajectory with joint probability P(X_1 = x_{i1}, \ldots, X_N = x_{iN}). One important column property of this matrix is that the marginal distribution of the states in any arbitrary column j, P(X_j = x) = \sum_{i} P(X_1 = x_{i1}, \ldots, , X_j = x, \ldots, X_N = x_{iN}), will be the invariant distribution of the Markov chain. The AEP then describes an interesting row property: the probability mass of P(X_1, \ldots, X_N) is concentrated on only 2^{NH} of the S^N rows, and each of those rows has approximately equal probability, 2^{-NH}. The AEP is thus a fascinating and deep result relating the entropy of a Markov chain to not only the joint probability of its trajectories but also the size of the effective support over its possible trajectories.

 

[1] Claude E. Shannon. “A Mathematical Theory of Communication”, Section 7. The Bell System Technical Journal. 1948.
[2] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory, Chapters 3 and 16.8. Second edition. John Wiley & Sons, Inc. 2006.
[3] David J.C. MacKay. Information Theory, Inference, and Learning Algorithms, Section 4.4. Version 7.2. Cambridge University Press. 2005.

Leave a Reply