Asymptotic Equipartition of Markov Chains
Peter Krafft · January 3, 2013
The Asymptotic Equipartition Property/Principle (AEP) is a well-known result that is likely covered in any introductory information theory class. Nevertheless, when I first learned about it in such a course, I did not appreciate the implications of its general form. In this post I will review this beautiful, classic result and offer the mental picture I have of its implications. I will frame my discussion in terms of Markov chains with discrete state spaces, but note that the AEP holds even more generally. My treatment will be relatively informal, and I will assume basic familiarity with Markov chains. See the references for more details.
The intuition for the AEP is clearest in the special case of i.i.d. samples. Consider a chain of $N$ samples from a categorical distribution over the states $1,\ldots,S$ described by the probability vector $\vec{p}$. In this case each state $s$ will appear about $p_s N$ times with high probability, and in particular the joint probability of such a chain will be about $p = p_1^{p_1 N}p_2^{p_2 N}\cdots p_S^{p_S N}$. Interestingly, $\log p = N\sum_s p_s \log p_s = -NH,$ where $H$ is the entropy of the sampling distribution (with the logarithm taken base 2). Thus the probability of most sampled trajectories will be close to $2^{-NH}$.
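This i.i.d. calculation is easy to check numerically. Here is a small sketch (the distribution $\vec{p}$ and sample size are illustrative choices, not from the text): draw $N$ i.i.d. samples, compute the $\log_2$ joint probability of the realized chain, and compare $-\frac{\log_2 p}{N}$ to the entropy $H$.

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.3, 0.2])    # an illustrative categorical distribution, S = 3
H = -np.sum(p * np.log2(p))      # entropy in bits

N = 100_000
chain = rng.choice(len(p), size=N, p=p)   # N i.i.d. samples
log_p = np.sum(np.log2(p[chain]))         # log2 joint probability of this chain

# By the AEP, -log_p / N should be close to H.
print(-log_p / N, "vs entropy", H)
```

By the law of large numbers, the gap between the two printed numbers shrinks like $1/\sqrt{N}$.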
Somewhat surprisingly, the same result holds for general (ergodic, stationary) Markov chains, with $H$ now the entropy rate of the chain. In this case the AEP is stated (as in Shannon [1]): The set of possible length $N$ trajectories of a Markov chain $X_1 = x_1, \ldots, X_N = x_N$ can be partitioned into two disjoint groups, one set whose total probability becomes arbitrarily small as $N$ increases and another whose members each have a probability $p = P(X_1 = x_1, \ldots, X_N = x_N)$ with $\frac{\log p^{-1}}{N} \approx H$. See Cover and Thomas [2] or the AEP Wikipedia page for a proof. This result is called the "equipartition" principle because it says that most of the trajectories in the "typical" or likely set will have approximately equal probability, though I should note that MacKay [3] warns about taking this interpretation too literally.
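The Markov-chain version can be checked the same way. The sketch below (the two-state transition matrix is an illustrative choice) computes the entropy rate $H = \sum_s \pi_s H(T_{s\cdot})$, where $\pi$ is the stationary distribution and $T_{s\cdot}$ the transition distribution out of state $s$, then verifies that a long sampled trajectory has $-\frac{\log_2 p}{N} \approx H$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Transition matrix of an illustrative two-state ergodic chain (rows sum to 1).
T = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Stationary distribution: left eigenvector of T with eigenvalue 1.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Entropy rate in bits: average row entropy weighted by the stationary distribution.
H = -np.sum(pi[:, None] * T * np.log2(T))

# Sample a long trajectory started at stationarity; accumulate log2 probability.
N = 100_000
x = rng.choice(2, p=pi)
log_p = np.log2(pi[x])
for _ in range(N - 1):
    x_next = rng.choice(2, p=T[x])
    log_p += np.log2(T[x, x_next])
    x = x_next

print(-log_p / N, "vs entropy rate", H)
```

The ergodic theorem for Markov chains plays the role that the law of large numbers played in the i.i.d. case.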
Part of what makes this interesting is that one might naively expect that as the length of a Markov chain grows there would be $S^N$ trajectories, each with approximately equal probability. A moment's thought shows this cannot be the case, since trajectories consisting mainly of unlikely states will become far less likely. The AEP formalizes this intuition and gives a precise characterization of the number of likely trajectories. At one extreme, a chain with zero entropy, which always samples the same state, will have only one trajectory with nonzero probability. At the other extreme, a chain that always samples from the uniform distribution over all states will indeed have what one might naively suspect: $S^N$ equally likely trajectories. The AEP nicely characterizes the rest of the spectrum.
The way I currently visualize the AEP is then as a kind of row property of Markov chains. Imagine laying out all possible trajectories of a length $N$ Markov chain, $X_1, \ldots, X_N$, in an $S^N \times N$ matrix, where each row $i$ is a trajectory with joint probability $P(X_1 = x_{i1}, \ldots, X_N = x_{iN})$. One important column property of this matrix is that the marginal distribution of the states in any arbitrary column $j$, $P(X_j = x) = \sum_{i \,:\, x_{ij} = x} P(X_1 = x_{i1}, \ldots, X_N = x_{iN})$, will be the invariant distribution of the Markov chain. The AEP then describes an interesting row property: the probability mass of $P(X_1, \ldots, X_N)$ is concentrated on only $2^{NH}$ of the $S^N$ rows, and each of those rows has approximately equal probability, $2^{-NH}$. The AEP is thus a fascinating and deep result relating the entropy of a Markov chain to not only the joint probability of its trajectories but also the size of the effective support over its possible trajectories.
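For a small enough chain this matrix can be built explicitly. The sketch below (same illustrative two-state chain as above; $N$ is kept small so all $S^N$ rows fit in memory) enumerates every trajectory, checks the column property exactly, and then counts how few rows are needed to cover 99% of the probability mass. With $N$ this small the $2^{NH}$ approximation is loose, but the concentration of mass on a small fraction of rows is already visible.

```python
import itertools
import numpy as np

# Illustrative two-state chain, started from its stationary distribution.
T = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = np.array([0.8, 0.2])    # stationary: pi @ T == pi

S, N = 2, 12
rows = np.array(list(itertools.product(range(S), repeat=N)))  # all S**N trajectories
probs = np.empty(len(rows))
for i, row in enumerate(rows):
    p = pi[row[0]]
    for a, b in zip(row, row[1:]):
        p *= T[a, b]
    probs[i] = p

# Column property: each column's marginal is the stationary distribution.
for j in range(N):
    marginal = np.array([probs[rows[:, j] == s].sum() for s in range(S)])
    assert np.allclose(marginal, pi)

# Row property: most of the mass sits on far fewer than S**N rows.
cum = np.cumsum(np.sort(probs)[::-1])
k99 = np.searchsorted(cum, 0.99) + 1   # rows needed to cover 99% of the mass
print(k99, "of", S**N, "rows carry 99% of the probability")
```

As $N$ grows, the AEP says this 99%-covering count scales like $2^{NH}$, while the full matrix has $S^N = 2^{N\log_2 S}$ rows.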
[1] Claude E. Shannon. "A Mathematical Theory of Communication", Section 7. The Bell System Technical Journal. 1948.
[2] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory, Chapters 3 and 16.8. Second edition. John Wiley & Sons, Inc. 2006.
[3] David J.C. MacKay. Information Theory, Inference, and Learning Algorithms, Section 4.4. Version 7.2. Cambridge University Press. 2005.