Asymptotic Equipartition of Markov Chains
The Asymptotic Equipartition Property/Principle (AEP) is a well-known result that is likely covered in any introductory information theory class. Nevertheless, when I first learned about it in such a course, I did not appreciate the implications of its general form. In this post I will review this beautiful, classic result and offer the mental picture I have of its implications. I will frame my discussion in terms of Markov chains with discrete state spaces, but note that the AEP holds even more generally. My treatment will be relatively informal, and I will assume basic familiarity with Markov chains. See the references for more details.
The intuition for the AEP is most clear in the special case of i.i.d. samples. Consider a chain of $n$ samples from a categorical distribution over the states $\{1, \dots, k\}$ described by the probability vector $p = (p_1, \dots, p_k)$. In this case each symbol $i$ will appear about $n p_i$ times with high probability, and in particular the joint probability of each chain will be about $\prod_{i=1}^{k} p_i^{n p_i}$. Interestingly, $\prod_{i=1}^{k} p_i^{n p_i} = 2^{n \sum_i p_i \log_2 p_i} = 2^{-n H(p)}$, where $H(p) = -\sum_{i=1}^{k} p_i \log_2 p_i$ is the entropy of the sampling distribution. Thus the probability of most sampled trajectories will be close to $2^{-n H(p)}$.
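This concentration is easy to check numerically. Below is a minimal sketch (my own, using NumPy; the particular distribution and chain length are arbitrary choices) that samples long i.i.d. sequences and compares the per-symbol quantity $-\frac{1}{n}\log_2 p(x_1, \dots, x_n)$ to the entropy $H(p)$:

```python
# Minimal numerical check of the i.i.d. AEP: sample long i.i.d. sequences
# from a categorical distribution p and compare -log2 p(x_1..x_n)/n to H(p).
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])          # categorical distribution over k=3 states
H = -np.sum(p * np.log2(p))            # entropy in bits
n = 10_000                             # chain length

for _ in range(5):
    x = rng.choice(len(p), size=n, p=p)    # i.i.d. samples
    log_prob = np.sum(np.log2(p[x]))       # log2 of the joint probability
    print(f"-log2 P(x)/n = {-log_prob / n:.4f}  vs  H(p) = {H:.4f}")
```

Each printed value should land close to $H(p) \approx 1.485$ bits.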
Somewhat surprisingly, the same result holds for general (ergodic, stationary) Markov chains. In this case the AEP is stated (as in Shannon [1]): the set of possible length-$n$ trajectories of a Markov chain $X_1, \dots, X_n$ can be partitioned into two disjoint groups, one set whose total probability is arbitrarily small as $n$ increases, and another whose members each have a probability $p$ satisfying $\lim_{n \to \infty} \frac{-\log_2 p}{n} = H$, where $H = -\sum_i \pi_i \sum_j P_{ij} \log_2 P_{ij}$ is the entropy rate of the chain (with transition matrix $P$ and invariant distribution $\pi$). See Cover and Thomas [2] or the AEP Wikipedia page for a proof. This result is called the “equipartition” principle because it says that most of the trajectories in the “typical” or likely set will have approximately equal probability, though I should note that MacKay [3] warns about taking this interpretation too literally.
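The same numerical check works for a genuine Markov chain. Here is a rough sketch (again my own construction, not code from the references) that computes the entropy rate of a small two-state chain and verifies that $-\frac{1}{n}\log_2 p(x_1, \dots, x_n)$ of a sampled trajectory approaches it:

```python
# Check the Markov-chain AEP: for an ergodic chain with transition matrix P
# and stationary distribution pi, -log2 p(x_1..x_n)/n should approach the
# entropy rate H = -sum_i pi_i sum_j P_ij log2 P_ij.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])              # transition matrix

# Stationary distribution: left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.isclose(evals, 1)][:, 0])
pi /= pi.sum()

H = -np.sum(pi[:, None] * P * np.log2(P))   # entropy rate, bits per step

n = 100_000
x = rng.choice(2, p=pi)                     # start the chain at stationarity
log_prob = np.log2(pi[x])
for _ in range(n - 1):
    x_next = rng.choice(2, p=P[x])          # one transition
    log_prob += np.log2(P[x, x_next])
    x = x_next
print(f"-log2 P(x)/n = {-log_prob / n:.4f}  vs  entropy rate H = {H:.4f}")
```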
The reason this is interesting is partly because you might think that as the length of a Markov chain grows there would be $k^n$ trajectories each with approximately equal probability. A moment’s thought shows this cannot be the case, since trajectories consisting mainly of unlikely states will become far less likely. The AEP formalizes this intuition and gives a precise characterization of the number of likely trajectories. At one extreme, a chain with zero entropy, which always samples the same state, will have only one trajectory with nonzero probability. At the other extreme, a chain that always samples from the uniform distribution over all $k$ states will indeed have what one might naively suspect, $k^n = 2^{n \log_2 k}$ equally likely trajectories. The AEP nicely characterizes the rest of the spectrum.
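To get a concrete sense of scale (a worked example of my own, not from the references): for i.i.d. flips of a biased coin with $p = (0.9, 0.1)$, the entropy is $H(p) \approx 0.469$ bits, so among the $2^{100}$ possible length-$100$ sequences the typical set contains only about $2^{46.9}$, a fraction of roughly $2^{-53}$.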
The way I currently visualize the AEP is then as a kind of row property of Markov chains. Imagine laying out all possible trajectories of a length-$n$ Markov chain, $X_1, \dots, X_n$, in a $k^n \times n$ matrix, where each row $x = (x_1, \dots, x_n)$ is a trajectory with joint probability $p(x_1, \dots, x_n)$. One important column property of this matrix is that the marginal distribution of the states in any arbitrary column $t$, $p(x_t)$, will be the invariant distribution of the Markov chain. The AEP then describes an interesting row property: the probability mass of $p(x_1, \dots, x_n)$ is concentrated on only about $2^{nH}$ of the $k^n$ rows, and each of those rows has approximately equal probability, $2^{-nH}$. The AEP is thus a fascinating and deep result relating the entropy of a Markov chain not only to the joint probability of its trajectories but also to the size of the effective support over its possible trajectories.
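This picture can be made literal for a short chain, where all $k^n$ rows fit in memory. The toy script below (my own illustration; the transition matrix is an arbitrary choice with invariant distribution $\pi = (0.8, 0.2)$) builds the full matrix of trajectories, checks that each column’s marginal is $\pi$, and counts how few rows carry almost all of the probability mass:

```python
# Toy illustration of the "matrix of trajectories" picture: enumerate all
# k^n trajectories of a short stationary chain, then check (a) each column's
# marginal is the invariant distribution, and (b) how few rows carry most
# of the probability mass.
import itertools
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = np.array([0.8, 0.2])               # invariant: pi @ P == pi
n, k = 12, 2

rows = np.array(list(itertools.product(range(k), repeat=n)))   # k^n x n matrix
probs = pi[rows[:, 0]] * np.prod(P[rows[:, :-1], rows[:, 1:]], axis=1)

# (a) column marginals: sum trajectory probabilities where column t equals s.
for t in (0, n // 2, n - 1):
    marginal = [probs[rows[:, t] == s].sum() for s in range(k)]
    print(f"column {t} marginal: {np.round(marginal, 4)}")

# (b) sort rows by probability and count how many carry 99% of the mass.
order = np.argsort(probs)[::-1]
cum = np.cumsum(probs[order])
print(f"{np.searchsorted(cum, 0.99) + 1} of {k**n} rows hold 99% of the mass")
```

At such a small $n$ the asymptotics are loose, but the concentration is already visible: far fewer than $2^{12} = 4096$ rows account for 99% of the mass.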
[1] Claude E. Shannon. “A Mathematical Theory of Communication”, Section 7. The Bell System Technical Journal. 1948.
[2] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory, Chapters 3 and 16.8. Second edition. John Wiley & Sons, Inc. 2006.
[3] David J.C. MacKay. Information Theory, Inference, and Learning Algorithms, Section 4.4. Version 7.2. Cambridge University Press. 2005.