The Central Limit Theorem

Robert Nishihara

The proof and intuition presented here come from this excellent writeup by Yuval Filmus, which in turn draws upon ideas in this book by Fumio Hiai and Denes Petz.

Suppose that we have a sequence of real-valued random variables X_1, X_2, \ldots. Define the random variable

(1)   \begin{equation*} A_N = \frac{X_1 + \cdots + X_N}{\sqrt{N}} \end{equation*}

to be a scaled sum of the first N variables in the sequence. Now, we would like to make interesting statements about the sequence

(2)   \begin{equation*} A_1, A_2, \ldots . \end{equation*}
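To get a feel for the behavior of A_N, here is a minimal numerical sketch, assuming (purely for illustration) that the X_i are Uniform(-1, 1), so that \mathbb E[X_i] = 0 and \textnormal{var}(X_i) = 1/3:

    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(0)
    N, trials = 100, 50_000

    # Draw `trials` independent realizations of A_N = (X_1 + ... + X_N)/sqrt(N),
    # with the illustrative choice X_i ~ Uniform(-1, 1), so var(X_i) = 1/3.
    X = rng.uniform(-1.0, 1.0, size=(trials, N))
    A_N = X.sum(axis=1) / np.sqrt(N)

    def normal_cdf(x, var):
        # CDF of N(0, var), written in terms of the error function.
        return 0.5 * (1.0 + erf(x / sqrt(2.0 * var)))

    # Compare the empirical probability of an interval [a, b] with the
    # N(0, 1/3) probability, anticipating the theorem stated below.
    a, b = -0.5, 1.0
    print(np.mean((A_N >= a) & (A_N <= b)))
    print(normal_cdf(b, 1.0 / 3.0) - normal_cdf(a, 1.0 / 3.0))

Already at N = 100, the two printed probabilities should agree to roughly two decimal places.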

The central limit theorem is quite general. To simplify this exposition, I will make a number of assumptions. First, I will assume that the X_i are independent and identically distributed. Second, I will assume that each X_i has mean \mathbb E[X_i] = 0 and variance \textnormal{var}(X_i) = \sigma^2. Finally, I will assume that the moment generating function (to be defined below) converges in a neighborhood of zero (this condition requires all moments of the distribution to exist).

Under these conditions, the central limit theorem tells us that

(3)   \begin{equation*} A_1, A_2, \ldots \to \mathcal N(0,\sigma^2) , \end{equation*}

where \mathcal N(\mu, \sigma^2) is the normal distribution with density function

(4)   \begin{equation*} \mathcal N(x ; \mu, \sigma^2) = \frac{1}{ \sigma \sqrt{2 \pi}} \exp\left(-\frac{ (x-\mu)^2}{2\sigma^2}\right) , \end{equation*}

and where P_1 \to P_2 means that P_1([a,b]) \to P_2([a,b]) for all intervals [a,b] \subset \mathbb R. It is not immediately obvious that the sequence \{A_N\} should converge, but if it does converge, the normal distribution is the natural candidate. Suppose it converges to some distribution D. Now, consider the random variable

(5)   \begin{equation*} B_N = \frac{1}{\sqrt{2}} \left(\frac{X_1 + \cdots + X_N}{\sqrt{N}} + \frac{X_{N+1} + \cdots + X_{2N}}{\sqrt{N}} \right) . \end{equation*}

Of course, B_N is just A_{2N}, so B_N \to D. But the first term on the RHS is A_N, and the second term on the RHS is independent of the first and has the same distribution as A_N, so in the limit we have

(6)   \begin{equation*} B_N \to \frac{D + D}{\sqrt{2}} , \end{equation*}

where the two D’s are independent random variables with the same distribution. So D and (D + D)/\sqrt{2} must have the same distribution. By grouping terms in different proportions, we can derive similar properties of D. Since the normal distribution satisfies (with independent summands)

(7)   \begin{eqnarray*} \mathcal N(\mu_1, \sigma_1^2) + \mathcal N(\mu_2, \sigma_2^2) & = & \mathcal N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2) \\ c \mathcal N(\mu, \sigma^2) & = & \mathcal N(c\mu, c^2\sigma^2) , \end{eqnarray*}

it is a natural candidate for the distribution D.
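This stability property is easy to spot-check numerically. The following sketch (an illustration, not part of the argument) draws two independent copies of D \sim \mathcal N(0, \sigma^2) and compares the quantiles of D with those of (D + D)/\sqrt{2}:

    import numpy as np

    rng = np.random.default_rng(1)
    sigma = 2.0  # an arbitrary illustrative scale

    D1 = rng.normal(0.0, sigma, size=500_000)  # one copy of D ~ N(0, sigma^2)
    D2 = rng.normal(0.0, sigma, size=500_000)  # an independent copy
    mix = (D1 + D2) / np.sqrt(2.0)

    # Matching quantiles indicate matching distributions, up to sampling noise.
    qs = [0.05, 0.25, 0.50, 0.75, 0.95]
    print(np.quantile(D1, qs))
    print(np.quantile(mix, qs))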

To prove the central limit theorem, we will make use of the moment generating function

(8)   \begin{equation*} M_{X}(t) = \mathbb E [\exp (tX) ] = 1 + \mathbb E[X] t + \frac{1}{2} \mathbb  E[X^2] t^2 + \cdots \end{equation*}

and the cumulant generating function

(9)   \begin{equation*} K_{X}(t) = \log \mathbb E[\exp (tX) ] = \mathbb E[X] t + \frac{\mathbb E[X^2] - \mathbb E[X]^2}{2} t^2 + \cdots . \end{equation*}
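Here is a tiny numerical check of the expansion in (9), using a fair six-sided die as an illustrative test distribution:

    import numpy as np

    faces = np.arange(1, 7)  # outcomes of a fair die, each with probability 1/6
    m1 = faces.mean()        # E[X]
    m2 = (faces**2).mean()   # E[X^2]

    # At small t, K_X(t) should match its quadratic truncation from (9)
    # up to terms of order t^3.
    t = 1e-3
    K_exact = np.log(np.mean(np.exp(t * faces)))
    K_quadratic = m1 * t + 0.5 * (m2 - m1**2) * t**2
    print(K_exact, K_quadratic)

The two printed values should agree to many decimal places at this value of t.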

The coefficients of t^n in the moment generating function and the cumulant generating function, multiplied by n!, are referred to as “moments” and “cumulants” respectively. The cumulants and the moments are closely related, and the values of one determine the values of the other. Incidentally, the moment generating function and the cumulant generating function of the normal distribution \mathcal N(\mu,\sigma^2) are given by

(10)   \begin{eqnarray*} M(t) & = & \exp\left(\mu t + \frac{\sigma^2}{2} t^2 \right) \\ K(t) & = & \mu t + \frac{\sigma^2}{2} t^2 . \end{eqnarray*}
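As a sanity check on (10), one can estimate \mathbb E[\exp(tX)] for normal X by Monte Carlo and compare it against the closed form (again an illustrative sketch):

    import numpy as np

    rng = np.random.default_rng(2)
    mu, sigma, t = 0.5, 1.5, 0.7  # arbitrary illustrative parameters

    # Monte Carlo estimate of M(t) = E[exp(tX)] for X ~ N(mu, sigma^2),
    # compared with the closed form exp(mu t + sigma^2 t^2 / 2).
    X = rng.normal(mu, sigma, size=2_000_000)
    print(np.mean(np.exp(t * X)))
    print(np.exp(mu * t + 0.5 * sigma**2 * t**2))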

Note that, when X and Y are independent, the moment generating function satisfies

(11)   \begin{eqnarray*} M_{X+Y}(t) & = & M_{X}(t) M_{Y}(t) \\ M_{cX}(t) & = & M_{X}(ct) . \end{eqnarray*}

It follows that the cumulant generating function satisfies

(12)   \begin{eqnarray*} K_{X+Y}(t) & = & K_{X}(t) + K_{Y}(t) \\ K_{cX}(t) & = & K_{X}(ct) . \end{eqnarray*}
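These identities can be verified exactly for finite distributions. The following sketch uses two independent fair dice (an illustrative choice), enumerating all outcomes so that no sampling is involved:

    import numpy as np

    faces = np.arange(1, 7)
    t = 0.3  # an arbitrary evaluation point

    # Exact MGF of one fair die, and of the sum of two independent dice,
    # computed by enumerating all equally likely outcomes.
    M_X = np.mean(np.exp(t * faces))
    sums = (faces[:, None] + faces[None, :]).ravel()  # all 36 sums
    M_sum = np.mean(np.exp(t * sums))

    print(M_sum, M_X * M_X)                # (11): M_{X+Y}(t) = M_X(t) M_Y(t)
    print(np.log(M_sum), 2 * np.log(M_X))  # (12): K_{X+Y}(t) = K_X(t) + K_Y(t)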

Now, we are going to use all of these tools. Let’s inspect the cumulant generating function of A_N. Since the X_i are independent, the two properties in (12) give

(13)   \begin{equation*} K_{A_N}(t) = K_{X_1}\left(\frac{t}{\sqrt{N}}\right) + \cdots + K_{X_N}\left(\frac{t}{\sqrt{N}}\right) . \end{equation*}

Let k^m_{X} denote the coefficient of t^m in K_{X}(t) (the mth cumulant, up to the factor of m!). Equating powers of t above, we get

(14)   \begin{equation*} k^m_{A_N} = \left(k^m_{X_1} + \cdots + k^m_{X_N} \right) N^{-\frac{m}{2}} . \end{equation*}

From the case m=1, we see that

(15)   \begin{equation*} \mathbb E[A_N] = \frac{\mathbb E[X_1] + \cdots + \mathbb E[X_N]}{\sqrt{N}} = 0, \end{equation*}

as expected. From the case m=2, we see that

(16)   \begin{equation*} \textnormal{var}(A_N) = \frac{\textnormal{var}(X_1) + \cdots + \textnormal{var}(X_N)}{N} = \sigma^2, \end{equation*}

also as expected. But what about the higher cumulants? For m > 2, since the X_i are identically distributed, equation (14) gives, as N \to \infty,

(17)   \begin{equation*} k^m_{A_N} = k^m_{X_1} N^{1-\frac{m}{2}} \to 0 . \end{equation*}

Therefore, the higher cumulants all vanish in the limit. It follows that the cumulants of the sequence \{A_N\} converge to the cumulants of \mathcal N(0,\sigma^2), and hence that the moments of the sequence \{A_N\} converge to the moments of \mathcal N(0,\sigma^2). It then follows from Lévy’s continuity theorem that A_N \to \mathcal N(0,\sigma^2), as desired.
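As a closing numerical illustration of (17), the following sketch (reusing the illustrative Uniform(-1, 1) choice for the X_i from above) estimates the fourth cumulant of A_N, which should shrink like N^{1 - 4/2} = 1/N:

    import numpy as np

    rng = np.random.default_rng(3)

    def fourth_cumulant(samples):
        # For a mean-zero variable, k^4 = E[A^4] - 3 (E[A^2])^2;
        # it is exactly zero for the normal distribution.
        m2 = np.mean(samples**2)
        m4 = np.mean(samples**4)
        return m4 - 3.0 * m2**2

    for N in (1, 5, 25):
        X = rng.uniform(-1.0, 1.0, size=(400_000, N))
        A_N = X.sum(axis=1) / np.sqrt(N)
        print(N, fourth_cumulant(A_N))

Up to Monte Carlo noise, each printed value should be roughly one fifth of the previous one, matching the predicted 1/N decay.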