Exponential Families and Maximum Entropy

[latexpage]An exponential family parametrized by $\boldsymbol\theta \in \mathbb R^d$ is the set of probability distributions that can be expressed as

$p({\bf x} \,|\, \boldsymbol\theta) =\frac{1}{Z(\boldsymbol\theta)} h({\bf x}) \exp\left( \boldsymbol\theta^{\mathsf T}\boldsymbol\phi({\bf x}) \right) ,$

for given functions $Z(\boldsymbol\theta)}$ (the partition function), $h({\bf x})$, and $\boldsymbol\phi({\bf x})$ (the vector of sufficient statistics). Exponential families can be discrete or continuous, and examples include Gaussian distributions, Poisson distributions, and gamma distributions. Exponential families have a number of desirable properties. For instance, they have conjugate priors and they can summarize arbitrary amounts of data using a fixed-size vector of sufficient statistics. But in addition to their convenience, their use is theoretically justified.

Suppose we would like to find a particular probability distribution $p({\bf x})$. All we know about $p({\bf x})$ is that $\mathbb E_{p({\bf x})}[f_k] = c_k$ for a specific collection of functions $f_1, \ldots, f_K$. Which distribution should we use? The principle of maximum entropy asserts that when choosing among all valid distributions, we should choose the one with maximum entropy. In other words, we should choose the distribution which has the greatest uncertainty consistent with the constraints.

For simplicity, consider the case in which $p({\bf x})$ is a distribution over a finite set $\{x_1, \ldots, x_J\}$. Then $p({\bf x})$ can be written as a vector $(p_1, \ldots, p_J)$. We would like to maximize

$-\sum_{j} p_j \log p_j$

subject to the constraints $\sum_j p_j = 1$ and $\sum_j p_j f_k(x_j) = c_k$. Defining the quantity

\begin{eqnarray*}

\Lambda & = & -\sum_j p_j\log p_j + \lambda_0 \left( \sum_j p_j – 1\right) + \\

& & \sum_k \lambda_k\left( \sum_j p_j f_k(x_j) – c_k\right)

\end{eqnarray*}

and using Lagrange multipliers in the usual way, we end up with the constraints

\begin{eqnarray*}

0 & = & \frac{\partial \Lambda}{\partial p_j} = -\log p_j – 1 + \lambda_0 + \sum_k \lambda_k f_k(x_j) \\

0 & = & \frac{\partial \Lambda}{\partial \lambda_0} = \sum_j p_j – 1 \\

0 & = & \frac{\partial \Lambda}{\partial \lambda_k} = \sum_j p_j f_k(x_j) – c_k .

\end{eqnarray*}

Using only the first equality, we see that

$p_j \propto \exp\left( \sum_k \lambda_k f_k(x_j)\right) = \exp\left( \boldsymbol\lambda^{\mathsf T} \boldsymbol\phi(x_j) \right) ,$

where $\boldsymbol\lambda = (\lambda_1, \ldots, \lambda_K)^{\mathsf T}$ and $\boldsymbol\phi(x_j) = (f_1(x_j), \ldots, f_K(x_j))^{\mathsf T}$. This derivation shows that $p({\bf x})$ belongs to an exponential family.