Suppose we are modeling a spatial process (for instance, the amount of rainfall around the world, the distribution of natural resources, or the population density of an endangered species). We've measured the latent function $Z$ at some locations $\mathbf{s}_1, \ldots, \mathbf{s}_N$, and we'd like to predict the function's value at some new location $\mathbf{s}_0$. Kriging is a technique for making such predictions at arbitrary locations from our measurements. For an in-depth discussion, see Cressie and Wikle (2011). Here I derive Kriging in a simplified case.
I will assume that $Z$ is an intrinsically stationary process. In particular, there exists some semivariogram $\gamma({\bf h})$ such that
\[ \text{var}[Z(\mathbf{s}+{\bf h}) - Z(\mathbf{s})] = 2\gamma({\bf h}) . \]
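To make this concrete, here is a small NumPy sketch of one common parametric semivariogram, the exponential model (it depends only on $\|{\bf h}\|$, anticipating the isotropy assumption below). The functional form and the `sill` and `range_` parameters are purely illustrative choices of mine, not anything the derivation requires.

```python
import numpy as np

def exponential_semivariogram(h, sill=1.0, range_=1.0):
    """One common isotropic semivariogram model (an illustrative choice).

    `h` is an array of lag vectors whose last axis is the spatial dimension;
    the value depends only on the lag length ||h||.
    """
    dist = np.linalg.norm(np.asarray(h, dtype=float), axis=-1)
    return sill * (1.0 - np.exp(-dist / range_))
```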
Furthermore, I will assume that the process is isotropic (i.e., that $\gamma({\bf h})$ is a function only of $\|{\bf h}\|$). As Andy described here, the existence of a stationary covariance function implies intrinsic stationarity. In addition, I will assume that the process has a constant mean, $\mathbb E[Z(\mathbf{s})] = \mu$. We would like to estimate $Z(\mathbf{s}_0)$ with a linear combination of our current observations. Our estimator will be
\[ \hat{Z} = \sum_{n=1}^N \lambda_n Z(\mathbf{s}_n) ,\]
where the weights $\lambda_n$ can be positive or negative. We further require that $\sum_n \lambda_n = 1$ so that our estimate is unbiased. We would like to choose the weights $\lambda_n$ so as to minimize the mean-squared predictive error
\[ \text{MSPE}(\lambda_1, \ldots, \lambda_N) = \mathbb E[(\hat{Z} - Z(\mathbf{s}_0) )^2] .\]
Let $\gamma_{nm}$ denote $\gamma(\mathbf{s}_n - \mathbf{s}_m)$. Expanding the expression for the mean-squared predictive error, we get
\[ \sum_{n,m} \lambda_n \lambda_m \mathbb E[Z(\mathbf{s}_n)Z(\mathbf{s}_m)] + \mathbb E[Z(\mathbf{s}_0)^2] - 2\sum_n \lambda_n \mathbb E[Z(\mathbf{s}_n) Z(\mathbf{s}_0)] . \]
Adding and subtracting $\sum_n \lambda_n \mathbb E[Z(\mathbf{s}_n)^2]$, and using the constraint $\sum_n \lambda_n = 1$ along with the fact that the constant mean gives $\mathbb E[(Z(\mathbf{s}_n) - Z(\mathbf{s}_m))^2] = \text{var}[Z(\mathbf{s}_n) - Z(\mathbf{s}_m)] = 2\gamma_{nm}$, this expression breaks into $A+B$, where
\[\begin{eqnarray*} A & = & -\sum_n \lambda_n \mathbb E[Z(\mathbf{s}_n)^2] + \sum_{n,m} \lambda_n\lambda_m \mathbb E[Z(\mathbf{s}_n)Z(\mathbf{s}_m)] \\ & = & -\frac12 \sum_{n,m} \lambda_n\lambda_m \mathbb E[(Z(\mathbf{s}_n) - Z(\mathbf{s}_m))^2] \\ & = & -\sum_{n,m} \lambda_n\lambda_m \gamma_{nm} , \end{eqnarray*}\]and
\[\begin{eqnarray*} B & = & \mathbb E[Z(\mathbf{s}_0)^2] - 2\sum_n \lambda_n \mathbb E[Z(\mathbf{s}_n)Z(\mathbf{s}_0)] + \sum_n \lambda_n \mathbb E[Z(\mathbf{s}_n)^2] \\ & = & \sum_n \lambda_n \mathbb E[(Z(\mathbf{s}_0) - Z(\mathbf{s}_n))^2] \\ & = & 2 \sum_n \lambda_n \gamma_{0n}. \end{eqnarray*}\]We minimize the quantity $A+B$ subject to the constraint $\sum_n \lambda_n = 1$ using Lagrange multipliers. To simplify the notation, define the matrix $\Gamma$ by $\Gamma_{nm} = \gamma_{nm}$, the vector $\boldsymbol\lambda = (\lambda_1, \ldots, \lambda_N)^{\mathsf T}$, the vector $\boldsymbol\gamma_0 = (\gamma_{01}, \ldots, \gamma_{0N})^{\mathsf T}$, and the vector $\mathbf{1} = (1, \ldots, 1)^{\mathsf T}$. Then the mean-squared predictive error is given by
\[ -\boldsymbol\lambda^{\mathsf T} \Gamma \boldsymbol\lambda + 2\boldsymbol\lambda^{\mathsf T} \boldsymbol\gamma_0 .\]
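As a sketch of how these quantities might be assembled numerically (reusing the illustrative `exponential_semivariogram` above; the function names here are my own), the following builds $\Gamma$ and $\boldsymbol\gamma_0$ from the sample locations and evaluates the objective $-\boldsymbol\lambda^{\mathsf T} \Gamma \boldsymbol\lambda + 2\boldsymbol\lambda^{\mathsf T} \boldsymbol\gamma_0$.

```python
import numpy as np

def build_gamma(locations, s0, semivariogram):
    """Assemble Gamma (entries gamma_{nm}) and gamma_0 (entries gamma_{0n}).

    `locations` is an (N, d) array holding s_1, ..., s_N and `s0` a length-d
    array; `semivariogram` is any function of lag vectors, e.g. the
    illustrative exponential_semivariogram above.
    """
    diffs = locations[:, None, :] - locations[None, :, :]  # (N, N, d) lags s_n - s_m
    Gamma = semivariogram(diffs)                           # Gamma_{nm} = gamma(s_n - s_m)
    gamma_0 = semivariogram(locations - s0)                # gamma_{0n}, symmetric in the lag
    return Gamma, gamma_0

def mspe_objective(lam, Gamma, gamma_0):
    """Evaluate -lambda^T Gamma lambda + 2 lambda^T gamma_0 (the MSPE when sum(lam) == 1)."""
    return -lam @ Gamma @ lam + 2.0 * lam @ gamma_0
```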
Incorporating the constraint via a Lagrange multiplier $\alpha$, we have the quantity
\[ \Phi(\boldsymbol\lambda, \alpha) = -\boldsymbol\lambda^{\mathsf T} \Gamma \boldsymbol\lambda + 2\boldsymbol\lambda^{\mathsf T} \boldsymbol\gamma_0 - \alpha(\mathbf{1}^{\mathsf T}\boldsymbol\lambda - 1) .\]
Differentiating $\Phi$ with respect to $\alpha$ and setting the result to zero gives back our constraint. Differentiating with respect to each $\lambda_n$, setting these derivatives to zero, and collecting the resulting equations in matrix form gives
\[\begin{eqnarray*} & & -2\Gamma\boldsymbol\lambda + 2\boldsymbol\gamma_0 = \alpha \mathbf{1} \\ & \Leftrightarrow & \boldsymbol\lambda = \Gamma^{-1}(\boldsymbol\gamma_0 - \alpha \mathbf{1}/2) . \end{eqnarray*}\]Incorporating the constraint gives
\[\begin{eqnarray*} 1 & = & \mathbf{1}^{\mathsf T} \boldsymbol\lambda \\ & = & \mathbf{1}^{\mathsf T} \Gamma^{-1}(\boldsymbol\gamma_0 - \alpha \mathbf{1}/2) . \end{eqnarray*}\]Solving for $\alpha$ and plugging this back into our formula for $\boldsymbol\lambda$, we find that
\[ \boldsymbol\lambda = \Gamma^{-1}\left( \boldsymbol\gamma_0 - \frac{\mathbf{1}^{\mathsf T}\Gamma^{-1}\boldsymbol\gamma_0 - 1}{\mathbf{1}^{\mathsf T}\Gamma^{-1}\mathbf{1}}\mathbf{1} \right) .\]
This gives us our optimal Kriging predictor.
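Continuing the sketches above, here is a minimal NumPy implementation of this closed-form solution. It uses linear solves in place of the explicit inverse $\Gamma^{-1}$ and assumes $\Gamma$ is nonsingular (distinct locations and a strictly conditionally negative-definite semivariogram); the toy locations and observations are made up purely for illustration.

```python
def kriging_weights(Gamma, gamma_0):
    """Ordinary-kriging weights from the closed-form expression above."""
    ones = np.ones(len(gamma_0))
    Gi_gamma0 = np.linalg.solve(Gamma, gamma_0)               # Gamma^{-1} gamma_0
    Gi_ones = np.linalg.solve(Gamma, ones)                    # Gamma^{-1} 1
    alpha_half = (ones @ Gi_gamma0 - 1.0) / (ones @ Gi_ones)  # the Lagrange term alpha/2
    return Gi_gamma0 - alpha_half * Gi_ones                   # lambda; sums to 1 by construction

# Toy usage with made-up data: predict Z at s_0 from N = 10 observations.
rng = np.random.default_rng(0)
locations = rng.uniform(size=(10, 2))   # s_1, ..., s_10 in the unit square
z = rng.normal(size=10)                 # stand-in values for Z(s_1), ..., Z(s_10)
s0 = np.array([0.5, 0.5])

Gamma, gamma_0 = build_gamma(locations, s0, exponential_semivariogram)
lam = kriging_weights(Gamma, gamma_0)
z_hat = lam @ z                         # the kriging predictor \hat{Z}
```

As a quick sanity check, if $\mathbf{s}_0$ coincides with one of the $\mathbf{s}_n$, then $\boldsymbol\gamma_0$ is the $n$th column of $\Gamma$, the formula gives $\boldsymbol\lambda = \mathbf{e}_n$, and the predictor reproduces that observation exactly (assuming $\gamma(\mathbf{0}) = 0$, i.e., no nugget effect).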