Ryan Adams and I just uploaded to the arXiv our paper ``High-Dimensional Probability Estimation with Deep Density Models''. In this work, we introduce the *deep density model* (DDM), a new approach for density estimation.

As a motivating prequel, let's think about why we are interested in having density estimates in the first place. In his latest post, Michael alluded to the existence of a very deep connection between compression and unsupervised learning. Indeed, the goal of unsupervised learning is to characterize structure in our data, and this knowledge can then be used to establish summarized representations by making informed assumptions about the redundancy of unobserved data. Density estimation links these two concepts very explicitly. Allocation of probability mass throughout our space precisely quantifies our belief of the likelihood of occurrence of every point in it, and as such obviates our assumptions about structure in the data. Compression then consists of representation of the data under the assumption of absence of information in the regions of our space to which we've assigned low density. It is true that for almost every task we might be interested in, such as classification and unsupervised learning, there's some algorithm that confronts it without any mention of the word ``probability''. However, without a probabilistic interpretation, it is not clear exactly which assumptions such algorithm might make, and further, how these might be altered in order to improve the performance or speed of the algorithm.

Many techniques have been proposed to study patterns in data and to further provide density estimates. Probabilistic graphical models such as the Boltzmann machine produce powerful density estimates, but these are rather unnormalized. Bayesian nonparametric approaches are also very flexible, but require costly inference procedures. Manifold discovery methods such as the autoencoder produce useful representations of the data, but have no clear probabilistic interpretation. The manifold Parzen windows approach directly models the distribution of the data in the observed space, but is thwarted by the curse of dimensionality. In our work, we propose a way to provide fully-normalized density estimates on high-dimensional spaces.

Let's say we're given observations $\mathbf{y}_1,\ldots,\mathbf{y}_N\in\mathcal{Y}\subseteq\mathbb{R}^K$. The existence of structure in the data corresponds to their underlying probability distribution $p_{\mathbf{Y}}(\cdot)$ allocating most of its density to a region of small volume within the observed space. However, for interesting data with high dimensionality, this distribution of mass is horrendously complicated. As such, directly applying our tools on it would make a very challenging problem, especially due to the plaguing curse of dimensionality. However, let's take a step back and imagine an ideal situation we would love to be in. Let's say that we are bestowed with a bijective map $\boldsymbol{f}:\mathcal{X}\rightarrow\mathcal{Y}$ from a *representation space* $\mathcal{X}$ that generates our observations. Namely, we assume the transformation of the true distribution to this space, $p_{\mathbf{X}}(\cdot)$, is factorized and has known marginals $p_{X_k}(\cdot), k=1,\ldots,K$. We really couldn't ask for more: given an observation, we could easily compute its fully-normalized density, since we have access to its inverse transformation and its density, as well as to the determinant of the Jacobian which tells us how measures are contracted under $\boldsymbol{f}$:
$$\begin{align}
p_{\mathbf{Y}}(\mathbf{y}) &= \prod_{k=1}^K
p_{X_k}([\boldsymbol{f}^{-1}(\mathbf{y})]_k)\,\left|\frac{\partial\boldsymbol{f}(\mathbf{x})}{\partial\mathbf{x} }\right|_{\boldsymbol{f}^{-1}(\mathbf{y})}.
\end{align}
$$
Having access to the latent distribution also permits sampling from the observed distribution, as observed samples are deterministic functions of their latent representations.

In this work, we search for a bijective map under which the implied distribution on the representation space has an approximately factorized form with known marginals. In other words, we convert the problem of capturing the complexity of the observed distribution to the problem of capturing this complexity in a map that gives rise to this distribution. To tackle this problem, we summon tools from deep learning, which allow us to construct and optimize over a rich class of parametrized transformations to a representation space of equal dimension. To achieve these best-case properties we outlined earlier, we introduce novel penalties and procedures that, as a function of our class parametrization, (1) ensure optimization only over bijective transformations; (2) sculpt the latent marginal densities to forms of our choice; and (3) suppress dependencies between latent dimensions.

This process results in the deep density model, which allows us to compute to fully-normalized probability densities without a partition function and sample from the observed space without MCMC. This opens the door to addressing a variety of other tasks on high-dimensional data, such as constructing Bayesian classifiers; now, due to the calibration of the densities, these can express their confidence in owning observations in their class. Also, semi-supervised learning can be performed by incorporating density information to construct mixtures of deep density models and running algorithms such as expectation-maximization.

An information-theoretical perspective further allowed us to shed light on and quantify the intimate interplay between sparsity of representation and independence of representation dimensions. We can explicitly write down the (fixed and for now unknown) entropy of the data in the observed space in terms of the latent marginal entropies, the parametrized transformation, and the latent joint mutual information:

\begin{eqnarray*}

\mathcal{H}(p_{\mathbf{Y}}(\cdot))&=&\sum_{k=1}^K \mathcal{H}(p_{X_k}(\cdot))

- D(\prod_{k=1}^K p_{X_k}(\cdot)\,||\,p_{\mathbf{X}}(\cdot))\\

&&+\;\mathbb{E}_\mathcal{X} \left[ \log \left| \frac{\partial\boldsymbol{f}(\cdot)}{\partial\mathbf{y}} \right| \right] \;.

\end{eqnarray*}

Since we have control of the latent marginals and the transformation (via the parametrization), this not only permits us to bound the latent dependencies, but also suppress them. This, in turn, allows us to bound the entropy of the data in the observed space, and furthermore, to approximately characterize it. In addition, in the paper, we discuss how such analysis enables, for example, answering the question of how to choose the appropriate level of sparsity without the use of black magic.