Before diving into more technical posts, I want to briefly touch on some basic questions, and the big picture behind unsupervised learning. I also want to do a bit of handwaving on sparsity—a topic that has gotten a lot of attention recently.

Let’s say we are given observations $\mathbf{y}_1,\ldots,\mathbf{y}_N\in\mathbb{R}^D$. These points are assumed to contain some underlying structure, which we seek to capture in order to perform tasks such as classification or compression. We can apply our algorithms on the data in their raw form—which carries unidentified redundancy—and hope for the best. However, a more sensible approach would be to first transform (not necessarily bijectively) the data into a different space via an *encoder* map $\mathbf{f}:\mathbb{R}^D\rightarrow\mathbb{R}^K$, which we choose to “reshuffle” the information to accentuate interesting patterns in the inputs. We call $\mathbf{x}_n=\mathbf{f}(\mathbf{y_n}),n=1,\ldots,N$ *representations* of the original data, and now apply our analysis to these instead. To start, we can apply a fixed mapping that we hope will reveal such patterns. Alternatively, we can first define criteria—in the form of constraints, or a cost function—for what makes a “good” representation, and optimize over a large class of parametrized maps to find one that best obeys these. Examples of desirable representation properties are linear independence between representation dimensions, or uniform marginalizations. Maps that achieve these can be found via PCA or the probability integral transform, respectively.

Another very interesting representation criterion is sparsity. A sparse representation of a dataset is one for which in each vector $\mathbf{x}_n$, most of the elements are forced to be zero. The select few that are not are meant to reflect distinguishing features of that example. Let’s think about this by considering a standard stacked autoencoder, where the encoder map constitutes of repeated compositions of sigmoidal nonlinearities and linear transformations. In this case, no element in the representation vector will be exactly zero, but we can introduce regularization to ensure lots of elements are very close to it. Can we still maintain true sparsity if we relax the requirement that most elements are exactly zero by allowing these elements to be *almost* zero (i.e, constraining their magnitudes to be less than some small $\varepsilon>0$)?

The answer is, in general, no. A misconception is that sparsity is a variational bound on element magnitudes, but it is a much deeper statement than that. It says that the sparse elements aren’t allowed to capture much—or possibly any—information. More specifically, this demands invariance of the input examples as function of perturbations of the sparse elements, or, in other words, flatness of the input space along sparse directions in the representation space, at the training examples. As such, the indistinguishability of the sparse elements gives rise to an information bottleneck, which is exactly what sparsity is about in the first place.

Returning to our autoencoder question, we see that a 0-th order variational bound isn’t sufficient. Sure, the “sparse” elements will pushed close to 0 by some regularization, but they will nevertheless carry nontrivial information content! The only difference is that the decimal place will be shifted. There’s really nothing special about the number zero: we could’ve just as well induced sparsity around $\pi$. The decoder of the stacked autoencoder will be able to pick up a complex enough of a map that it will be able to differentiate between distinct values in the sparse regime, and thus undo the pseudo-sparse compression into the epsilon-ball around zero. It’s even worse than that: not only the sigmoid nonlinearities don’t give us true zeros, these maps are asymptotically flat (that is, from the input space into representation space) as the representation elements gets close to zero! Had our decoder been the inverse of the encoder, we would’ve gotten exactly the opposite of the flatness property we actually want.

So, which nonlinearities can we instead use to induce true sparsity? Check out the rectifying nonlinearity as a (generally non-invertible) example of such map.