Data compression and unsupervised learning are two concepts whose relationship is perhaps underappreciated. Compression and unsupervised learning are both about finding patterns in data -- but, does the similarity go any further? I argue that it does.
As an example, consider the k-means clustering algorithm. This is typically thought of as an unsupervised learning method, but it is a compression method too. The encoder performs k-means clustering, and then transmits the k cluster centers, and for each data vector its cluster membership (an integer between 0 and k-1) and the delta vector between its cluster center and itself. Assuming a reasonable clustering, the difference vectors will generally be much smaller in magnitude (and thus in bits) than the original vectors. Therefore, if k << N, the above data can be much smaller than the original representation. Note that I needed to assume a “reasonable clustering” to make this argument. Thus, our intuition about k-means doing a “good job” of unsupervised learning bears some resemblance to our intuition about it doing a “good job” of compression. We can start by considering the extremes of k=1 clusters and k=N clusters, both of which would generally not do a “good job” of unsupervised learning. For k=N, each data vector is its own cluster, and therefore no compression will occur (in fact the encoded file will be larger). For k=1, there is only one cluster, and so again compression will be minimal because most vectors will typically be far from the single cluster center. As the clustering becomes more “reasonable”, compression seems to improve.
This discussion lends itself to some interesting questions: does “good” compression generally lead to “good” unsupervised learning? What about the converse? Can we quantify this effect? More generally, by thinking about these two concepts in tandem, can we learning something new? In my following posts I will attempt address these issues in more detail. I will start by considering what makes a “good” compression algorithm and what makes a “good” unsupervised learning algorithm. I will also try to discuss how the difference between lossless and lossy compression plays into this story.