This is a continuation of my last post about data compression and machine learning. In this post, I will start to address the question:
Does “good” compression generally lead to “good” unsupervised learning?
To answer this question, we need to start with another question:
What is a “good” compression algorithm?
Sadly, I will have to break this up even further, into two sub-questions:
1) What is a “good” lossless compression algorithm?
2) What is a "good" lossy compression algorithm?
-----------------------------------------------------------------
1) What is a "good" lossless compression algorithm?
The quick and tempting answer is: one that achieves Shannon’s entropy bound. However, this requires knowing the probability distribution of the data. In the types of situations I am discussing, we do not know the underlying distribution of the data. In a sense, if we did, then the problem (both unsupervised learning and compression) would already be solved.
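Just to make the bound concrete, here is a minimal sketch (the four-symbol source and its code lengths are invented purely for illustration): the entropy $H(p) = -\sum_i p_i \log_2 p_i$ lower-bounds the expected length, in bits per symbol, of any lossless code for that source.

```python
import math

def shannon_entropy(probs):
    """H(p) = -sum_i p_i * log2(p_i), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A toy source whose distribution we (unrealistically) know exactly.
p = [0.5, 0.25, 0.125, 0.125]
print("entropy bound:", shannon_entropy(p), "bits/symbol")   # 1.75

# An optimal prefix code for this particular source uses code lengths 1, 2, 3, 3
# and happens to meet the bound exactly.
lengths = [1, 2, 3, 3]
print("expected code length:", sum(pi * li for pi, li in zip(p, lengths)), "bits/symbol")  # 1.75
```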
If we do not know the probability distribution of the data, but have access to some data, then we say that a good lossless compression algorithm is one that works well empirically. This does not mean the algorithm that best compresses the available data -- that can lead to overfitting! (For example, each datum could be assigned its own integer code, so that any datum can be transmitted in about lg N bits, where N is the number of data points, regardless of the dimensionality of the data vectors. This compression is useless for generalization to unseen data.) I think the empirical evaluation tools useful here are the same as those used in machine learning: held-out validation sets, etc. My feeling is that, in most practical compression situations, trying to model the density of the examples directly is not likely to be useful (e.g., trying to come up with a normalized density over images). Thus, the answer to question (1) is: whatever algorithm empirically compresses well. Note that we suffer from problems similar to those in machine learning (like overfitting), but instead of arbitrarily choosing an objective function, we are given one naturally, namely the compression ratio.
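To sketch what this evaluation might look like in practice (the candidate codecs and the stand-in corpora below are arbitrary placeholders), one could compare off-the-shelf lossless compressors by the ratio they achieve on data held out from whatever was used to choose or tune them:

```python
import bz2
import lzma
import zlib

# Stand-in corpora; in practice these would be disjoint samples of your own data.
train_blob = b"the quick brown fox jumps over the lazy dog. " * 200
heldout_blob = b"the lazy dog naps while the quick brown fox runs. " * 200

candidates = {"zlib": zlib, "bz2": bz2, "lzma": lzma}

def ratio(blob: bytes, codec) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(codec.compress(blob)) / len(blob)

train_scores = {name: ratio(train_blob, codec) for name, codec in candidates.items()}
heldout_scores = {name: ratio(heldout_blob, codec) for name, codec in candidates.items()}

print("on tuning data  :", train_scores)
print("on held-out data:", heldout_scores)   # this is the number that should drive the choice
print("choosing", min(heldout_scores, key=heldout_scores.get))
```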
2) What is a “good” lossy compression algorithm?
Now things get more complicated. Again I am assuming we do not have access to the data distribution.
In the realm of lossy compression, some human-specified objective functions are needed. In particular, we need:
* a distortion function $D(x)$ that assigns a distortion to the reconstructed output (given the original input $x$)
* a loss function $L(R(x), D(x))$ that assigns a score to each point on the rate-distortion curve (the trade-off between compression ratio and distortion); one deliberately naive instantiation is sketched just below
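Purely for concreteness, and emphatically not as a recommendation, here is one naive pair of choices: squared error as a stand-in distortion, and a Lagrangian combination $L(R, D) = D + \lambda R$ as the loss over the rate-distortion trade-off. Both the metric and the value of $\lambda$ are assumptions made by whoever writes them down.

```python
import numpy as np

# Illustrative stand-ins only -- these choices encode assumptions about what
# "distortion" and a good rate/distortion trade-off mean for your data.
def distortion(x: np.ndarray, x_hat: np.ndarray) -> float:
    """D: mean squared error between the input and its reconstruction."""
    return float(np.mean((x - x_hat) ** 2))

def loss(rate_bits_per_dim: float, d: float, lam: float = 0.1) -> float:
    """L(R, D) = D + lambda * R, trading compressed size against distortion."""
    return d + lam * rate_bits_per_dim
```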
Situation 1: We have $D(x)$ and $L(R,D)$
If we have a loss function over the rate-distortion space, then a good lossy compression algorithm is one that minimizes the total loss averaged over plausible inputs. In other words, the evaluation would be the same as in the lossless case, except that instead of trying to minimize bits, we try to minimize a loss function that takes both size and distortion into account.
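A minimal sketch of that selection procedure, reusing the illustrative $D$ and $L$ above, with a toy uniform quantizer standing in for a family of lossy compressors and Gaussian vectors standing in for "plausible inputs":

```python
import numpy as np

rng = np.random.default_rng(0)
validation_x = rng.normal(size=(500, 16))     # stand-in for plausible inputs

def quantize(x, bits):
    """Toy lossy compressor: uniform quantization to 2**bits levels on [-4, 4]."""
    levels = 2 ** bits
    step = 8.0 / levels
    k = np.clip(np.floor((x + 4.0) / step), 0, levels - 1)
    return (k + 0.5) * step - 4.0             # reconstruct at each bin's center

def avg_loss(bits, lam=0.1):
    """Average L(R, D) = D + lambda * R over the validation inputs."""
    x_hat = quantize(validation_x, bits)
    d = float(np.mean((validation_x - x_hat) ** 2))
    return d + lam * bits                     # rate R in bits per dimension

# Pick the operating point (here, the bit depth) that minimizes the average loss.
best_bits = min(range(1, 9), key=avg_loss)
print("best rate:", best_bits, "bits/dim, loss:", round(avg_loss(best_bits), 4))
```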
Situation 2: What if we don't have these loss functions?
If we do not have $D(x)$, we really cannot do an honest evaluation. Needless to say, I don't advocate blindly using Euclidean distance as the distortion metric. There are some proxies for $L(R,D)$, however. For example, we can fix $R$, the compression ratio, and try to minimize $D$, the distortion.
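Sketching that proxy with the same toy setup (two arbitrary scalar quantizers, and squared error once again standing in for a real distortion metric): hold the rate fixed at the same number of bits per value for every candidate, then keep whichever candidate yields the lower distortion.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(500, 16))
BITS = 3                                  # fixed rate: 3 bits per value for both methods
levels = 2 ** BITS

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Candidate A: uniform quantization over a fixed range [-4, 4].
step = 8.0 / levels
k = np.clip(np.floor((x + 4.0) / step), 0, levels - 1)
uniform_hat = (k + 0.5) * step - 4.0

# Candidate B: equal-probability bins with reconstruction values fit to the data.
edges = np.quantile(x, np.linspace(0, 1, levels + 1))
idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, levels - 1)
centers = np.array([x[idx == j].mean() for j in range(levels)])
quantile_hat = centers[idx]

# Same R for both candidates, so the comparison reduces to distortion alone.
print("uniform  D:", round(mse(x, uniform_hat), 4))
print("quantile D:", round(mse(x, quantile_hat), 4))
```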
Other approaches to finding a "good" lossy compression algorithm implicitly define these objective functions without stating them outright. I endorse clearly stating one's assumptions, typically in the form of objective functions. As David MacKay writes in his book, "you can't do inference - or data compression - without making assumptions." To reiterate, if you know the underlying distribution of the data, then you can attempt to construct an ideal lossless compression algorithm without making assumptions. But in any other (real) situation, assumptions (encoded in these loss functions) are required. Even in the lossless case, where you do not need to explicitly choose a loss function, you are still making assumptions -- for example, the assumption that your validation set, training set, and unseen data all come from the same distribution.
This post attempts to discuss the challenges of data compression and how to start thinking about what makes a "good" data compression algorithm. Some ties to machine learning have emerged, such as the issue of overfitting and the necessity of making assumptions (whether explicit or implicit). In the next post I will attempt to tie this back to similarities with unsupervised learning.