One approach to AI research is to work directly on applications that matter — say, trying to improve production systems for speech recognition or medical imaging. But most research, even in applied fields like computer vision, is done on highly simplified proxies for the real world. Progress on object recognition benchmarks — from toy-ish ones like MNIST, NORB, and Caltech101, to complex and challenging ones like ImageNet and Pascal VOC — isn’t valuable in its own right, but only insofar as it yields insights that help us design better systems for real applications.
So it’s natural to ask: which research results will generalize to new situations?
One kind of result which probably won’t generalize is: “algorithm A works better than algorithm B.” Different application areas have their own requirements. Even for a single task like object recognition, different datasets have different properties which affect how well different approaches perform. We’ve learned some things from seeing how different approaches perform in aggregate across a wide range of benchmarks; for instance, random forests are a very good general-purpose classifier, and ensembles work much better than single classifiers. Benchmarks are also good for identifying talented individuals whom companies should hire. But when it comes to improving our algorithms, I think it’s hard to learn very much by comparing the final performances of different approaches.
The kind of result I believe generalizes to new situations is the nature of the tradeoffs between different approaches. Consider some of what we know about neural nets and deep learning. Compared with contrastive divergence, persistent contrastive divergence gives more accurate samples, at the cost of higher variance updates. Tempered transitions is more accurate still, but much more expensive. Hessian-free optimization is better than stochastic gradient descent at accounting for curvature, but it is much harder to implement and each iteration is far more expensive. Dropout attenuates overfitting, but at the expense of higher variance in the gradients.
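To make one of these tradeoffs concrete, here is a minimal sketch of the only difference between CD-k and persistent CD: where the negative-phase Gibbs chain starts. It uses a toy binary RBM in NumPy; the helper names and shapes are mine, not anyone’s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b_h):
    """Sample binary hidden units given visible units."""
    p = sigmoid(v @ W + b_h)
    return (rng.random(p.shape) < p).astype(float)

def sample_visible(h, W, b_v):
    """Sample binary visible units given hidden units."""
    p = sigmoid(h @ W.T + b_v)
    return (rng.random(p.shape) < p).astype(float)

def negative_phase(v_data, W, b_v, b_h, k=1, persistent_chain=None):
    """Negative-phase samples for an RBM update.

    CD-k:  restart the Gibbs chain at the data (persistent_chain is None).
           Cheap, low-variance updates, but the chain never strays far from
           the data, so the samples are biased.
    PCD:   continue a chain kept across updates (pass persistent_chain).
           The chain can approach the model's own distribution, giving more
           accurate samples, but it is decoupled from the current minibatch,
           so the updates are noisier.
    """
    v = v_data if persistent_chain is None else persistent_chain
    for _ in range(k):
        h = sample_hidden(v, W, b_h)
        v = sample_visible(h, W, b_v)
    return v  # for PCD, store this and pass it back in on the next update
```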
None of these approaches is uniformly better than its alternatives. Datasets vary a lot in factors like size, complexity, auxiliary information sources, and labeling noise, all of which affect the tradeoffs between algorithms. For larger datasets, regularization may be less of a concern, and computation time more critical. Computational resources also make a difference; e.g., GPUs have shifted the balance in favor of dense matrix multiplies. All of these factors are likely to change between the benchmark and the practical setting, so the arg max over all the algorithmic choices is likely to change as well. (This is why Netflix never wound up using the prize-winning algorithm.) But the nature of the tradeoffs themselves seems to hold up pretty well.
Knowing the tradeoffs doesn’t give a recipe for machine learning, but it does give a strategy for designing algorithms. As researchers, we have generally worked on several problems or datasets, and we know which algorithms work well and which factors are important on those datasets. Using these as reference points, we think about how the new problem differs (is it larger? noisier?), and that tells us in which directions to adjust. From this starting point, we can run diagnostics to tell us which issues are hurting performance and which useful sources of information our methods are overlooking, and this will inform our choice of algorithms.
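To give a flavor of what such a diagnostic might look like, here is a deliberately crude sketch in Python; the thresholds and suggested fixes are placeholders, not a recipe.

```python
def diagnose(train_err, val_err, gap_threshold=0.05, high_error=0.20):
    """Crude diagnostic: use the train/validation gap to decide which
    direction to adjust. The thresholds are arbitrary placeholders and
    would depend entirely on the problem at hand."""
    if val_err - train_err > gap_threshold:
        return ("large train/val gap: likely overfitting -> "
                "more data, stronger regularization (e.g. dropout)")
    if train_err > high_error:
        return ("high training error: likely underfitting or an optimization "
                "problem -> larger model, better-conditioned optimizer")
    return ("both errors low: look for overlooked sources of information, "
            "or for labeling noise putting a ceiling on accuracy")

# e.g. diagnose(0.02, 0.15) points toward regularization;
#      diagnose(0.30, 0.31) points toward the model or optimizer.
```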
This is part of why companies are spending so much money to hire top machine learning researchers. Quite a lot is known about machine learning, but the knowledge is still highly context dependent. (It’s a good thing for us researchers that “A is better than B” results don’t generalize; otherwise, there would be One Algorithm to Rule them All, and little room for new insight.)
In defense of toy problems
It may seem obvious that if we can’t replicate the real world in our benchmarks, we should at least try to get as close as possible. Shouldn’t we be working with large image datasets as representative as possible of the distribution of all images? Shouldn’t we build the best-performing algorithms we can on these problems, so that they will be close to what’s used in practice? The problem is, the larger the dataset and the more complex the algorithm, the harder it is to do science.
Brian Kernighan once said, “Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?” But running careful and meaningful experiments is harder still than debugging. You need to vary a lot of factors and measure their effect, which requires running the algorithm many more times than if you simply wanted to use it. Controlling for confounding factors requires a clear understanding of how everything fits together. If your algorithm is at the frontier of what you can implement, or of what can be run on modern computing technology, then how will you ever run experiments with it?
Trying to bust the hardest benchmarks usually pushes us to the frontier of what our brains and our computers can handle. This is reflected in the sorts of experiments we run. If you look through recent deep learning or Bayesian nonparametrics papers which use toy datasets, you’ll find carefully controlled experiments which vary one factor and show that it makes a difference. Authors are expected not just to show that their method works better, but also to present evidence for why it works better. But with the most challenging benchmarks, authors typically compare their final performance against numbers previously published in the literature. Those numbers were obtained using completely different systems, and it’s often hard to understand the reason for the improvement.
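Here is the shape of such a controlled experiment, sketched in Python with a simulated trial standing in for real training; the factor being varied (a hypothetical dropout rate), the seeds, and the numbers are all placeholders.

```python
import random
import statistics

def run_trial(dropout_rate, seed):
    """Stand-in for training and evaluating a model. Here it just simulates
    a noisy validation accuracy so the sketch runs end to end."""
    rng = random.Random(seed)
    return 0.9 - abs(dropout_rate - 0.5) * 0.1 + rng.gauss(0, 0.01)

def controlled_experiment(factor_values, n_seeds=5):
    """Vary ONE factor with everything else held fixed, and average over
    seeds so the comparison isn't an artifact of a single random run."""
    results = {}
    for value in factor_values:
        scores = [run_trial(value, seed) for seed in range(n_seeds)]
        results[value] = (statistics.mean(scores), statistics.stdev(scores))
    return results

if __name__ == "__main__":
    for value, (mean, std) in controlled_experiment([0.0, 0.25, 0.5]).items():
        print(f"dropout={value:.2f}: {mean:.3f} +/- {std:.3f}")
```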
I briefly worked on an object detection project with Joseph Lim, using the Pascal VOC dataset. Like most people working on object detection at the time, we built on top of Pedro Felzenszwalb’s deformable part model software package. For reasons I won’t go into, we tried replacing the SVM package it used with a different SVM package, and this led to a huge drop in performance. This seemed nonsensical: both packages were optimizing the same convex objective, so shouldn’t they have been interchangeable? After a week of digging (these models take a long time to run), Joseph figured out that it had to do with a difference in the stopping criterion. But if such subtle implementation details can have such a big impact overall, what are we to make of performance comparisons between completely different systems implemented by different individuals?
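The stopping-criterion issue is easy to reproduce in miniature. The sketch below uses scikit-learn and a synthetic dataset rather than the actual detection pipeline: it fits the same linear SVM objective with a loose and a tight convergence tolerance and compares the returned solutions. The size of the effect depends on the data; the point is only that optimizing the same convex objective does not guarantee getting back the same weights.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for a detection-style training set.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           random_state=0)

# Same convex objective, different stopping criteria.
loose = LinearSVC(C=1.0, tol=1e-1, max_iter=50, random_state=0).fit(X, y)
tight = LinearSVC(C=1.0, tol=1e-6, max_iter=100000, random_state=0).fit(X, y)

w_gap = np.linalg.norm(loose.coef_ - tight.coef_)
print(f"weight difference (L2 norm): {w_gap:.4f}")
print(f"training accuracy, loose tol: {loose.score(X, y):.4f}")
print(f"training accuracy, tight tol: {tight.score(X, y):.4f}")
```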
As I argued above, if we want results which will generalize, we need a clear causal explanation for why one algorithm behaves differently from another. With large datasets where experiments take days or weeks, complex algorithms that require building on someone else’s code, and time pressure to beat everyone else’s numbers, there just isn’t enough time to run enough experiments to get a full understanding. With small datasets, it’s possible to chase down all the subtle issues and explain why things happen.
In recent years, neural nets have smashed lots of benchmarks, but it’s important to remember that this follows upon decades of empirical sleuthing on toy datasets. In fact, the toy datasets are still relevant: even in the Big Data Capital of the World, Geoff Hinton still runs experiments on MNIST. I think these datasets will continue to deliver insights for some time to come.