Learning Image Features from Video
Robert Nishihara · December 18, 2012
While at NIPS, I came across the paper "Deep Learning of Invariant Features via Simulated Fixations in Video" by Will Zou, Shenghuo Zhu, Andrew Ng, and Kai Yu. It proposes a particularly appealing unsupervised method for learning image features from video. The method appears to be inspired in part by the human visual system. For instance, people learn from what is effectively video data, not static images, and the method likewise trains on video. The authors also attempt to mimic the human tendency to fixate on particular objects: they track objects through successive frames in order to provide more coherent data to the learning algorithm.
The authors use a stacked architecture, where each layer is trained by optimizing an embedding into a feature space. As usual, the optimization problem involves a reconstruction penalty and a sparsity penalty. In addition, however, it includes a temporal slowness penalty, which seeks to minimize the $L_1$ norm of the difference between the feature representations of consecutive frames. This enforces the intuition that good representations of images should change slowly as the images deform. Using this approach, the authors achieve improved performance on various classification tasks.
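To make the three penalties concrete, here is a rough sketch of what a single-layer objective of this kind might look like. This is not the authors' exact formulation: the linear encoder with tied weights, the absolute-value sparsity term, the function name `layer_objective`, and the weights `lam_sparse` and `lam_slow` are all assumptions made for illustration.

```python
import numpy as np

def layer_objective(W, X, lam_sparse=0.1, lam_slow=0.1):
    """Illustrative objective: reconstruction + sparsity + temporal slowness.

    X is a (T, d) array whose rows are consecutive frames of one tracked patch.
    W is a (k, d) weight matrix. The linear encoder and tied-weight decoder
    are assumptions of this sketch, not details taken from the paper.
    """
    Z = X @ W.T      # feature representation of each frame, shape (T, k)
    X_hat = Z @ W    # reconstruction using the same (tied) weights

    reconstruction = np.sum((X - X_hat) ** 2)
    sparsity = np.sum(np.abs(Z))
    # L1 norm of the difference between features of consecutive frames
    slowness = np.sum(np.abs(Z[1:] - Z[:-1]))

    return reconstruction + lam_sparse * sparsity + lam_slow * slowness

# Hypothetical data: 10 consecutive frames of a 16-pixel patch, 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 16))
W = rng.normal(size=(8, 16))
print(layer_objective(W, X))
```

The only term that looks at more than one frame at a time is `slowness`, so it is the only place where the video structure enters the objective.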
I suspect that even more information is contained in video data than the authors make use of. It makes sense that feature representations ought to change slowly, but they should also change consistently. For instance, an object rotating in one direction will likely continue to rotate in the same direction for several frames. If the lighting dims from one frame to another, it will probably continue to dim for several more frames. In other words, features should change slowly and smoothly. Such "slow and smooth" priors have been successfully used for motion estimation, and they capture something natural about the way images deform.
The authors encoded their intuition about slowly changing features into the optimization problem by adding a term that penalizes the first differences of the representations. To build in the intuition about smoothly changing features, we could add a term that penalizes the second differences of the representations. Of course, this opens the possibility of considering third and fourth differences as well, and it would be interesting to see if such higher-order differences give any added benefit.
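As a quick sketch of the idea, the penalty on first, second, or higher differences can be written as one function of the difference order. The function name `difference_penalty` and the choice of an $L_1$ norm for the higher orders are my own assumptions, made to mirror the first-difference term.

```python
import numpy as np

def difference_penalty(Z, order=1):
    """L1 norm of the order-th temporal differences of the features.

    Z is a (T, k) array of feature vectors for T consecutive frames.
    order=1 gives a slowness penalty; order=2 gives the proposed
    smoothness penalty. Using L1 for order >= 2 is an assumption.
    """
    D = np.diff(Z, n=order, axis=0)  # shape (T - order, k)
    return np.sum(np.abs(D))

# A feature that drifts steadily versus one that jitters back and forth.
drift = np.arange(5.0).reshape(5, 1)                   # 0, 1, 2, 3, 4
jitter = np.array([0.0, 1.0, 0.0, 1.0, 0.0]).reshape(5, 1)

print(difference_penalty(drift, 1), difference_penalty(jitter, 1))  # 4.0 4.0
print(difference_penalty(drift, 2), difference_penalty(jitter, 2))  # 0.0 6.0
```

The first-difference penalty cannot tell the two trajectories apart, since both move by one unit per frame. The second-difference penalty is zero for the steady drift and large for the jitter, which is the distinction between "slow" and "smooth" that I have in mind.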