Compressing genomes


Here’s an interesting question: how much space would it take to store the genomes of everyone in the world? Well, there are about 3 billion base pairs in a genome, and at 2 bits per base (4 choices), we have 6 billion bits or about 750 MB (say we are only storing one copy of each chromosome). Multiply this by 7 billion people and we have about 5,250 petabytes. Ouch! But we can do a lot better.
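The back-of-the-envelope estimate above can be checked with a few lines of Python. This is only a rough sketch: the constants are the round figures quoted in the excerpt, it assumes decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes), and the names `genome_bytes` and `pack_bases` are made up for illustration rather than taken from the post.

```python
# Rough storage estimate for uncompressed genomes at 2 bits per base.
BASE_PAIRS_PER_GENOME = 3_000_000_000  # ~3 billion, as quoted in the text
BITS_PER_BASE = 2                      # 4 choices: A, C, G, T
WORLD_POPULATION = 7_000_000_000       # ~7 billion, as quoted in the text

# Hypothetical 2-bit code for each base (any fixed assignment would do).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def genome_bytes(base_pairs=BASE_PAIRS_PER_GENOME, bits_per_base=BITS_PER_BASE):
    """Bytes needed for one genome copy with a fixed-width code."""
    return base_pairs * bits_per_base // 8


def pack_bases(sequence):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so every byte has the same layout.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


per_genome = genome_bytes()
total = per_genome * WORLD_POPULATION
print(f"one genome: {per_genome / 1e6:.0f} MB")
print(f"everyone:   {total / 1e15:.0f} PB")
```

With these inputs, one genome comes to 750 MB and the whole population to 5,250 PB. `pack_bases` shows where the 2 bits per base comes from: `pack_bases("ACGTACGT")` fits eight bases into two bytes.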

JIT compilation in MATLAB

A few years ago MATLAB introduced a Just-In-Time (JIT) accelerator under the hood. Because the JIT acceleration runs behind the scenes, it is easy to miss (in fact, MathWorks seems to intentionally hide it so that users do not change their coding style, probably because the JIT accelerator is changed regularly). I just wanted to briefly mention what a JIT accelerator is and what it does in MATLAB.

Machine Learning Glossary

I often have a hard time understanding the terminology in machine learning, even after almost three years in the field. For example, what is a Deep Belief Network? I attended a whole summer school on Deep Learning, but I’m still not quite sure. I decided to take a leap of faith and assume this is not just because the Deep Belief Networks in my brain are not functioning properly (although I am sure this is a factor). So, I created a Machine Learning Glossary to try to define some of these terms. The glossary can be found here. I have tried to write in an unpretentious style, defining things systematically and leaving no “exercises to the reader”. I also have a form for readers to request new definitions.

Data compression and unsupervised learning, Part 2


This is a continuation of my last post about data compression and machine learning. In this post, I will start to address the question:

Does “good” compression generally lead to “good” unsupervised learning?

To answer this question, we need to start with another question:

What is a “good” compression algorithm?


Data compression and unsupervised learning

Data compression and unsupervised learning are two concepts whose relationship is perhaps underappreciated. Compression and unsupervised learning are both about finding patterns in data, but does the similarity go any further? I argue that it does.

Machine Learning that Doesn’t Matter

At ICML last year, Kiri Wagstaff (KW) delivered a plenary talk and accompanying paper entitled “Machine Learning that Matters.” KW, a researcher at the NASA Jet Propulsion Laboratory (JPL), draws attention to a number of very serious issues in the field but draws conclusions that differ from my own.

KW criticizes existing benchmark data sets such as the UCI data sets or the MNIST handwritten digit data set for being irrelevant or obsolete. I certainly agree that being state-of-the-art on MNIST is not necessarily important (see my last post for more discussion on the need to carefully craft competitions based on benchmark data sets).


Healthy Competition?

Last week I attended the NIPS 2012 workshop on Connectomics: Opportunities and Challenges for Machine Learning, organized by Viren Jain and Moritz Helmstaedter. Connectomics is an emerging field that aims to map the neural wiring diagram of the brain. The current bottleneck to progress is analyzing the incredibly large (terabyte-to-petabyte range) data sets of 3D images obtained via electron microscopy. The analysis of the images entails tracing the neurons across images and eventually inferring synaptic connections based on physical proximity and other visual cues. One approach is manual tracing: at the workshop I learned that well over one million dollars has already been spent hiring manual tracers, resulting in data that is useful but still many orders of magnitude short of what would be needed to map even a very small brain.

The NIPS workshop was about using machine learning to speed up the process, and it consisted of talks, posters, and discussion. A previous workshop on this subject had a different flavor: it was a challenge workshop at ISBI 2012 (a similar idea to the Netflix challenge). To enter the challenge, anyone could download the training set and upload their results on the test data, which were then evaluated before the workshop (results here). At the NIPS workshop, the ISBI challenge was mentioned frequently, and scoring well on it seemed to be an important source of credibility. Such a challenge can have a profound impact on the field, but is it a positive impact?