Compressing genomes

Here’s an interesting question: how much space would it take to store the genomes of everyone in the world? A human genome has about 3 billion base pairs, and at 2 bits per base (four possible values), that’s 6 billion bits, or about 750 MB, assuming we store only one copy of each chromosome. Multiply this by 7 billion people and we have about 4800 petabytes. Ouch! But we can do a lot better.
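To make the 2-bits-per-base figure concrete, here is a minimal sketch of such an encoding in Python. The pack and unpack helpers are hypothetical names used for illustration, and the sketch assumes the sequence contains only the four bases A, C, G, T (real sequence data also uses ambiguity codes such as N, which a 2-bit code cannot represent):

```python
# Minimal sketch (assumption: input uses only the bases A, C, G, T).
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"

def pack(seq: str) -> bytes:
    """Pack a base string into bytes, 4 bases (2 bits each) per byte."""
    out = bytearray()
    byte = 0
    for i, base in enumerate(seq):
        byte = (byte << 2) | CODE[base]
        if i % 4 == 3:                  # a byte is full after 4 bases
            out.append(byte)
            byte = 0
    if len(seq) % 4:                    # flush a partial final byte, left-aligned
        out.append(byte << 2 * (4 - len(seq) % 4))
    return bytes(out)

def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from a packed byte string."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out[:n])

seq = "ACGTACGTAC"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
print(len(packed))  # 3 bytes for 10 bases: 20 bits rounded up to a byte
```

At 2 bits per base, 3 billion bases come to 6 billion bits, which is the 750 MB figure above; any further savings have to come from exploiting structure in the data rather than the fixed-width code.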

Data compression and unsupervised learning, Part 2

This is a continuation of my last post about data compression and machine learning. In this post, I will start to address the question:

Does “good” compression generally lead to “good” unsupervised learning?

To answer this question, we need to start with another question:

What is a “good” compression algorithm?

Data compression and unsupervised learning

Data compression and unsupervised learning are two concepts whose relationship is perhaps underappreciated. Both are about finding patterns in data, but does the similarity go any further? I argue that it does.