The count-min sketch is a time- and memory-efficient randomized data structure that provides a point estimate of the number of times an item has appeared in a data stream. The count-min sketch and related hash-based data structures are ubiquitous in systems that must track frequencies of data such as URLs, IP addresses, and language n-grams. We present a Bayesian view on the count-min sketch, using the same data structure, but providing a posterior distribution over the frequencies that characterizes the uncertainty arising from the hash-based approximation.
After several wonderful years at Harvard, and some fun times at Twitter and Google, I’ve moved to Princeton. I’ll miss all my amazing colleagues at Harvard and MIT, but I’m excited for the unique opportunities Princeton has to offer. I’ve renamed the group from the “Harvard Intelligent Probabilistic Systems” (HIPS) group to the “Laboratory for Intelligent Probabilistic Systems” (LIPS). (I should’ve listened to the advice of not putting the name of the university in the group name…) I’ve moved all the HIPS blog posts over to this new WordPress site, but I will keep the HIPS Github as that is where some well-known projects live, such as Autograd and Spearmint. For new projects, I’ve created a new repository at https://github.com/PrincetonLIPS. … Read More
One approach to AI research is to work directly on applications that matter — say, trying to improve production systems for speech recognition or medical imaging. But most research, even in applied fields like computer vision, is done on highly simplified proxies for the real world. Progress on object recognition benchmarks — from toy-ish ones like MNIST, NORB, and Caltech101, to complex and challenging ones like ImageNet and Pascal VOC — isn’t valuable in its own right, but only insofar as it yields insights that help us design better systems for real applications. So it’s natural to ask: which research results will generalize to new situations?
When we talk about priors and regularization, we often motivate them in terms of “incorporating knowledge” or “preventing overfitting.” In a sense, the two are equivalent: any prior or regularizer must favor certain explanations relative to others, so favoring one explanation is equivalent to punishing others. But I’ll argue that these are two very different phenomena, and it’s useful to know which one is going on.
In this post, I’ll summarize one of my favorite papers from ICML 2013: Fast Dropout Training, by Sida Wang and Christopher Manning. This paper derives an analytic approximation to dropout, a randomized regularization method recently proposed for training deep nets that has allowed big improvements in predictive accuracy. Their approximation gives a roughly 10-times speedup under certain conditions. Much more interestingly, the authors also show strong connections to existing regularization methods, shedding light on why dropout works so well.
This is the second of two posts based on a testing tutorial I’m writing with David Duvenaud. In my last post, I talked about checking the MCMC updates using unit tests. Most of the mistakes I’ve caught in my own code were ones I caught with unit tests. (Naturally, I have no idea about the ones I haven’t caught.) But no matter how thoroughly we unit test, there are still subtle bugs that slip through the cracks. Integration testing is a more global approach, and tests the overall behavior of the software, which depends on the interaction of multiple components.
Here’s an interesting question: how much space would it take to store the genomes of everyone in the world? Well, there are about 3 billion base pairs in a genome, and at 2 bits per base (4 choices), we have 6 billion bits or about 750 MB (say we are only storing one copy of each chromosome). Multiply this by 7 billion people and we have about 4800 petabytes. Ouch! But we can do a lot better.
This post is taken from a tutorial I am writing with David Duvenaud. Overview When you write a nontrivial piece of software, how often do you get it completely correct on the first try? When you implement a machine learning algorithm, how thorough are your tests? If your answers are “rarely” and “not very,” stop and think about the implications. There’s a large literature on testing the convergence of optimization algorithms and MCMC samplers, but I want to talk about a more basic problem here: how to test if your code correctly implements the mathematical specification of an algorithm.
The proof and intuition presented here come from this excellent writeup by Yuval Filmus, which in turn draws upon ideas in this book by Fumio Hiai and Denes Petz. Suppose that we have a sequence of real-valued random variables . Define the random variable (1) to be a scaled sum of the first variables in the sequence. Now, we would like to make interesting statements about the sequence (2)