So it’s natural to ask: which research results will generalize to new situations?

One kind of result which probably won’t generalize is: “algorithm A works better than algorithm B.” Different application areas have their own requirements. Even for a single task like object recognition, different datasets have different properties which affect how well different approaches perform. We’ve learned some things from seeing how different approaches perform in aggregate across a wide range of benchmarks; for instance, random forests are a very good general-purpose classifier, and ensembles work much better than single classifiers. Benchmarks are also good for identifying talented individuals whom companies should hire. But when it comes to improving our algorithms, I think it’s hard to learn very much by comparing the final performances of different approaches.

The kind of result I believe generalizes to new situations is the nature of the tradeoffs between different approaches. Consider some of what we know about neural nets and deep learning. Compared with contrastive divergence, persistent contrastive divergence gives more accurate samples, at the cost of higher variance updates. Tempered transitions is more accurate still, but much more expensive. Hessian-free optimization is better than stochastic gradient descent at accounting for curvature, but it is much harder to implement and each iteration is far more expensive. Dropout attenuates overfitting, but at the expense of higher variance in the gradients.

None of these approaches is uniformly better than its alternatives. Datasets vary a lot in factors like size, complexity, auxiliary information sources, and labeling noise which affect the tradeoffs between algorithms. For larger datasets, regularization may be less of a concern, and computation time more critical. Computational resources also make a difference, e.g. GPUs have shifted the balance in favor of dense matrix multiplies. All of these factors are likely to change between the benchmark and the practical setting, so the arg max over all the algorithmic choices is likely to change as well. (This is why Netflix never wound up using the prize-winning algorithm.) But the nature of the tradeoffs themselves seems to hold up pretty well.

Knowing the tradeoffs doesn’t give a recipe for machine learning, but it does give a strategy for designing algorithms. Researchers have generally worked on several problems or datasets, and we know which algorithms work well and which factors are important on these datasets. Using these as reference points, we think about how the new problem differs (is it larger? noisier?), and that tells us in which directions to adjust. From this starting point, we can run diagnostics to tell us which issues are hurting performance and which useful sources of information our methods are overlooking, and this will inform our choice of algorithms.

This is part of why companies are spending so much money to hire top machine learning researchers. Quite a lot is known about machine learning, but the knowledge is still highly context dependent. (It’s a good thing for us researchers that “A is better than B” results don’t generalize; otherwise, there would be One Algorithm to Rule them All, and little room for new insight.)

It may seem obvious that if we can’t replicate the real world in our benchmarks, we should at least try to get as close as possible. Shouldn’t we be working with large image datasets as representative as possible of the distribution of all images? Shouldn’t we build the very best performing algorithms that we can on these problems, so that they will be close to what’s used in practice? The problem is, the larger the dataset and the more complex the algorithm, the harder it is to do science.

Brian Kernighan once said, “Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?” But running careful and meaningful experiments is harder still than debugging. You need to vary a lot of factors and measure their effect, which requires running the algorithm many more times than if you simply wanted to use it. Controlling for confounding factors requires a clear understanding of how everything fits together. If your algorithm is at the frontier of what you can implement, or of what can be run on modern computing technology, then how will you ever run experiments with it?

Trying to bust the hardest benchmarks usually pushes us to the frontier of what our brains and our computers can handle. This is reflected in the sorts of experiments we run. If you look through recent deep learning or Bayesian nonparametrics papers which use toy datasets, you’ll find carefully controlled experiments which vary one factor and show that this makes a difference. Authors are expected not just to show that their method works better, but also to present evidence for why it works better. But with the most challenging benchmarks, authors typically compare their final performance against numbers previously published in the literature. These numbers were obtained using a completely different system, and it’s often hard to understand the reason for the improvement.

I briefly worked on an object detection project with Joseph Lim, using the Pascal VOC dataset. Like most people working on object detection at the time, we built on top of Pedro Felzenszwalb’s deformable part model software package. For reasons I won’t go into, we tried replacing the SVM package it used with a different SVM package, and this led to a huge drop in performance. This seemed nonsensical — both packages were optimizing the same convex objective, so shouldn’t they be interchangeable? After a week of digging (these models take a long time to run), Joseph figured out that it had to do with a difference in the stopping criterion. But if such subtle implementation details can have such a big impact overall, what are we to make of performance comparisons between completely different systems implemented by different individuals?

As I argued above, if we want results which will generalize, we need a clear causal explanation for why one algorithm behaves differently from another. With large datasets where experiments take days or weeks, complex algorithms that require building on someone else’s code, and time pressure to beat everyone else’s numbers, there just isn’t enough time to run enough experiments to get a full understanding. With small datasets, it’s possible to chase down all the subtle issues and explain why things happen.

In recent years, neural nets have smashed lots of benchmarks, but it’s important to remember that this follows upon decades of empirical sleuthing on toy datasets. In fact, the toy datasets are still relevant: even in the Big Data Capital of the World, Geoff Hinton still runs experiments on MNIST. I think these datasets will continue to deliver insights for some time to come.


The first lecture of every machine learning class has to include a cartoon like the following to explain overfitting:

This is more or less what happens if you are using a good and well-tuned learning algorithm. Algorithms can have particular pathologies that completely distort this picture, though.

Consider the example of linear regression without a regularization term. I generated synthetic data with \(D = 100\) feature dimensions, with everything drawn from Gaussian priors. Recall that the solution is given by the pseudoinverse formula \(X^\dagger y\), i.e. from the set of weight vectors which minimize error on the training set, it chooses the one with the smallest norm. Here’s what actually happens to the training and test error as a function of the number of training examples \(N\) (the dotted line shows Bayes error):

Weird — the test error actually increases until \(N\) roughly equals \(D\). What’s going on?

When \(N \ll D\), it’s really easy to match all the training examples. E.g., when \(N = 1\), there’s a \(D-1\)-dimensional affine space which does this. But as \(N\) becomes closer to \(D\), it gets much harder to match the training examples, and the algorithm goes crazy trying to do it. We can see this by looking at the norm of the weight vector \(w\):
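The pathology is easy to reproduce. Here is a minimal sketch of the experiment (my own reconstruction, not the original code; the noise scale and Gaussian priors are assumptions consistent with the setup described above):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 100
w_true = rng.standard_normal(D)   # true weights, drawn from a Gaussian prior
noise = 0.1                       # observation noise scale (an assumption)

def run(N, n_test=1000):
    X = rng.standard_normal((N, D))
    y = X @ w_true + noise * rng.standard_normal(N)
    w = np.linalg.pinv(X) @ y     # minimum-norm least-squares solution
    train_mse = np.mean((X @ w - y) ** 2)
    X_t = rng.standard_normal((n_test, D))
    y_t = X_t @ w_true + noise * rng.standard_normal(n_test)
    test_mse = np.mean((X_t @ w - y_t) ** 2)
    return np.linalg.norm(w), train_mse, test_mse

for N in [10, 50, 90, 100, 110, 200]:
    norm_w, tr, te = run(N)
    print(f"N={N:3d}  ||w|| = {norm_w:8.2f}  train MSE = {tr:.1e}  test MSE = {te:8.2f}")
```

Running this, the training error stays near zero for \(N \le D\), while \(\|w\|\) and the test error both spike as \(N\) approaches \(D\) and fall again once the system becomes overdetermined.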

So really, the large error is caused by the algorithm doing something silly: choosing a really large weight vector in order to match every training example. Now let’s see what happens when we add an L2 regularization term \(\lambda \|w\|^2\). Here are the results for several different choices of \(\lambda\):

The effect disappears! While the different \(\lambda\) values occupy different points on the overfitting/underfitting tradeoff curve, they all address the basic issue: the regression algorithm no longer compulsively tries to match every example.

You might ask, doesn’t this regularizer encode information about the solution, namely that it is likely to be near zero? To see that the regularization term really isn’t about this kind of knowledge, let’s concoct a regularizer based on misinformation. Specifically, we’ll use an L2 regularization term just as before, but instead of centering it at zero, let’s center it at \(-w\), the negation of the correct weight vector! Here’s what happens:

This is indeed a lousy regularizer, and for small training set sizes, it does worse than having no regularizer at all. But notice that it’s still enough to eliminate the pathology around \(N=D\), and it continues to outperform the unregularized version after that point.
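To make the comparison concrete, here is a sketch of the two regularized variants (again with assumed dimensions and noise level). The generalized ridge solution for the penalty \(\lambda\|w - c\|^2\) has the closed form \((X^\top X + \lambda I)^{-1}(X^\top y + \lambda c)\):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 100
w_true = rng.standard_normal(D)

def ridge(X, y, lam, center=None):
    # Minimize ||Xw - y||^2 + lam * ||w - center||^2; closed form:
    # w = (X^T X + lam I)^{-1} (X^T y + lam * center)
    c = np.zeros(X.shape[1]) if center is None else center
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y + lam * c)

N = 100  # the pathological regime, N == D
X = rng.standard_normal((N, D))
y = X @ w_true + 0.1 * rng.standard_normal(N)

w_plain = np.linalg.pinv(X) @ y                  # no regularizer
w_l2 = ridge(X, y, lam=1.0)                      # L2 centered at zero
w_bad = ridge(X, y, lam=1.0, center=-w_true)     # "misinformed" regularizer

for name, w in [("no reg", w_plain), ("L2 at 0", w_l2), ("L2 at -w_true", w_bad)]:
    print(f"{name:14s}  ||w|| = {np.linalg.norm(w):8.2f}")
```

Even the regularizer centered at the wrong point keeps the weight norm bounded near \(N = D\), which is the point of the experiment: the penalty tames the algorithm's compulsion to interpolate, regardless of where it is centered.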

Based on this, I would argue that for linear regression, L2 regularization isn’t encoding knowledge about where good solutions are likely to be found. It’s encoding knowledge about how the algorithm tends to misbehave when left to its own devices.

Usually, priors and regularizers are motivated by what they encourage rather than what they prevent, i.e. they “encourage smoothness” or “encourage sparsity.” One interesting exception is Hinton et al.’s paper which introduced the dropout trick. Consider the title: “Improving neural networks by *preventing co-adaptation* of feature detectors.” Co-adaptation refers to a situation where two units representing highly correlated features wind up with opposing weights. Their contributions wind up mostly canceling, but the difference may still help the network fit the training set better. This situation is unstable, because the pair of units can wind up behaving very differently if the data distribution changes slightly. The dropout trick is to randomly turn off 50% of the units on any given iteration. This prevents co-adaptation from happening by making it impossible for any two units to reliably communicate opposite signals.
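For reference, here is the mechanics of the trick in a few lines. This is the now-common “inverted” formulation, which scales the surviving units at training time rather than halving the weights at test time as the original paper did; the two are equivalent in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, train=True):
    """'Inverted' dropout: zero each unit independently with probability
    p_drop during training, scaling survivors by 1/(1 - p_drop) so the
    expected activation matches test time (when this is a no-op)."""
    if not train:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones(10000)
h_train = dropout(h)
print("fraction kept:", (h_train > 0).mean())
print("mean activation:", h_train.mean())
```

Because any given unit is absent half the time, no pair of units can rely on always firing together, which is exactly the co-adaptation being prevented.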

Next time you try to design a prior or regularizer, think about which you’re doing: are you trying to incorporate prior knowledge about the solution, or are you trying to correct for the algorithm’s pathologies?


The idea behind dropout is very simple: on each training-time evaluation of the neural net, half the inputs and hidden units are randomly ‘dropped’ (set to zero). This simple technique allowed large improvements in predictive accuracy for deep nets. However, it was not immediately clear why it works so well. The original paper gives several intuitions about why dropout works. One proposed explanation is that individual hidden units are discouraged from developing complex (and possibly brittle) dependencies on each other. We’ll return to that question later.

The key insight in the fast dropout paper is this: the input to each node in a neural network is a weighted sum of its inputs, and if some of those inputs are randomly being set to zero, then the total input is a weighted sum of Bernoulli random variables. By the Central Limit Theorem, this sum can be well-approximated by a Gaussian when a neuron has many inputs with comparable variance. The authors derive the mean and variance of these Gaussians, and can then approximately integrate over all exponentially-many combinations of dropouts in a one-layer network. They also approximate the output of each neuron with a Gaussian, by locally-linearizing the nonlinearity in each neuron, and derive deterministic update rules for the multi-layer case. This trick leads to a roughly 10-times speedup in training times. More interesting than the speedup, though, is that this approximate dropout technique is equivalent in some ways to standard approaches.
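The moment calculation at the heart of this is easy to check numerically. With keep probability \(p\), the dropped-out pre-activation \(s = \sum_i m_i w_i x_i\), \(m_i \sim \mathrm{Bernoulli}(p)\), has mean \(p \sum_i w_i x_i\) and variance \(p(1-p) \sum_i (w_i x_i)^2\). A quick Monte Carlo sanity check (my own sketch, with an assumed fan-in, not the paper’s code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 500, 0.5                      # fan-in and keep probability (assumed)
w = rng.standard_normal(d)           # incoming weights
x = rng.standard_normal(d)           # incoming activations

a = w * x                            # per-input contributions
mu = p * a.sum()                     # E[s]   for s = sum_i m_i w_i x_i
var = p * (1 - p) * (a ** 2).sum()   # Var[s] with m_i ~ Bernoulli(p)

# Monte Carlo check of the Gaussian approximation's first two moments
masks = rng.random((50000, d)) < p
s = masks.astype(np.float64) @ a
print(f"analytic : mean={mu:+.3f}, var={var:.1f}")
print(f"empirical: mean={s.mean():+.3f}, var={s.var():.1f}")
```

With 500 inputs the empirical distribution of \(s\) is already very close to the Gaussian with these moments, which is what makes the approximation workable in practice.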

The first equivalence shown is: if you normalize your data features to all have the same variance, then ridge regression is exactly equivalent to least-squares linear regression using dropout! This means that dropout is a regularization technique that’s invariant to rescaling of the inputs, which may be one reason why it works so well. In deep nets, this may be especially helpful, since the scale of the hidden node activations may change during training.

The second equivalence shown is that a one-layer neural network trained using dropout is approximately optimizing a lower bound on the model evidence of a logistic regression model. This result suggests that dropout is closely related to expectation-maximization (or, more generally, variational Bayes). The original dropout paper mentions a similar result – for one-layer networks, dropout approximately optimizes the geometric mean of the likelihood of all possible ways of removing nodes from the network – but the explicit derivations in the fast dropout paper make the link much clearer.

So, this paper gives us two new ways to think about dropout – as a scale-invariant regularizer, and as an approximate variational method. The success of dropout suggests that, even for large datasets, regularization of deep nets is still very important. Perhaps we should take a second look at variational inference in deep Bayesian neural networks, or further investigate scale-invariant regularization methods.

Thanks to Oren Rippel and Roger Grosse for helpful comments.

Update: Kevin Swersky pointed out a follow-up paper with even more connections: Dropout Training as Adaptive Regularization

In my last post, I talked about checking the MCMC updates using unit tests. Most of the mistakes I’ve caught in my own code were ones I caught with unit tests. (Naturally, I have no idea about the ones I haven’t caught.) But no matter how thoroughly we unit test, there are still subtle bugs that slip through the cracks. Integration testing is a more global approach, and tests the overall behavior of the software, which depends on the interaction of multiple components.

When testing samplers, we’re interested in testing two things: whether our algorithm is mathematically correct, and whether it is converging to a correct solution. These two goals are somewhat independent of each other: a mathematically correct algorithm can get stuck and fail to find a good solution, and (counterintuitively) a mathematically incorrect algorithm can often find the correct solution or at least get close. This post is about checking mathematical correctness.

The gold standard for testing MCMC algorithms is the Geweke test, as described in Geweke’s paper “Getting it right: joint distribution tests of posterior simulators.” [1] The basic idea is simple: suppose you have a generative model over parameters \(\theta\) and data \(x\), and you want to test an MCMC sampler for the posterior \(p(\theta | x)\). There are two different ways to sample from the joint distribution over \(\theta\) and \(x\). First, you can forward sample, i.e. sample \(\theta\) from \(p(\theta)\), and then sample \(x\) from \(p(x | \theta)\). Second, you can start from a forward sample, and run a Markov chain where you alternate between

- updating \(\theta\) using one of your MCMC transition operators, which should preserve the distribution \(p(\theta | x)\), and
- resampling the data from the distribution \(p(x | \theta)\).

Since each of these operations preserves the stationary distribution, each step of this chain is a perfect sample from the joint distribution \(p(\theta, x)\).

If your sampler is correct, then each of these two procedures should yield samples from exactly the same distribution. You can test this by comparing a variety of statistics of the samples, e.g. the mean of \(x\), the maximum absolute value of \(\theta\), and so on. No matter what statistics you choose, they should be indistinguishable between the two distributions.
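Here is a minimal sketch of the two procedures for a toy conjugate model (\(\theta \sim \mathcal{N}(0,1)\), \(x_i \mid \theta \sim \mathcal{N}(\theta, 1)\)), where the “MCMC transition” is an exact posterior draw, so the test should pass; the model and sample sizes are my own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, S = 5, 20000   # data points per replication, number of samples

def forward():
    theta = rng.standard_normal()                 # theta ~ N(0, 1)
    return theta, theta + rng.standard_normal(n)  # x_i ~ N(theta, 1)

def theta_update(x):
    # Exact draw from p(theta | x); in a real test this would be the
    # MCMC transition operator under scrutiny.
    prec = 1.0 + n
    return x.sum() / prec + rng.standard_normal() / np.sqrt(prec)

# Procedure 1: i.i.d. forward samples of theta
fwd = np.array([forward()[0] for _ in range(S)])

# Procedure 2: alternate the theta-update with resampling the data
theta, x = forward()
chain = np.empty(S)
for t in range(S):
    theta = theta_update(x)
    x = theta + rng.standard_normal(n)
    chain[t] = theta

print(f"forward: mean={fwd.mean():+.3f}, var={fwd.var():.3f}")
print(f"chain:   mean={chain.mean():+.3f}, var={chain.var():.3f}")
```

With a correct sampler, both procedures recover the prior marginal of \(\theta\) (mean 0, variance 1); deliberately breaking the update (say, inflating its variance a few percent) makes the chain’s statistics drift away.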

The beauty of this approach is that it amplifies subtle bugs in the sampler. For instance, suppose we have a bug in our sampler which makes the posterior noise variance 1% too large. This would be nearly impossible to detect by simply running your model on real data and seeing if the results look plausible. Now consider what the Geweke test would do. After the parameters are sampled, the data will be generated with a 1% larger noise variance. When the parameters are resampled, the noise variance will be 1% too large on top of that, or 2%. Pretty soon, the noise variance will explode.

In his paper, Geweke uses frequentist hypothesis tests to compare the distributions. This can be difficult, however, since you’d need to account for dependencies between the samples in order to determine significance. A simpler approach is to make a P-P plot of the statistics. If it’s very close to a straight line, the test “passes”:

If it’s far off, you have a bug:

If the results are noisy and unclear, generate more forward samples and run the chains for longer:

(While these figures are synthetic, they’re representative of outputs I’ve gotten in the past.) If you wind up with the third case, and you can’t run the sampler any longer because it’s already too slow, try reducing the number of data points. Not only does this speed up each iteration, but it also makes the chains mix faster — the more data points there are, the stronger the coupling between \(\theta\) and \(x\).

There’s a major drawback of the Geweke test, unfortunately: it gives you no indication of where the bug might be. All you can do is keep staring at your code till you find the bug. Therefore, there’s no point in running this test until all your unit tests already pass. This meshes with the general test-driven development methodology, where you get your unit tests to pass before the integration tests.

[1] J. Geweke. Getting it right: joint distribution tests of posterior simulators. JASA, 2004.

Here’s an interesting question: how much space would it take to store the genomes of everyone in the world? Well, there are about 3 billion base pairs in a genome, and at 2 bits per base (4 choices), we have 6 billion bits or about 750 MB (say we are only storing one copy of each chromosome). Multiply this by 7 billion people and we have about 4800 petabytes. Ouch! But we can do a lot better. We know that genomes of different humans are very similar; let’s say 99.9% similar. For the calculations that follow I’ll make the simplifying assumption that there exists some prototype genome, and that all other human genomes differ from it by exactly 0.1%, that is to say they differ from it at exactly 3 million locations. I will also assume that all allowed genomes are equally likely.

How do we use this assumption to compress genomes? First, consider the following simple scheme: we store the prototype genome, and then for each new genome, we store the locations of the changes, and the changes themselves. Since $2^{32}$ is about 4 billion, a regular 32 bit integer is enough to store the change locations. Then let’s use an extra 2 bits to store the actual new base, for a total of 34 bits per change. With 3 million changes, this gives us a size of about 3 million $\times$ 34 bits = 13 MB per genome. That’s already a lot better than 750 MB. But, can we do better? (Notation pit-stop: I will use the shorthand M for million and B for billion… but MB is still megabyte! I will use $\lg$ to mean log base 2.)

Luckily, Shannon figured out the theoretical limit on how much we can compress things. According to Shannon, since every allowed genome is assumed to be equally likely, all we need to do is count the number of allowed genomes and then take the log. The number of possible sets of change locations is $\binom{3B}{3M}$, or “3 Billion choose 3 Million”, because we need to pick 3M change locations of out 3B possible locations. For each of these sets, we need to choose the new bases themselves, and there are 3 choices per base. So the total number of possible allowed genomes is \[\binom{3B}{3M} 3^{3M} = \frac{3B!}{3M! (3B-3M)!} 3^{3M}= \frac{3B!}{3M! 2.997B!} 3^{3M}\] Now, we take the log of this to get the minimum file size: \[\lg{3B!}-\lg{3M!}-\lg{2.997B!}+3M\lg{3}\]Using Stirling’s formula, $\log{n!}\approx n\log{n}$, this gives us \[3B\lg{3B}-3M\lg{3M}-2.997B\lg{2.997B}+3M\lg{3}\](Note: I only kept the first term in Stirling’s approximation. It turns out that the next term cancels out in this case, so I skipped it for the sake of cleanliness.) From here we can just compute the result with a normal calculator and we get about 39 million bits, or about 5 MB.
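This calculation is easy to verify with the exact log-gamma function instead of Stirling’s approximation. A quick sketch (using 1 MB = 8 million bits, as in the figures above):

```python
import math

B3 = 3_000_000_000   # base pairs in the genome
M3 = 3_000_000       # changed locations (0.1%)

def lg_fact(n):
    return math.lgamma(n + 1) / math.log(2)   # lg(n!), no Stirling needed

# lg of the number of allowed genomes: C(3B, 3M) * 3^(3M)
limit_bits = lg_fact(B3) - lg_fact(M3) - lg_fact(B3 - M3) + M3 * math.log2(3)
naive_bits = M3 * (32 + 2)                    # the 34-bits-per-change scheme

print(f"theoretical limit: {limit_bits / 8e6:.2f} MB")   # about 4.9 MB
print(f"simple scheme:     {naive_bits / 8e6:.2f} MB")   # 12.75 MB
```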

What does this mean? When we took advantage of the similarity of genomes, we compressed from 750 MB to 13 MB (about 90 PB for 7 billion people, down from 4800 PB). But the calculation above says that the theoretical limit is 5 MB. Why is our algorithm 8 MB shy of the optimum? In other words, from an intuitive point of view, where is the “slack” in our method? You might enjoy thinking about this for a minute before reading on.

I have identified 3 sources of slack in the above algorithm.

- We used 2 bits for every base change, but actually there are only 3 possibilities for a change, not 4. So, theoretically, we only need $\lg{3}\approx 1.6$ bits per change, instead of 2 bits.
- (a) We used 32 bits to store each change location (which is a number between 1 and 3 billion), but actually we only need $\lg(3B)\approx 31.5$ bits. (b) Furthermore, we don’t even need $\lg(3B)$ bits every time: the first time we need a number between 1 and 3B, but the next time we only need to choose between 3B-1 locations, then 3B-2, etc., since there are no repeats.
- The set of changes is sent in a particular order, which is irrelevant. We could permute the list of changes, and the resulting genome would be the same.

The next question is: what is the size of each of these effects? I will address these in the same order as they appear above.

- Using 2 bits instead of $\lg{3}$ bits per change has a total waste per genome of $3M\cdot (2-\lg{3})$, which is about 1.2 million bits or **0.15 MB**.
- (a) Using 32 bits instead of $\lg{3B}\approx 31.5$ bits per change has a total waste per genome of $3M (32-\lg{3B})$, which is about 1.6 million bits or **0.19 MB**. (b) The size of effect 2(b) is the difference between sending $\lg{3B}$ bits 3 million times and the slightly more optimal version of sending $\lg{3B}$ bits, then $\lg{(3B-1)}$, and so on until $\lg{2.997B}$ bits for the last change location. Mathematically, this difference is \[(3M\cdot\lg{3B})-(\lg{3B}+\lg{(3B-1)}+ \ldots + \lg{2.997B})\] This effect is very tiny, only about 2000 bits or **0.00025 MB**.
- Sending an ordered list when ordering is not necessary means that all possible orderings of the 3M changes produce the same result. Since there are $3M!$ possible orderings of 3M changes, there is a redundancy factor of $3M!$ in this method. Thus the ordering has a waste of $\lg{3M!}$ bits. Here I again need Stirling’s formula, and this time I will keep the second term because it does not cancel. Stirling’s formula says \[\log{n!}\approx n\log{n}-n=n\log{(n/e)}\] We can again change this to log base 2 because the correction factor cancels from both sides, so we have \[\lg{3M!}\approx 3M\lg{(3M/e)}\] which is about 60 million bits or **7.5 MB**.

Let’s take a moment to interpret these results. First, it’s interesting that almost all of the slack comes from the ordering issue, slack source #3. This was not very intuitive to me at first; it seems like such an abstract concept that “we send things in some particular order but actually the order doesn’t matter”, but this “abstract” concept costs us 7.5 MB, or about 52 petabytes for all 7 billion genomes. Hopefully this issue will become more clear in the following paragraphs.

Second, did we find all of the slack? The original algorithm was 13 MB (actually 12.75 MB) and the theoretical limit was 5 MB (actually 4.87 MB). We found a total of 0.15 MB + 0.19 MB + 0.00025 MB + 7.53 MB = 7.87 MB of slack. Add this to the theoretical limit of 4.87 MB and we get 12.74 MB. This seems plausibly within rounding error of 12.75 MB, which is great! But, maybe we still missed something?

The answer is that we did not miss anything. Below, I will show definitively that we found all the slack. In particular, I will show how, starting with the mathematical expression for the theoretical minimum, $\lg{3B!}-\lg{3M!}-\lg{2.997B!}+3M\lg{3}$ bits, we can act on it with each slack source in turn and end up with the exact mathematical expression for our algorithm, $3M\cdot32 + 3M\cdot2$ bits. Here we go… [it is possible to skip the next 3 paragraphs if you believe me and do not need to see the calculations]

First, we apply slack source 1, which connects the $3M\lg{3}$ term to the $3M\cdot 2$ term. Slack source 1 says that there are only 3 choices for a change of base, not 4. This corresponds exactly to changing the $\lg{3}$ to $\lg{4}=2$ for each change, or $3M\lg{3}$ to $3M\lg{4}=3M\cdot 2$ bits in total.

Next we apply slack source 2, which connects $\lg\frac{3B!}{2.997B!}$ to $3M\cdot 32$. We start with slack source 2(b). In the theoretical calculation we started with the ratio $\frac{3B!}{2.997B!}$, which is the same as $3B\times (3B-1)\times \ldots \times 2.997B$. This product is fairly similar to $3B\times 3B\times \ldots \times 3B$, or $(3B)^{3M}$. Indeed, the mathematical approximation $\frac{3B!}{2.997B!}\approx(3B)^{3M}$ *exactly* corresponds to slack 2(b): it is the intuitive approximation that it is OK to just send $\lg{3B}$ bits per change location, instead of the slightly optimized version described above. Then, from there, $\lg{(3B)^{3M}}=3M\lg{3B}$, and effect 2(a) above says that we approximate $\lg{3B}$ as 32. When we apply this approximation, we get $3M\cdot 32$, which is exactly the term we were looking for.

Finally, we apply slack source 3. The only term left in the theoretical formula is the $3M!$ in the denominator, which has no corresponding term in the formula for our algorithm. This is exactly slack source 3, as described earlier! The slack due to ordering has a redundancy of $3M!$ possible orderings, and when we divide this out in the “choose” formula, this exactly corresponds to saying that, in theory, we do not need ordering, so we divide these permutations out, thus reducing the theoretical limit by exactly $\lg{(3M!)}$ bits.

To recap: I have now shown that the formula for the size of our algorithm can be obtained by starting with the theoretical formula and applying a series of mathematical approximations, and furthermore that each of these approximations exactly corresponds to one of the intuitive slack sources described above. Thus, we have “found” all the slack; i.e. we have an intuitive understanding of all the reasons why our method for storing genomes compresses less than the theoretical limit.
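The accounting above can also be checked numerically. Starting from the exact theoretical limit and adding the four slack terms should land exactly on the $3M\cdot 34$ bits of the simple scheme (a sketch, again using the exact log-gamma function):

```python
import math

B3, M3 = 3_000_000_000, 3_000_000
lg = math.log2

def lg_fact(n):
    return math.lgamma(n + 1) / math.log(2)   # lg(n!)

limit = lg_fact(B3) - lg_fact(M3) - lg_fact(B3 - M3) + M3 * lg(3)

slack1 = M3 * (2 - lg(3))                     # 2 bits vs lg 3 per new base
slack2a = M3 * (32 - lg(B3))                  # 32 bits vs lg 3B per location
slack2b = M3 * lg(B3) - (lg_fact(B3) - lg_fact(B3 - M3))  # (3B)^3M vs falling factorial
slack3 = lg_fact(M3)                          # ordering: lg(3M!)

total = limit + slack1 + slack2a + slack2b + slack3
for name, s in [("1", slack1), ("2a", slack2a), ("2b", slack2b), ("3", slack3)]:
    print(f"slack {name:2s}: {s / 8e6:8.5f} MB")
print(f"limit + slack = {total / 8e6:.2f} MB")   # exactly 3M * 34 bits = 12.75 MB
```

The four slack terms telescope algebraically against the theoretical formula, so the sum matches $3M\cdot 34$ bits to floating-point precision, confirming that no slack was missed.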

I want to add one last chapter to the story, namely to address the obvious question: “What is a better algorithm that performs closer to the theoretical limit?” One intuitive answer is the following: instead of encoding the change locations explicitly, encode the distances (“diffs”) between subsequent change locations. For example, instead of transmitting that there are changes at locations 195, 200, and 240, just encode “195”, “5”, and “40”. Intuitively, these diffs will be small, on average 1000 or so. By transmitting smaller numbers, we can save bits. Using our newfound intuition, we can also say definitively that encoding the diffs eliminates the order-based slack, because the diffs must be sent in the proper order for the answer to be correct. [Note: the discussion below about diffs is meant to illustrate the intuitions described above, and is not necessarily the best solution. To solve this problem more optimally, I would probably use arithmetic coding. A new method for compressing unordered sequences is described in this paper.]

To make this more concrete, I propose the following algorithm: sort the locations of all the changes and find the diffs. Then, use $n<32$ bits to encode each diff. If a particular diff requires more than $n$ bits (i.e., is $\geq 2^n$), then we will send a special symbol of $n$ zeros, followed by the full 32-bit representation of the diff. How do we choose $n$? Intuitively, if we pick $n$ too small then almost all of the diffs will overflow and we will end up sending them as full 32-bit representations, gaining nothing. On the other hand, if $n$ is too large then we don’t get significant gains because the diffs are sent with more bits than needed. Thus we can expect some middle ground that gives the best performance.

To test this, I implemented a codec using this method. I first generated random genomes using the assumptions above, then compressed them with this method and plotted the performance as a function of $n$. The results are shown in the figure below. (Code is posted here.)

The cyan line is the theoretical limit, at about 4.9 MB. The red line shows just the size of the diffs represented with $n$ bits. The blue line shows just the size of the diffs represented with $n+32$ bits (the “overflows”). The black line is the sum of the red line, the blue line, and the $3M\cdot 2$ bits used for the new bases (as you can see, I didn’t bother addressing slack source 1, but at least now I know it only costs me 0.15 MB per genome). What we see is just what we expected: things get worse, and then they get better. In this case we see the best value is $n=12$, and its size is about 5.4 MB, quite close to the theoretical limit.
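For readers who want to poke at this themselves, here is a compact sketch of the codec’s size accounting (my own reconstruction, not the posted code). To keep it fast, it uses a scaled-down model with the same 1000:1 ratio of positions to changes, so the diff distribution has the same shape, and then scales the totals back up:

```python
import numpy as np

rng = np.random.default_rng(0)
# Scaled-down model: 3 million positions, 3 thousand changes.
B, M = 3_000_000, 3_000

def encoded_bits(diffs, n):
    # n bits per diff; a diff that doesn't fit in n bits is sent as n zero
    # bits (an escape symbol) followed by its full 32-bit representation.
    small = (diffs >= 1) & (diffs < 2 ** n)
    return int(small.sum() * n + (~small).sum() * (n + 32))

locs = np.sort(rng.choice(B, size=M, replace=False))
diffs = np.diff(locs, prepend=0)

for n in [8, 10, 12, 14, 16, 24]:
    total = encoded_bits(diffs, n) * 1000 + 3_000_000 * 2  # scale up; +2 bits per base
    print(f"n={n:2d}: {total / 8e6:5.2f} MB")
```

The sizes trace out the same U-shape as the plot: too-small $n$ pays for constant overflows, too-large $n$ pays for padded diffs, and the minimum sits in between.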

To improve a diff-based system further, one would have to take into account the probability distribution of the diffs. Below is an empirical histogram of this distribution from one random genome I generated. (Quick sanity check: the picture is approximately a triangle with base length 2000 and height 0.001, so its area is plausibly equal to 1). Intuitively, the algorithm above is optimal for a distribution that is different from this one. In particular, the algorithm “thinks” the distribution is a piece-wise constant function with two pieces, one higher-probability piece between 1 and $2^n-1$ and another, lower-probability piece from $2^n$ to 3B. Changing $n$ changes the shapes of these steps, and presumably choosing $n=12$ maximizes the “overlap” between the implied piece-wise distribution and the real distribution (approximately) shown here.

Comments, questions, or corrections? Please email me! My email is listed near the top of my website.

When you write a nontrivial piece of software, how often do you get it completely correct on the first try? When you implement a machine learning algorithm, how thorough are your tests? If your answers are “rarely” and “not very,” stop and think about the implications.

There’s a large literature on testing the convergence of optimization algorithms and MCMC samplers, but I want to talk about a more basic problem here: how to test if your code correctly implements the mathematical specification of an algorithm. There are a lot of general strategies software developers use to test code, but machine learning algorithms, and especially sampling algorithms, present their own difficulties:

- The algorithms are stochastic, so there’s no single “correct” output
- Algorithms can fail because they get stuck in local optima, even if they are mathematically correct
- Good performance is often a matter of judgment: if your algorithm gets 83.5% of the test examples correct, does that mean it’s working?

Furthermore, many machine learning algorithms have a curious property: they are robust against bugs. Since they’re designed to deal with noisy data, they can often deal pretty well with noise caused by math mistakes as well. If you make a math mistake in your implementation, the algorithm might still make sensible-looking predictions. This is bad news, not good news. It means bugs are subtle and hard to detect. Your algorithm might work well in some situations, such as small toy datasets you use for validation, and completely fail in other situations — high dimensions, large numbers of training examples, noisy observations, etc. Even if it only hurts performance by a few percent, that difference may be important.

If you’re a researcher, there’s another reason to worry about this. Your job isn’t just to write an algorithm that makes good predictions — it’s to run experiments to figure out how one thing changes as a result of varying another. Even if the algorithm “works,” it may have funny behaviors as some parameterizations compensate for mistakes better than others. You may get curious results, such as performance worsening as you add more training data. Even worse, you may get plausible but incorrect results. Even if these aren’t the experiments you eventually publish, they’ll still mess up your intuition about the problem.

In the next two blog posts, I’ll focus on testing MCMC samplers, partly because they’re the kind of algorithm I have the most experience with, and partly because they are especially good illustrations of the challenges involved in testing machine learning code. These strategies should be more widely applicable, though.

In software development, it is useful to distinguish two different kinds of tests: unit tests and integration tests. Unit tests are short and simple tests which check the correctness of small pieces of code, such as individual functions. The idea is for the tests to be as local as possible — ideally, a single bug should cause a single test to fail. Integration tests test the overall behavior of the system, without reference to the underlying implementation. They are more global than unit tests: they test whether different parts of the system interact correctly to produce the desired behavior. Both kinds of tests are relevant to MCMC samplers.

A development pattern called test-driven development is built around testability: the idea is to first write unit and integration tests as a sort of formal specification for the software, and then to write code to make the tests pass. A fully test-driven development process might look something like:

- write integration tests based on high-level specs for the system
- write unit tests
- write code which makes the unit tests pass
- make sure the integration tests pass

In research, because the requirements are so amorphous, it is often difficult to come up with good specs in advance. However, I’ve found the general idea of getting unit tests to pass, followed by integration tests, to be a useful one. I’ll talk about unit tests in this blog post and integration tests in the followup.

Unit testing is about independently testing small chunks of code. Naturally, if you want to unit test your code, it needs to be divided into independently testable chunks, i.e. it needs to be modular. Such a modular design doesn’t happen automatically, especially in research, where you’re building up a code base organically with constantly shifting goals and strategies. It can be difficult to retrofit unit tests to an existing code base; in fact, the book *Working Effectively with Legacy Code* defines legacy code as “code without unit tests.” The whole book is about strategies for refactoring an existing code base to make it testable.

It’s much easier to try to keep things modular and testable from the beginning. In fact, one of the arguments behind test-driven development is that by writing unit tests along with the actual code, you force yourself to structure your code in a modular way, and this winds up having other benefits like reusability and maintainability. This agrees with my own experience.

Unfortunately, the way machine learning is typically taught encourages a programming style which is neither modular nor testable. In problem sets, you typically work through a bunch of math to derive the update rules for an iterative algorithm, and then implement the updates directly in MATLAB. This approach makes it easy for graders to spot differences from the correct solution, and you’ll probably be able to write a correct-looking homework assignment this way, but it’s not a robust way to build real software projects. Iterative algorithms are by nature difficult to test as black boxes.

Fortunately, most machine learning algorithms are formulated in terms of general mathematical principles that suggest a modular organization into testable chunks. For instance, we often formulate algorithms as optimization problems which can be solved using general techniques such as gradient descent or L-BFGS. From an implementation standpoint, this requires writing functions to (a) evaluate the objective function at a point, and (b) compute the gradient of the objective function with respect to the model parameters. These functions are then fed to library optimization routines. The gradients can be checked using finite difference techniques. Conveniently, most scientific computing environments provide implementations of these checks; for instance, scipy.optimize.check_grad (in Python) or Carl Rasmussen’s checkgrad function (in MATLAB).
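As a minimal, library-free sketch of such a finite difference check (the quadratic objective and all names below are invented for illustration):

```python
# Central-difference approximation to the gradient, compared against
# the analytic gradient on a toy quadratic objective.
def approx_grad(f, x, eps=1e-6):
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g

def f(w):          # toy objective: 0.5 * ||w||^2
    return 0.5 * sum(wi * wi for wi in w)

def grad_f(w):     # analytic gradient of f: just w itself
    return list(w)

x0 = [0.3, -1.2, 2.0]
max_err = max(abs(a - b) for a, b in zip(approx_grad(f, x0), grad_f(x0)))
assert max_err < 1e-6
```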

This organization into gradient computations and general purpose optimization routines exemplifies a generally useful pattern for machine learning code: separate out the implementations of the model and general-purpose algorithms. Consider, for instance, writing a Gibbs sampler for a distribution. Seemingly the most straightforward way to implement a Gibbs sampler would be to derive by hand the update rules for the individual variables and then write a single iterative routine using those rules. As I mentioned before, such an approach can be difficult to test.

Now consider how we might separate out the model from the algorithm. In each update, we sample a variable \(X\) from its conditional distribution \(p(X|Z)\). It is difficult to directly test the correctness of samples from the conditional distribution: such a test would have to be slow and nondeterministic, both highly undesirable properties for a unit test to have. A simpler approach is to test that the conditional distribution is consistent with the joint distribution. In particular, for any two values \(X\) and \(X^\prime\), we must have:

\(\frac{p(X^\prime | Z)}{p(X | Z)} = \frac{p(X^\prime, Z)}{p(X, Z)}\)

Importantly, this relationship must hold *exactly* for *any* two values \(X\) and \(X^\prime\). For the vast majority of models we use, if our formula for the conditional probability is wrong, then this relationship would almost surely fail to hold for two randomly chosen values. Therefore, a simple and highly effective test for conditional probability computations is to choose random values for \(X\), \(X^\prime\), and \(Z\), and verify the above equality to a suitable precision, such as \(10^{-10}\). (As usual in probabilistic inference, we should use log probabilities rather than probabilities for numerical reasons.)
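For concreteness, here is what this test might look like for a toy model with two binary variables; the joint probability table is made up for the example:

```python
import math
import random

# Made-up joint distribution over two binary variables X and Z.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def log_joint(x, z):
    return math.log(joint[(x, z)])

def log_cond_x(x, z):
    # p(x|z) = p(x,z) / p(z)
    pz = joint[(0, z)] + joint[(1, z)]
    return math.log(joint[(x, z)] / pz)

# Unit test: for random x, x', z the log-ratio identity must hold
# to tight numerical precision.
for _ in range(100):
    z = random.randint(0, 1)
    x, xp = random.randint(0, 1), random.randint(0, 1)
    lhs = log_cond_x(xp, z) - log_cond_x(x, z)
    rhs = log_joint(xp, z) - log_joint(x, z)
    assert abs(lhs - rhs) < 1e-10
```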

In terms of implementation, this suggests writing three separate modules:

- classes representing probability distributions which know how to (1) evaluate the log probability density function (or probability mass function) at a point, and (2) generate samples
- a specification of the model in terms of functions which compute the probability of a joint assignment and functions which return conditional probability distributions
- a Gibbs sampling routine, which sweeps over the variables in the model, replacing each with a sample from its conditional distribution
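A minimal sketch of how these modules might fit together, with a hypothetical distribution class and a made-up two-variable model:

```python
import random

class Bernoulli:
    """Toy distribution class: knows how to sample (and could also
    expose a log probability mass function)."""
    def __init__(self, p):
        self.p = p
    def sample(self, rng):
        return 1 if rng.random() < self.p else 0

# Model spec: map each variable to a function returning its conditional
# distribution given the current state (an invented two-variable model).
conditionals = {
    'x': lambda s: Bernoulli(0.5),
    'z': lambda s: Bernoulli(0.25 if s['x'] else 0.75),
}

def gibbs_sweep(state, conditionals, rng):
    # The sampler never looks inside the model; it only asks for
    # conditional distributions and samples from them.
    for var, cond in conditionals.items():
        state[var] = cond(state).sample(rng)
    return state

rng = random.Random(0)
state = {'x': 0, 'z': 0}
for _ in range(10):
    state = gibbs_sweep(state, conditionals, rng)
```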

The distribution classes require their own forms of unit testing. For instance, we may generate large numbers of samples from the distributions, and then ensure that the empirical moments are consistent with the exact moments. I’ll defer discussion of testing the Gibbs sampler itself to my next post.
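As a sketch, a moment-based test for a Gaussian sampler might look like the following; the sample size and tolerances are ad hoc choices:

```python
import random

# Check that empirical moments of 200,000 samples from N(0, 4) are
# close to the exact moments. Seeding keeps the test deterministic.
random.seed(0)
samples = [random.gauss(0.0, 2.0) for _ in range(200000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
assert abs(mean - 0.0) < 0.05   # exact mean is 0
assert abs(var - 4.0) < 0.1     # exact variance is sigma^2 = 4
```

A more principled version would compute the standard error of each empirical moment and only fail on, say, a five-sigma deviation.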

While this modular organization was originally motivated by testability, it also has benefits in terms of reusability: it’s easy to reuse the distribution classes between entirely different projects, and the conditional probability functions can be reused in samplers which are more sophisticated than Gibbs sampling.

While Gibbs sampling is a relatively simple example, this thought process is a good way of coming up with modular and testable ways to implement machine learning algorithms. With more work, it can be extended to trickier samplers like Metropolis-Hastings. The general idea is to make a list of mathematical properties that the component functions should satisfy, and then check that they actually satisfy those properties.
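For example, one property a Metropolis-Hastings implementation should satisfy is detailed balance with respect to the target distribution, which can be checked directly on a small discrete example (the numbers below are arbitrary):

```python
# Check detailed balance, pi(i) T(i,j) == pi(j) T(j,i), for the
# Metropolis-Hastings transition probabilities on a toy 3-state target.
pi = [0.2, 0.3, 0.5]                 # arbitrary target distribution
q = [[1/3] * 3 for _ in range(3)]    # symmetric uniform proposal

def transition(i, j):
    if i == j:
        # Probability of staying put: whatever is left over.
        return 1 - sum(transition(i, k) for k in range(3) if k != i)
    return q[i][j] * min(1.0, pi[j] / pi[i])   # MH acceptance rule

for i in range(3):
    for j in range(3):
        assert abs(pi[i] * transition(i, j) - pi[j] * transition(j, i)) < 1e-12
```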

There are a lot of helpful software tools for managing unit tests. I use nose, which is a simple and elegant testing framework for Python. In order to avoid clutter, I keep the tests in a separate directory whose structure parallels the main code directory.

I’ve found that unit testing of the sort that I’ve described takes very little time and provides substantial peace of mind. It still doesn’t guarantee correctness of the full algorithm, of course, so in the next post I’ll talk about integration testing.

Consider an infinite sequence of random variables

\begin{equation}

X_1, X_2, \ldots .

\end{equation}

Define the random variable

\begin{equation}

A_N = \frac{X_1 + \cdots + X_N}{\sqrt{N}}

\end{equation}

to be a scaled sum of the first $N$ variables in the sequence. Now, we would like to make interesting statements about the sequence

\begin{equation}

A_1, A_2, \ldots .

\end{equation}

The central limit theorem is quite general. To simplify this exposition, I will make a number of assumptions. First, I will assume that the $X_i$ are independent and identically distributed. Second, I will assume that each $X_i$ has mean $\mathbb E[X_n] = 0$ and variance $\textnormal{var}(X_n) = \sigma^2$. Finally, I will assume that the moment generating function (to be defined below) converges (this condition requires all moments of the distribution to exist).

Under these conditions, the central limit theorem tells us that

\begin{equation}

A_1, A_2, \ldots \to \mathcal N(0,\sigma^2) ,

\end{equation}

where $\mathcal N(\mu, \sigma^2)$ is the normal distribution with density function

\begin{equation}

\mathcal N(x ; \mu, \sigma^2) = \frac{1}{ \sigma \sqrt{2 \pi}} \exp\left(-\frac{ (x-\mu)^2}{2\sigma^2}\right) ,

\end{equation}

and where $P_1 \to P_2$ means that $P_1([a,b]) \to P_2([a,b])$ for all intervals $[a,b] \subset \mathbb R$. It is not immediately obvious that the sequence $\{A_n\}$ should converge, but if it does converge, the normal distribution is the natural candidate. Suppose it converges to some distribution $D$. Now, consider the random variable

\begin{equation}

B_N = \frac{1}{\sqrt{2}} \left(\frac{X_1 + \cdots + X_N}{\sqrt{N}} + \frac{X_{N+1} + \cdots + X_{2N}}{\sqrt{N}} \right) .

\end{equation}

Of course, $B_N$ is just $A_{2N}$, so $B_N \to D$. But the first term on the RHS is $A_N$, and the second term on the RHS has the same distribution as $A_N$, so we have

\begin{equation}

B_N \to \frac{D + D}{\sqrt{2}} ,

\end{equation}

where the two $D$’s are independent random variables with the same distribution. So $D$ and $(D + D)/\sqrt{2}$ must have the same distribution. By grouping terms in different proportions, we can derive similar properties of $D$. Since the normal distribution satisfies

\begin{eqnarray}

\mathcal N(\mu_1, \sigma_1^2) + \mathcal N(\mu_2, \sigma_2^2) & = & \mathcal N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2) \\

c \mathcal N(\mu, \sigma^2) & = & \mathcal N(c\mu, c^2\sigma^2) ,

\end{eqnarray}

it is a natural candidate for the distribution $D$.

To prove the central limit theorem, we will make use of the moment generating function

\begin{equation}

M_{X}(t) = \mathbb E [\exp (tX) ] = 1 + \mathbb E[X] t + \frac{1}{2} \mathbb E[X^2] t^2 + \cdots

\end{equation}

and the cumulant generating function

\begin{equation}

K_{X}(t) = \log \mathbb E[\exp (tX) ] = \mathbb E[X] t + \frac{\mathbb E[X^2] - \mathbb E[X]^2}{2} t^2 + \cdots .

\end{equation}

The coefficients of $t^n$ in the moment generating function and the cumulant generating function (multiplied by $n!$) are referred to as “moments” and “cumulants” respectively. The cumulants and the moments are closely related, and the values of one determine the values of the other. Incidentally, the moment generating function and the cumulant generating function of the normal distribution $\mathcal N(\mu,\sigma^2)$ are given by

\begin{eqnarray}

M(t) & = & \exp\left(\mu t + \frac{\sigma^2}{2} t^2 \right) \\

K(t) & = & \mu t + \frac{\sigma^2}{2} t^2 .

\end{eqnarray}

Note that the moment generating function satisfies

\begin{eqnarray}

M_{X+Y}(t) & = & M_{X}(t) M_{Y}(t) \\

M_{cX}(t) & = & M_{X}(ct) .

\end{eqnarray}

It follows that the cumulant generating function satisfies

\begin{eqnarray}

K_{X+Y}(t) & = & K_{X}(t) + K_{Y}(t) \\

K_{cX}(t) & = & K_{X}(ct) .

\end{eqnarray}

Now, we are going to use all of these tools. Let’s inspect the cumulant generating function of $A_N$. We have

\begin{equation}

K_{A_N}(t) = K_{X_1}\left(\frac{t}{\sqrt{N}}\right) + \cdots + K_{X_N}\left(\frac{t}{\sqrt{N}}\right) .

\end{equation}

Let $k^m_{X}$ be the $m$th coefficient of $K_{X}(t)$. Equating powers of $t$ above, we get

\begin{equation}

k^m_{A_N} = \left(k^m_{X_1} + \cdots + k^m_{X_N} \right) N^{-\frac{m}{2}} .

\end{equation}

From the case $m=1$, we see that

\begin{equation}

\mathbb E[A_N] = \frac{\mathbb E[X_1] + \cdots + \mathbb E[X_N]}{\sqrt{N}} = 0,

\end{equation}

as expected. From the case $m=2$, we see that

\begin{equation}

\textnormal{var}(A_N) = \frac{\textnormal{var}(X_1) + \cdots + \textnormal{var}(X_N)}{N} = \sigma^2,

\end{equation}

also as expected. But what about higher cumulants? For $m > 2$, as $N \to \infty$, we have

\begin{equation}

k^m_{A_N} = k^m_{X_1} N^{1-\frac{m}{2}} \to 0 .

\end{equation}

Therefore, the higher cumulants all vanish. It follows that the cumulants of the sequence $\{A_N\}$ converge to the cumulants of $\mathcal N(0,\sigma^2)$, and hence the moments of the sequence $\{A_N\}$ converge to the moments of $\mathcal N(0,\sigma^2)$. It follows from Lévy’s continuity theorem that $A_N \to \mathcal N(0,\sigma^2)$, as desired.
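The key step in this argument, additivity of cumulants across independent variables, can also be checked exactly by brute-force enumeration for a small example (the centered Bernoulli variable below is an arbitrary choice):

```python
from fractions import Fraction
from itertools import product

# X takes value 3/4 with prob 1/4 and -1/4 with prob 3/4, so E[X] = 0.
p = Fraction(1, 4)
support = [(Fraction(3, 4), p), (Fraction(-1, 4), 1 - p)]

def third_moment_of_sum(N):
    # E[(X_1 + ... + X_N)^3], computed exactly by enumerating all 2^N
    # outcomes. Since E[X] = 0, this equals the third central moment,
    # i.e. the third cumulant of the sum.
    total = Fraction(0)
    for outcome in product(support, repeat=N):
        s = sum(x for x, _ in outcome)
        prob = Fraction(1)
        for _, q in outcome:
            prob *= q
        total += prob * s ** 3
    return total

N = 8
# Third cumulants add across independent variables: k3(S_N) = N * k3(X).
assert third_moment_of_sum(N) == N * third_moment_of_sum(1)
# Dividing by N^{3/2} then gives k3(A_N) = k3(X) * N^{-1/2} -> 0.
```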

Consider the following MATLAB code:

```matlab
N = 1e7;

tic
w = zeros(1, N);
for i = 1:N
    w(i) = i*5;
end
fprintf('For loop: %.4fs\n', toc);

tic
y = (1:N)*5;
fprintf('Vectorized: %.4fs\n', toc);
```

The code takes the integers from 1 to 10 million and multiplies each one by 5. When I run this on my machine (Mac Mini), the output is:

For loop: 0.0590s

Vectorized: 0.0720s

This means that the “for” loop is *faster* than the vectorized code, contrary to the conventional wisdom that loops are much slower than vectorized code in MATLAB. Now I want to try the same thing without any JIT acceleration. To do this, I type *feature('accel','off')* into the MATLAB Command Window and run the code again. Doing so gives the following results:

For loop: 19.4777s

Vectorized: 0.0487s

I can’t vouch for why the vectorized code actually went faster in this run, although the effect seems to be reproducible (perhaps there is a fraction of a second overhead trying to accelerate, which is wasted when the code can’t be accelerated). However, the important point is that the loop implementation is slowed down by more than 300x when acceleration is turned off. This is JIT acceleration at work (or not at work, in this case).

[Note: if you are following along, don’t forget to type *feature('accel','on')* to turn the accelerator back on when you’re done!]

Having said all this, I still suggest writing vectorized MATLAB code whenever possible. The JIT accelerator is fragile and some fairly simple operations can cause the accelerator to skip over a loop. Still, I wanted to bring this to your attention, at least to help you avoid embarrassing situations like the one I had a few years ago as a TA. I was giving a MATLAB tutorial to freshmen at Princeton and I was preaching the virtues of writing vectorized code. I decided to do a demonstration much like the one above (but computing Fibonacci numbers with a loop vs. vectorized using the Fibonacci formula). To my embarrassment the demo failed completely: the loop performed better than the vectorized code (and avoided overflow/underflow errors to boot!). That is how I first discovered MATLAB’s mysterious JIT accelerator.

Introspection has a long and distinguished history in the early days of AI:

- Newell and Simon’s General Problem Solver [1] was explicitly based on intuitions about how we solve problems, and they developed it by working through lots of examples by hand
- Lenat’s Automated Mathematician [2] was supposed to make mathematical conjectures based on heuristics that human mathematicians followed
- Hofstadter’s book *Fluid Concepts and Creative Analogies* [3] details AI systems which serve as cognitive models for a variety of toy problems. Some nice examples are Seek-Whence (for explaining number sequences) and Copycat (for letter string analogies).

Like Gowers’s work, these classic systems were intended primarily as cognitive models. Simon was inspired to work on AI when he did consulting work on group decision making for the air force and found that he didn’t have an adequate formalism for explaining their decision making process [4]. Lenat was at least partially motivated by formalizing Polya’s notion of a heuristic, since he thought informal lists of heuristics probably oversimplified the process of mathematical discovery and suffered from hindsight bias [5]. The idea was that translating our own heuristics into mathematically precise algorithms gives a more rigorous way of testing which strategies are capable of solving which problems.

But introspection has largely fallen out of favor in AI, at least the parts I’m familiar with (machine learning, vision, probabilistic inference). The field has a Behaviorist-like aversion to thinking about algorithms in terms of plausible mental representations and cognitive strategies. I can think of several plausible arguments against introspection:

- We already have rough outlines of a solution for most of the problems we’re interested in, and introspection won’t help us fill in any of the details.
- Computers are a different enough computational architecture from the brain that we can’t expect any insights to carry over. I.e., computers can perform billions of sequential operations per second (which often makes brute force a more viable strategy), whereas the brain has more effective associative memory and high-level representations.
- Introspection isn’t scientific, since different people will disagree. The explanations are likely just someone’s idealized story about what’s actually a much messier process. (E.g., cognitive scientist Jeff Shrager spent three years training to be a molecular biologist, and during this time, he couldn’t find any examples of the cross-domain analogies that cognitive scientists had touted as so fundamental to scientific discovery.)
- It’s hard to predict what range of problems an introspection-based strategy will generalize to.

As far as the first point, I think there are still plenty of hard problems for which we don’t even have the outlines of a solution. The remaining points seem hard to argue with, but they basically just show that introspection is a source of intuition rather than a rigorous way of evaluating algorithms. I’d say we have three major sources of intuition in our field:

- trying out models and algorithms in a variety of settings, both real-world applications and toy problems
- peer-reviewed research in cognitive science and neuroscience
- introspection

Neuroscience and cog sci both had a large influence on machine learning and computer vision in the early days, but as the field has matured, algorithms have made up a larger and larger part of our intuitions. Introspection is rarely mentioned.

In my own work on structure learning, my co-authors and I made several design decisions based on how human researchers would approach a problem. In my matrix decompositions work [6]:

- we often come up with structured probabilistic models by first fitting a simpler one to the data, seeing if there are dependencies in the learned representation which aren’t captured by the model, and adding those explicitly to the model
- if we’re trying to do inference in a complex probabilistic model, it often works to initialize it from a simpler one

And for Gaussian process structure search [7]:

- a good strategy for building up structured GP models is to try to model the residuals of the current model, and add the resulting kernel to the model
- when we use the above strategy, the previously learned kernel hyperparameters make a good initialization for the larger model

Of course, for the aspects of our algorithms that have already been well-explored (such as inference in particular models), we went with the standard approaches. There’s no point in thinking, “How would I try to factorize this matrix…” But since hardly anyone has explored how to structure model spaces in a way that’s searchable, these intuitions provided useful guidance in the design of our spaces of models.

Like any source of intuition, introspection is just a starting point. It’s not a rigorous scientific argument, and it ultimately needs to be backed up with math and experiments. What we really want is an *explanation* for the heuristics people use. In other words,

- figure out what regularities our heuristics are exploiting
- see if existing algorithms already exploit those same regularities
- if not, come up with one that does
- justify it theoretically and empirically

You can think of this as a sort of “inverse Bayesian optimization.” I’m currently working on taking the heuristics from my structure search papers and explaining mathematically why they work, so that we know in what set of situations they are appropriate. Hopefully this will result in a more general and more robust way to exploit the same structure.

[1] A. Newell, J. C. Shaw, and H. A. Simon. Report on a general problem-solving program. Tech report, 1958

[2] D. B. Lenat and J. S. Brown. Why AM and Eurisko appear to work. Artificial Intelligence, 1984.

[3] D. R. Hofstadter. *Fluid Concepts and Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought*. Basic Books, 1996.

[4] H. A. Simon. *Models of My Life*. MIT Press, 1996.

[5] D. B. Lenat. The Nature of Heuristics. Tech report, 1980.

[6] R. B. Grosse, R. Salakhutdinov, W. T. Freeman, and J. B. Tenenbaum. Exploiting compositionality to explore a large space of model structures. UAI, 2012.

[7] D. Duvenaud, J. R. Lloyd, R. B. Grosse, J. B. Tenenbaum, Z. Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. ICML, 2013.

Because I did not bother learning how to render equations on my web site, I have had to write without equations. While annoying at times, I think this may actually help with the clarity of the definitions because I cannot hide behind equations and pretend that they explain everything. The lack of equations also keeps things at a high-level picture, which is the goal of the glossary.

At the moment the glossary is not a Wiki, but perhaps I will move to that model in the future. For now, it is an experiment and I hope it is helpful to at least a few people. If you have any thoughts on how I could improve the glossary, I would be interested to hear them!
