Machine Learning that Doesn’t Matter

Michael Gelbart

At ICML last year, Kiri Wagstaff (KW) delivered a plenary talk and accompanying paper entitled “Machine Learning that Matters.” KW, a researcher at the NASA Jet Propulsion Laboratory (JPL), raises a number of serious issues in the field but draws conclusions that differ from my own.

KW criticizes existing benchmark data sets such as the UCI data sets or the MNIST handwritten digit data set for being irrelevant or obsolete. I certainly agree that being state-of-the-art on MNIST is not necessarily important (see my last post for more discussion on the need to carefully craft competitions based on benchmark data sets).

As an alternative to these seemingly arbitrary benchmark data sets, KW introduces six Machine Learning (ML) Impact Challenges, directly reproduced here:

  1. A law passed or legal decision made that relies on the result of an ML analysis.
  2. $100M saved through improved decision making provided by an ML system.
  3. A conflict between nations averted through high quality translation provided by an ML system.
  4. A 50% reduction in cybersecurity break-ins through ML defenses.
  5. A human life saved through a diagnosis or intervention recommended by an ML system.
  6. Improvement of 10% in one country’s Human Development Index (HDI) (Anand & Sen, 1994) attributable to an ML system.

While these are laudable goals, I do not believe they can replace benchmarks as we know them, because objectives need to be easy to evaluate. How many machine learning researchers, especially those early in our careers, will have the opportunity to evaluate how much money was saved through a decision made by our systems? Impact Challenge #1 is even more difficult: at best it can only be evaluated when a legal decision is made, the causal effect of the ML analysis would be hard to trace, and incremental progress (i.e., the law came closer to being passed because of the ML analysis) is not easily measured.

Thus, while I agree that certain benchmark data sets are overused and may have no bearing on real applications, I do not think the solution is to do away with them altogether and focus only on real-world impact. Real-world impact is important but messy, and may not be well suited to basic research in machine learning. This is partly a philosophical question of how we in academia see our roles in society, a question that may be viewed differently in government laboratories like JPL than in universities (and probably differently again in industrial research labs, some of which play a big role in ML research). Philosophy aside, however, I do not think it is sensible for researchers to be evaluated on the implementation of their methods in real-world settings.

In her paper, KW encourages systems that are useful for a diversity of applications, rather than tuned to a single data set. She uses the example of a hypothetical ML system that both increases profits at an auto-tire business and avoids unnecessary surgeries. Here I feel my role is clear: I do not work for an auto-tire business, and increasing auto-tire profits should not be a goal of my work. If a method is broadly useful, then an engineer at the auto-tire business, having read the paper, can implement it. More generally, I favor a middle ground between over-specializing for specific data sets and generalizing to both tires and surgery. For example, I believe the work by Torralba and Efros on data set bias in object recognition is important and valuable.

Finally, limiting our goals to real-world impact implicitly focuses on applications that already exist. History offers many examples of research breakthroughs that could not yet be applied because the relevant technologies had not been invented. Shannon worked out the mathematics of error-correcting codes before integrated circuits and lasers existed, so he could not have known he was solving the real-world challenge, “Make a DVD that plays a movie correctly even when scratched.” Had Shannon been evaluated by his impact on the home theater industry, perhaps he would not have done the great work that his freedom at Bell Labs allowed him to do.
