What AI researchers can learn from the NFL Combine

How useful is your AI? This is not a simple question to answer. If you’re trying to decide between Google’s translation service and Microsoft’s, how do you know which is better?

If you’re an AI developer, chances are you think the answer is: benchmarks. But that’s not the whole story.

The big idea

Benchmarks are necessary, important and useful in the context of their own domain. If you’re trying to train an AI model to distinguish between cats and dogs in images, for example, it’s very useful to know how well it performs that task.

But since we literally can’t take our model out and use it to scan every image of a cat or dog that ever existed or ever will exist, we have to somehow guess how good it will be at its job.

To do this, we use a benchmark. Basically, we take a bunch of photos of cats and dogs and label them correctly. Then we hide the labels from the AI and ask it to tell us what’s in each image.

If it scores 9 out of 10, it is 90% accurate. If we think 90% accuracy is good enough, we can call our model successful. Otherwise, we keep training and tweaking.
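As a rough sketch of that loop, a toy benchmark might look like the Python below. The filenames, labels and stand-in model are invented purely for illustration; a real benchmark would plug in an actual classifier and a much larger held-out set.

```python
# Toy sketch of a benchmark: hold out labeled images, hide the labels from
# the model, and score the fraction it gets right. The "model" here is a
# fake stand-in, not a real classifier.

def fake_model(image_name: str) -> str:
    """Pretend classifier: predicts 'dog' if the filename mentions it, else 'cat'."""
    return "dog" if "dog" in image_name else "cat"

def benchmark(model, labeled_samples) -> float:
    """Accuracy = fraction of held-out (input, label) pairs the model labels correctly."""
    correct = sum(1 for image, label in labeled_samples if model(image) == label)
    return correct / len(labeled_samples)

# Ten held-out images with ground-truth labels. The last one is a dog photo
# whose filename the fake model misreads, so it scores 9 out of 10.
test_set = (
    [(f"cat_{i}.jpg", "cat") for i in range(5)]
    + [(f"dog_{i}.jpg", "dog") for i in range(4)]
    + [("pup_blurry.jpg", "dog")]
)

print(f"Accuracy: {benchmark(fake_model, test_set):.0%}")  # Accuracy: 90%
```

The scoring logic really is this simple; what makes a benchmark useful or misleading is everything around it, above all how well the held-out set reflects the data the model will actually see.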

The big problem

How much money would you pay for an AI that could tell a cat from a dog? A billion dollars? Half a nickel? Nothing? Probably nothing. This wouldn’t be very useful outside of benchmark rankings.

However, an AI capable of labeling all objects in a given image would be very useful.

But there is no “universal benchmark” for labeling objects. We can only guess how good such an AI would be at its job. Just as we don’t have access to every image of a cat or a dog in existence, neither can we label everything that could ever exist as an image.

And that means any benchmark measuring an AI’s ability to label images is arbitrary.

Is an AI that is 43% accurate at tagging images from a billion categories better or worse than an AI that is 71% accurate at tagging images from 28 million categories? Do the categories themselves matter?

Ben Dickson of BD Tech Talks put it best in a recent article:

The focus on benchmark performance has drawn a lot of attention to machine learning at the expense of other promising research directions. With the increasing availability of data and computational resources, many researchers find it easier to train very large neural networks on huge datasets to push the needle on a well-known benchmark rather than experimenting with alternative approaches.

We build AI systems that do very well in tests, but they often fail to perform well in the real world.

The big solution

It turns out that guessing performance at scale is not a problem unique to the world of AI. In 1982, National Football Scouting Inc. held the first NFL Combine to address the problem of “busts” – players who don’t perform as well as expected.

In the pre-internet era, the only way to assess players was in person, and the travel costs required to scout hundreds or thousands of players throughout the year became too much. The Combine gave NFL scouts a single place and time to gather and judge player performance.

Not only did this save time and money, but it also set a universal benchmark. When a team wanted to trade or release a player, the other teams could refer to their “benchmark” performance at the combine.

Of course, there are no guarantees in sports. But, essentially, the Combine puts players through a series of drills that are specifically relevant to the sport of football.

However, the Combine is only a small part of the scouting process. In the modern era, teams hold private workouts to determine whether a prospect is a fit for an organization’s specific system.

Another way of saying this: NFL teams use a prospect’s Combine results the way AI developers use a model’s benchmarks – as a general predictor of performance – but they also perform rigorous external checks to determine usefulness in a specific domain.

A player can blow your mind at the Combine, but if he fails to impress in individual practices, chances are he won’t make the squad.

Ideally, benchmarking in the world of AI would simply represent the first cycle of rigor.
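In code terms, that two-cycle idea might look something like the sketch below: a hypothetical gate with made-up thresholds and helper names, not a standard API or anyone’s published method.

```python
# Hypothetical two-stage evaluation mirroring the Combine analogy:
# a public benchmark is the first cycle of rigor; a domain-specific
# test set drawn from your own use case is the second. Thresholds are arbitrary.

def accuracy(model, samples) -> float:
    """Fraction of (input, label) pairs the model gets right."""
    return sum(model(x) == y for x, y in samples) / len(samples)

def evaluate(model, general_benchmark, domain_test_set,
             general_floor=0.85, domain_floor=0.95):
    """Accept a model only if it clears the general benchmark AND a
    stricter check on the task it will actually be deployed for."""
    general_score = accuracy(model, general_benchmark)    # the "Combine"
    if general_score < general_floor:
        return False, f"failed general benchmark ({general_score:.0%})"

    domain_score = accuracy(model, domain_test_set)       # the "private workout"
    if domain_score < domain_floor:
        return False, f"failed domain-specific check ({domain_score:.0%})"

    return True, f"passed both ({general_score:.0%} / {domain_score:.0%})"
```

The specific numbers are invented; the point is the ordering. The public benchmark filters candidates, and the domain-specific check decides whether the model actually belongs on your roster.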

As a team of researchers from UC Berkeley, University of Washington, and Google Research recently wrote:

Benchmarking, deployed appropriately, is not about winning a competition, but rather about studying a landscape – the more we can appropriately reframe, contextualize and delineate these datasets, the more useful they will become as an informative dimension for more impactful algorithmic development and alternative evaluation methods.

James G. Williams