Metrics drive the ML field. As such, if we want to influence the field or popularize new subfields, we must define the metrics that correlate with progress on the problems we care about. Formalizing these metrics into benchmarks will be crucial to capturing the attention of researchers and driving progress.
Building good benchmarks is difficult, in large part because benchmarks exhibit many of the properties that produce power law outcomes. First, implicit in every benchmark design are a large number of multiplicative processes: if even one facet of a benchmark is deficient (e.g., ease of use, cost to evaluate, connection to a real problem, tractability, resistance to gaming, feasibility of developing new methods that improve the state of the art), the deficiency may entirely prevent the benchmark from having impact. Second, benchmarks face strong preferential attachment dynamics: the most used benchmarks are the most likely to be used further. Finally, benchmarks are inherently located on the edge of chaos, since designing one requires the researcher to effectively concretize a nebulous notion into a single number.
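To make the multiplicative intuition concrete, here is a minimal sketch (the factor count and uniform weights are illustrative assumptions, not drawn from real benchmark data): when overall impact is the product of several independent quality factors, outcomes become highly skewed, and a single near-zero factor drags the whole product down.

```python
import math
import random

def simulated_impact(n_factors=6):
    """Toy model: a benchmark's impact is the product of several independent
    quality factors (ease of use, evaluation cost, relevance, ...).
    One factor near zero sinks the entire product."""
    return math.prod(random.uniform(0, 1) for _ in range(n_factors))

impacts = sorted((simulated_impact() for _ in range(100_000)), reverse=True)
top_1_percent = impacts[: len(impacts) // 100]
print(f"Top 1% of simulated benchmarks account for "
      f"{sum(top_1_percent) / sum(impacts):.0%} of total impact")
```

Even with every factor drawn from the same distribution, a small fraction of simulated benchmarks accounts for an outsized share of total impact, echoing the skewed outcomes described above.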
Perhaps the most important quality of a benchmark is clear evaluation. The following properties help achieve this:
Having too many barriers to entry can sink a benchmark. If researchers do not enjoy using a benchmark, this severely diminishes its usefulness, even if the benchmark pertains to something very important. We present some pointers for removing barriers to entry:
The most difficult part of designing a benchmark is concretizing a nebulous idea into a metric. Doing so first requires thinking of an idea to test, which is difficult in itself; figuring out how to concretize it is more difficult still. The aim is to create a microcosm with good properties: one that is useful for the broader problem yet simple enough to be concretized and improved upon. The task requires foresight: where will the machine learning field go next? What is becoming possible that previously wasn’t?
Existing successful benchmarks are likely to have useful features, even if it’s not readily apparent why they’re useful (perhaps even to the creators of the benchmark). If you don’t know why a benchmark is evaluated in a certain way, do not assume there is no good reason. Successful benchmarks have stood the test of time, and so should be emulated to the extent possible. To give one example, the high number of categories used in ImageNet turned out to be important for making performance less gameable, though this was not obvious until a decade after its original release.
Being ready to design a benchmark requires understanding the community that the benchmark is being designed for. This likely means doing research in the area, talking to others doing research, and absorbing what it’s like to use a benchmark. It’s not wise to swoop into a field without knowing anything about how researchers in that field approach their work. Set aside a large chunk of time to humble yourself so you can more effectively serve the community you aim to mobilize. An additional benefit is that listening to researchers’ intuitions about a given problem might give ideas for how to concretize it. One way of doing this is to collect their phrases for “what they mean” by a particular fuzzy concept and put them into a word cloud. When designing a benchmark, ask: “What are they trying to get at? How does this benchmark idea address a large share of the words and phrases in the word cloud?”
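For the word cloud exercise, a minimal sketch might look like the following (the survey question, `responses` list, and stopword set are hypothetical, purely for illustration); a word-cloud or plotting library could then render the counts visually.

```python
from collections import Counter
import re

# Hypothetical free-text answers to "What do you mean by 'robustness'?"
responses = [
    "performance under distribution shift and corrupted inputs",
    "resistance to adversarial perturbations",
    "graceful degradation under distribution shift",
]

# Count the words researchers use so the most common interpretations stand out.
stopwords = {"and", "under", "to", "the", "of"}
word_counts = Counter(
    word
    for response in responses
    for word in re.findall(r"[a-z]+", response.lower())
    if word not in stopwords
)
print(word_counts.most_common(5))
```

In this made-up example, a benchmark idea that touches “distribution,” “shift,” “adversarial,” and “corrupted” would cover a larger share of the cloud than one addressing only a single phrase.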
Designing benchmarks means constantly having nebulous goals for new benchmarks swirling around while doing research. Most ideas will likely not be good, or won’t yet be ready to become a benchmark, so pursuing the first benchmark idea one comes up with is not a good strategy. Having a number of benchmark ideas in mind is useful in case a new advancement suddenly makes one of them feasible. When building a benchmark, temper your expectations: from an outside view, designing a good metric is quite difficult; if it were easy, it would be easy to write highly impactful papers.
The internet has a vast amount of data that can be collected. If it appears necessary to spend a large amount of money (millions of dollars) on a benchmark, it makes sense to spend more time scouring the internet first (note that this is different for applications projects, such as self-driving cars). If there is nothing relevant on the internet, this may indicate that the idea is not that good or that the data you seek isn’t useful, since it has not been interesting enough to appear anywhere online. Note that once a good idea is found, generating the dataset is usually mostly grunt work; the majority of the intellectual difficulty of benchmark design lies in finding an idea that will work well as a benchmark, not in collecting the data. These two are often conflated, leading researchers to incorrectly assume they can easily create something useful enough to mobilize the community.
One useful heuristic for designing benchmarks is to include many sources of variability. For instance, including multiple types of adversarial attacks, multiple environments, different writers, and so on is very useful because it allows progress along a number of dimensions at once. Beware of believing that there are more dimensions than there are: for instance, procedurally generated data may appear to have many dimensions (“we have infinitely many attacks!”), but if there is a simple-to-describe generating process for the data, there are not many dimensions to it. Adding a random number generator to choose the coefficient for a particular piece of data only adds a single new dimension, not infinitely many. For this reason, data on the internet can be quite useful, since it is often generated by many unique generating processes that add additional structural complexity (rather than merely increasing entropy). Concretely, a mathematics benchmark with handwritten word problems has many more structures and sources of variability than algorithmically generated mathematics problems; even though the latter can yield infinitely many distinct problems, the underlying data-generating process is much less complex. Aim for multiple sources of variability, ideas, or structures to make the optimized metric less gameable and more likely to capture real-world properties.
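As a sketch of this point (the template and problems below are made up for illustration, not drawn from any real benchmark), a procedural generator can emit unboundedly many distinct problems while its underlying generating process stays trivially simple, whereas each handwritten problem carries its own structure.

```python
import random

def procedural_problem():
    """'Infinitely many' problems from a single template: despite unbounded
    surface variety, the generating process has one template and two sampled
    coefficients, so a model can learn the template rather than general
    mathematical reasoning."""
    a, b = random.randint(2, 99), random.randint(2, 99)
    return f"What is {a} times {b}?", a * b

# Contrast: each handwritten problem comes from a different structure
# (rates, geometry, number theory), not one shared template.
handwritten = [
    "A train travels 60 miles in 1.5 hours. What is its average speed?",
    "What is the area of a triangle with base 10 and height 4?",
    "What is the smallest prime greater than 90?",
]

print(procedural_problem())
```

Counting distinct outputs overstates the variability; what matters is the number of distinct structures in the generating process.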
Lastly, after developing a benchmark, it’s necessary to maintain the area to make sure that researchers are using it correctly. Any benchmark is susceptible to being gamed (some more than others): researchers can create methods that exploit some peculiarity of the benchmark rather than make progress on the overall problem. Peer review can be a good safeguard against this, as reviewers might recognize that an approach is gaming the benchmark. However, sometimes reviewers don’t know any better, which is why it’s often necessary to reduce this effect by producing high-quality research on the benchmark yourself. Of course, benchmarks may eventually be solved, but this doesn’t mean that the underlying problem is solved; in such cases, new benchmarks are necessary to expose the gaps in the original benchmark.
In summary, successful benchmarks are not driven simply by an army of dataset annotators; they are driven by intellectual humility, attentiveness to numerous (often implicit) factors, and original ideas.
Dan Hendrycks is the Executive Director of the Center for AI Safety. Dan contributed the GELU activation function, the main baseline for OOD detection, and benchmarks for robustness (ImageNet-C) and large language models (MMLU, MATH).
Thomas Woodside contributed to drafting this post in 2022 when he was CAIS’s first employee. He now works at an AI policy think tank in Washington, DC. His website is here.