Why we need intentionally out-of-domain benchmarks

Reinforcement Learning with Verifiable Rewards (RLVR) is working so well that models are obliterating nearly every benchmark (Grok 4 scores 50% on Humanity's Last Exam). To get a good snapshot of a model, we can go two ways:

  1. Evaluate on increasingly complex real-world tasks

Benchmarks such as SWE-bench, Aider Polyglot, and, more broadly, τ-bench attempt to replicate real-world scenarios where an AI might be useful and evaluate models there.

This approach captures current use cases well, but the bigger question, the one we're interested in as we attempt to understand, model, and predict AI progress, is the system's ability to generalize. How well can models perform on tasks they haven't been trained for?

  2. Evaluate on tasks so out of domain that nobody would think to include them in the training data

"my bar for agi is an ai that can learn to run a gas station for a year without a team of scientists collecting the Gas Station Dataset" -- roon (@tszzl) on Twitter

Models, just like humans, perform much better on tasks they could study and practice for. As roon hints, the labs build RL environments and add datasets for whatever they identify as their models' weak points, specifically to teach them the related abilities.
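
To make the RLVR idea concrete, here is a minimal sketch of what such an environment can look like. Everything here is illustrative: the names (`MathEnv`, `PROBLEMS`) are hypothetical, and real lab environments use far richer verifiers (unit tests, proof checkers, simulators) rather than exact string matching.

```python
# Illustrative sketch of a "verifiable reward" RL environment.
# The defining property of RLVR: the reward comes from a programmatic
# verifier, not from a learned reward model or human preference labels.

import random

PROBLEMS = [
    {"prompt": "What is 17 * 24?", "answer": "408"},
    {"prompt": "What is 913 - 376?", "answer": "537"},
]

class MathEnv:
    """Single-step environment: one prompt in, one graded completion out."""

    def reset(self) -> str:
        self.problem = random.choice(PROBLEMS)
        return self.problem["prompt"]

    def step(self, completion: str) -> float:
        # Verifiable reward: exact match against known ground truth.
        return 1.0 if completion.strip() == self.problem["answer"] else 0.0
```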

Examples of such benchmarks are few and far between, but they include:

  • AidanBench, where the model is asked to keep generating answers to a given question, each dissimilar from the ones before, and is scored on how many distinct answers it can come up with (see the sketch after this list).
  • MC-Bench, where models compete to build elaborate structures in Minecraft based on a prompt.
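
A rough sketch of an AidanBench-style scoring loop, under simplifying assumptions: the real benchmark also checks answer coherence and uses particular models and thresholds, and `generate` and `embed` here are hypothetical stand-ins for an LLM call and an embedding call.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_question(question: str, generate, embed,
                   sim_threshold: float = 0.8, max_answers: int = 100) -> int:
    """Count how many sufficiently dissimilar answers the model produces."""
    answers, embeddings = [], []
    while len(answers) < max_answers:
        answer = generate(question, previous=answers)
        emb = embed(answer)
        # Stop once the new answer is too similar to any previous one.
        if any(cosine_sim(emb, prev) > sim_threshold for prev in embeddings):
            break
        answers.append(answer)
        embeddings.append(emb)
    return len(answers)
```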

These abilities are obviously useless in a model by themselves (AidanBench is at best a rough proxy for creative ability), which is the point: we're out of domain.

Yet generalization ability is not only useful for prediction: how robust is a model to prompt changes, to exotic tool harnesses, to