Why we need intentionally out-of-domain benchmarks
2025-08-03
With Reinforcement Learning with Verifiable Rewards working very well and obliterating nearly all benchmarks (Grok 4 scores 50% on Humanity's Last Exam), there are two ways to get a good snapshot of a model:
- Evaluate on increasingly complex real-world tasks
Benchmarks such as SWE-bench, Aider Polyglot, and more generally τ-bench attempt to replicate real-world scenarios where the AI might be useful and evaluate models in those settings.
This approach captures current use cases well, but the bigger question we're interested in, as we attempt to understand, model, and predict AI progress, is the system's ability to generalize: how well can models perform on tasks they haven't been trained for?