Life After Benchmark Saturation: A Case Study of CORE-Bench

LLMs, 2026-06-27

Submitted on 23 Jun 2026

Authors: Nitya Nadgir, Sayash Kapoor, Kangheng Liu, Peter Kirgis, Matilda Orona, Stephan Rabanser, Tilman Bayer, Abhishek Shetty, Yue Ling, Derrick Chan-Sew, Rumi Nakagawa, Saiteja Utpala, Zachary S. Siegel, Arvind Narayanan

Subject: Artificial Intelligence (cs.AI)

Abstract

When a benchmark’s accuracy saturates, it is typically retired and replaced with a more challenging version. We argue that this accuracy-centric approach misses critical opportunities to evaluate six other dimensions of agent performance: construct validity issues (such as shortcuts), out-of-distribution generalizability, efficiency, reliability, the relative contributions of the model versus the scaffold, and the uplift from human-agent collaboration. Using CORE-Bench Hard—a benchmark designed to test computational reproducibility of scientific code—we demonstrate that measuring agents along these dimensions yields meaningful insights even after accuracy has plateaued.

First, we identify threats to construct validity in CORE-Bench Hard that are difficult to anticipate with less capable agents. We introduce an improved version, CORE-Bench v1.1, along with an out-of-distribution task suite, CORE-Bench OOD. Second, despite accuracy saturation, we find that CORE-Bench v1.1 remains valuable for assessing efficiency, reliability, model performance, and scaffold performance. Finally, we conduct a small-scale randomized experiment to measure the uplift from human-agent collaboration on real-world computational reproducibility tasks. We observe a statistically significant speedup of approximately two-fold—likely an underestimate, as one-fifth of human-only reproductions reached the time limit before completion—and report various other findings. Together, our contributions present a more rigorous alternative to the dominant accuracy-centric evaluation paradigm.
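The claim that the measured speedup is likely an underestimate follows from right-censoring: when a human-only attempt is stopped at the time limit, its recorded time is shorter than the time it would actually have taken, which shrinks the observed ratio. The sketch below illustrates this with made-up numbers. The completion times, the 180-minute limit, and the choice of a ratio of means as the speedup statistic are all assumptions for illustration, not values or methods taken from the paper.

```python
from statistics import mean


def speedup(human_minutes, assisted_minutes, time_limit=None):
    """Ratio of mean human-only time to mean human+agent time.

    If time_limit is given, every recorded time is capped at it, mimicking
    a study that stops the clock when a participant hits the limit.
    """
    if time_limit is not None:
        human_minutes = [min(t, time_limit) for t in human_minutes]
        assisted_minutes = [min(t, time_limit) for t in assisted_minutes]
    return mean(human_minutes) / mean(assisted_minutes)


# Hypothetical completion times in minutes (NOT data from the paper).
# One of the five human-only attempts would run past a 180-minute limit.
human_only = [60, 90, 120, 150, 300]
with_agent = [30, 45, 60, 75, 90]

true_ratio = speedup(human_only, with_agent)                       # no cap
observed_ratio = speedup(human_only, with_agent, time_limit=180)   # capped

print(f"uncensored speedup: {true_ratio:.2f}")
print(f"observed speedup with time limit: {observed_ratio:.2f}")
```

In this toy setup the capped comparison reports a smaller speedup than the uncapped one, because only the slower (human-only) arm loses time to the cap. A survival-analysis treatment of censored times would be the more careful approach; the ratio of means is used here only to keep the arithmetic visible.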

Keywords: AI benchmark saturation, agent performance evaluation, CORE-Bench, computational reproducibility, human-AI collaboration, construct validity, out-of-distribution generalizability, AI efficiency and reliability

via ArXiv AI
