Concept (also: AI evaluation, model evaluation, benchmark saturation)

AI benchmarks

Standardised tests used to measure what AI models can do — and the main proxy for tracking whether the field is making progress.

In a nutshell

A benchmark is a fixed dataset of tasks with known correct answers: a model runs it, a number comes out, and labs use that number to compare models and claim progress. The hard part is the gap between the number and real capability. A model can improve on a benchmark by seeing similar questions during training (contamination), by being tuned to the test format, or because the tasks were too narrow to begin with. When every frontier model clusters near the ceiling, the benchmark stops discriminating — which is what "saturated" means. New benchmarks get harder, the cycle repeats, and the field debates whether any of it tracks what we actually care about.
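The scoring loop and the saturation check described above can be sketched in a few lines. This is a minimal illustration, not any lab's actual harness: the toy task list, the two lookup-table "models", the exact-match scoring rule, and the 0.05 saturation margin are all assumptions made for the example.

```python
from typing import Callable, Sequence

# A benchmark here is a fixed list of (question, reference answer) pairs.
Task = tuple[str, str]


def accuracy(model: Callable[[str], str], tasks: Sequence[Task]) -> float:
    """Fraction of tasks where the model's answer exactly matches the reference."""
    if not tasks:
        raise ValueError("benchmark has no tasks")
    correct = sum(1 for question, answer in tasks if model(question) == answer)
    return correct / len(tasks)


def is_saturated(scores: Sequence[float], ceiling: float = 1.0,
                 margin: float = 0.05) -> bool:
    """True when every score sits within `margin` of the ceiling.

    The margin is an arbitrary illustrative threshold, not a standard value.
    """
    return bool(scores) and all(ceiling - s <= margin for s in scores)


# Hypothetical toy benchmark and two stand-in "models" (plain lookup tables).
TASKS: list[Task] = [("2+2", "4"), ("3*3", "9"), ("10-7", "3"), ("8/2", "4")]

_answers_a = {"2+2": "4", "3*3": "9", "10-7": "3", "8/2": "4"}
_answers_b = {"2+2": "4", "3*3": "9", "10-7": "3", "8/2": "5"}


def model_a(question: str) -> str:
    return _answers_a.get(question, "")


def model_b(question: str) -> str:
    return _answers_b.get(question, "")


scores = [accuracy(m, TASKS) for m in (model_a, model_b)]
print(scores)                # per-model accuracy on the fixed test set
print(is_saturated(scores))  # whether the benchmark still separates the models
```

Note what the sketch cannot show: a lookup table that has memorised the test set scores perfectly, which is exactly the contamination problem. The number alone does not say how it was earned.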

Where it came from

Year: 1998
Source: MNIST (LeCun et al.) established the template; modern NLP benchmarks descend from GLUE (Wang et al., 2018) and SuperGLUE (2019).
Why it mattered: The pattern — fixed held-out test set, single accuracy number — is older than deep learning; MNIST is the canonical early instance.
