▸ Concept also: AI evaluation, model evaluation, benchmark saturation

AI benchmarks

Standardised tests used to measure what AI models can do — and the main proxy for tracking whether the field is making progress.

Leads to

Forecasting →

In a nutshell

A benchmark is a fixed dataset of tasks with known correct answers: a model runs it, a number comes out, and labs use that number to compare models and claim progress. The hard part is the gap between the number and real capability. A model can improve on a benchmark by seeing similar questions during training (contamination), by being tuned to the test format, or because the tasks were too narrow to begin with. When every frontier model clusters near the ceiling, the benchmark stops discriminating — which is what "saturated" means. New benchmarks get harder, the cycle repeats, and the field debates whether any of it tracks what we actually care about.
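As a rough illustration of "a model runs it, a number comes out", the sketch below scores two sets of model outputs against a fixed answer key and then applies a simple saturation check. The function names, the toy data, and the thresholds (`ceiling=0.95`, `max_spread=0.02`) are assumptions made up for this example; no benchmark defines saturation this way, and real suites use more careful answer matching than exact string equality.

```python
def accuracy(predictions, answers):
    """Fraction of tasks where the model's output exactly matches the known answer."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("need one prediction per task, and at least one task")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def is_saturated(scores, ceiling=0.95, max_spread=0.02):
    """Hypothetical rule of thumb: every model scores at or above `ceiling`
    and the gap between best and worst is at most `max_spread`, so the
    benchmark no longer separates the models."""
    return min(scores) >= ceiling and (max(scores) - min(scores)) <= max_spread


# Toy benchmark: a fixed answer key and two invented models' outputs.
answers = ["4", "Paris", "H2O", "7"]
model_a = ["4", "Paris", "H2O", "7"]
model_b = ["4", "Paris", "H2O", "8"]

score_a = accuracy(model_a, answers)  # all four correct
score_b = accuracy(model_b, answers)  # three of four correct

print(score_a, score_b)
print(is_saturated([0.97, 0.98, 0.975]))  # scores clustered near the ceiling
print(is_saturated([0.97, 0.80]))         # scores still spread apart
```

The single number hides everything the paragraph warns about: nothing in `accuracy` can tell whether a correct answer came from capability, from contamination, or from tuning to the test format.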

Where it came from

Year: 1998

Source: MNIST (LeCun et al.) established the template; modern NLP benchmarks descend from GLUE (Wang et al., 2018) and SuperGLUE (2019).

Why it mattered: The pattern (a fixed held-out test set and a single accuracy number) is older than deep learning; MNIST is the canonical early instance.

In megatrends

Artificial Intelligence

Models, agents, and AI–human collaboration — general-purpose capability scaling into every domain.

Finds citing this concept

China takes the video-model lead

Nano Banana 2

The forecast that won't update

autoresearch

The benchmark that caught itself

The model that tuned itself

The forecasting gap, in Brier points

Nobody checked the magic prompt

The catalyst claim that grew in transit

The benchmark ceiling
