Artificial Intelligence medium · independent

The forecast that won't update

A new benchmark feeds models real news in the order it actually arrived and asks them to keep revising their predictions — and finds they mostly cling to their first guess.

paper ELLIS Institute Tübingen / Max Planck Institute for Intelligent Systems · 2 min read

Most forecasting tests freeze a model at one moment and grade a single prediction. FutureSim instead replays the world. Starting from a fixed cutoff in late December, it hands a model real news articles in the order they were published and asks it to keep updating forecasts on 330 questions that resolve over the following three months — the Super Bowl, the Grammys, a UK district election — with no live internet, only the dated feed it is given.
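To make the replay idea concrete, here is a minimal sketch of how a dated feed could be stepped through in publication order. It is not the benchmark's code: the `Article` class, the `replay` function, the two toy forecasters, and the dates and headlines are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable


@dataclass(frozen=True)
class Article:
    published: date
    text: str


def replay(
    articles: Iterable[Article],
    update: Callable[[float, Article], float],
    prior: float,
) -> list[tuple[date, float]]:
    """Feed articles in publication order and record the forecast after each one."""
    forecast = prior
    history = []
    for article in sorted(articles, key=lambda a: a.published):
        # The forecaster only ever sees the current article and its own last forecast.
        forecast = update(forecast, article)
        history.append((article.published, forecast))
    return history


def anchored(forecast: float, article: Article) -> float:
    """Hypothetical forecaster that never revises its opening guess."""
    return forecast


def responsive(forecast: float, article: Article) -> float:
    """Hypothetical forecaster that moves halfway toward a crude keyword signal."""
    target = 0.9 if "wins" in article.text else 0.1
    return forecast + 0.5 * (target - forecast)


# Made-up feed, deliberately out of order to show that replay sorts by date.
feed = [
    Article(date(2026, 1, 20), "Team A loses star player"),
    Article(date(2026, 1, 5), "Team A wins opener"),
]

print([round(f, 2) for _, f in replay(feed, anchored, 0.5)])    # [0.5, 0.5]
print([round(f, 2) for _, f in replay(feed, responsive, 0.5)])  # [0.7, 0.4]
```

The `anchored` forecaster is a caricature of the inertia the paper describes: its forecast history is flat no matter what the feed says.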

The authors call their own numbers 'a lower-bound on agent performance.'

The point is to test whether a model changes its mind as events unfold. Mostly it doesn't. The paper finds 'marked inertia': models anchor to their opening prediction and barely move even when later articles plainly contradict it. The scores follow. On a calibration measure where zero means you'd have done just as well refusing to predict, every open-weight model came in negative — worse than abstaining. Only one model, OpenAI's GPT-5.5, cleared zero at all, finishing first with 25 percent of its top guesses correct.
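This summary does not name the paper's exact metric. One common way to build a score where zero means "no better than abstaining" is a Brier skill score measured against a uniform guess. The sketch below assumes that construction; the function names and the example probabilities are illustrative, not taken from the paper.

```python
def brier(probs: dict[str, float], outcome: str) -> float:
    """Multi-class Brier score: squared error against the true outcome (lower is better)."""
    return sum(
        (p - (1.0 if option == outcome else 0.0)) ** 2
        for option, p in probs.items()
    )


def skill_vs_uniform(probs: dict[str, float], outcome: str) -> float:
    """Skill relative to spreading probability evenly across all options.

    Assumption: "refusing to predict" is modeled as the uniform guess.
    Returns 0 for the uniform guess, positive if better, negative if worse.
    Requires at least two options so the baseline score is nonzero.
    """
    uniform = {option: 1.0 / len(probs) for option in probs}
    return 1.0 - brier(probs, outcome) / brier(uniform, outcome)


# Invented three-way question with a confident forecast on option "A".
forecast = {"A": 0.7, "B": 0.2, "C": 0.1}

print(round(skill_vs_uniform(forecast, "A"), 2))  # 0.79  (confident and right)
print(round(skill_vs_uniform(forecast, "C"), 2))  # -1.01 (confident and wrong)
```

Under this kind of scoring, a forecaster that stays confidently wrong after contrary news is penalized more than one that hedges, which is how a model that never updates can end up below zero.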

That 25 percent is the optimistic read of a discouraging result, and the bar matters more than the number. Against a real, money-weighted crowd, GPT-5.5 edged ahead of the Polymarket aggregate on the Super Bowl question but trailed it on the others. The failure the replay was built to catch is the one that lands: the hard part of forecasting isn't the first guess, it's noticing when the world has moved, and that is exactly what the models don't do.

The lenses

Novelty 3

Impact · breadth 2

Impact · depth 3

Actionable 4

Substance 5

Hype 1

The facts

What it is: A public benchmark and paper from an academic team (arXiv 2605.15188); methods and questions are open to inspect

The result: GPT-5.5 finished first at 25% accuracy and was the only model to beat the do-nothing baseline; open-weight models all scored worse than not predicting

The catch: Models 'anchor strongly to initial predictions' and don't revise as new events arrive, and the test omits some current frontier models

Concepts

Forecasting · AI benchmarks

