Mentat · curated
Artificial Intelligence medium · independent

The forecast that won't update

A new benchmark feeds models real news in the order it actually arrived and asks them to keep revising their predictions — and finds they mostly cling to their first guess.

Most forecasting tests freeze a model at one moment and grade a single prediction. FutureSim instead replays the world. Starting from a fixed cutoff in late December, it hands a model real news articles in the order they were published and asks it to keep updating forecasts on 330 questions that resolve over the following three months — the Super Bowl, the Grammys, a UK district election — with no live internet, only the dated feed it is given.
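The replay idea can be sketched in a few lines. This is a minimal illustration, not FutureSim's actual harness: the `Article` type, the `replay` function and the forecaster signature are all hypothetical names invented here. The one property it tries to capture is that the model only ever sees articles dated up to the current step, in publication order.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Tuple


@dataclass
class Article:
    published: date
    text: str


# Hypothetical forecaster: takes a question and the articles seen so far,
# returns a probability. A real agent would call a model here.
Forecaster = Callable[[str, List[Article]], float]


def replay(articles: List[Article], question: str,
           forecaster: Forecaster) -> List[Tuple[date, float]]:
    """Feed articles in publication order; record the forecast after each one."""
    seen: List[Article] = []
    history: List[Tuple[date, float]] = []
    for article in sorted(articles, key=lambda a: a.published):
        seen.append(article)
        # The forecaster gets a copy of the dated feed so far and nothing else.
        history.append((article.published, forecaster(question, list(seen))))
    return history
```

A forecaster that ignores `seen` and returns the same number every time would produce a flat history, which is the kind of behaviour a replay like this is meant to expose.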

The authors call their own numbers 'a lower-bound on agent performance.'

The point is to test whether a model changes its mind as events unfold. Mostly it doesn't. The paper finds 'marked inertia': models anchor to their opening prediction and barely move even when later articles plainly contradict it. The scores follow. On a calibration measure where zero means you'd have done just as well refusing to predict, every open-weight model came in negative — worse than abstaining. Only one model, OpenAI's GPT-5.5, cleared zero at all, finishing first with 25 percent of its top guesses correct.
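To make "zero means abstaining" concrete, here is one common way to build a score with that property: a Brier skill score measured against a uniform, no-information forecast. This is an assumption for illustration; the summary above does not specify the paper's exact metric. The `total_movement` helper is likewise a hypothetical, crude way to quantify how far a forecast moved over time.

```python
from typing import Dict, List


def brier(probs: Dict[str, float], outcome: str) -> float:
    """Multi-option Brier score: squared error summed over all options."""
    return sum((p - (1.0 if option == outcome else 0.0)) ** 2
               for option, p in probs.items())


def skill(probs: Dict[str, float], outcome: str) -> float:
    """Skill relative to a uniform forecast (assumed stand-in for abstaining).

    1.0 is a perfect forecast, 0.0 matches the uniform baseline,
    and negative values are worse than the baseline.
    """
    uniform = {option: 1.0 / len(probs) for option in probs}
    return 1.0 - brier(probs, outcome) / brier(uniform, outcome)


def total_movement(forecasts: List[float]) -> float:
    """Sum of absolute changes between consecutive forecasts."""
    return sum(abs(b - a) for a, b in zip(forecasts, forecasts[1:]))
```

Under this construction, a confident forecast that turns out wrong scores well below zero, and a model that anchors to its first guess shows a `total_movement` close to zero regardless of what the news says.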

That 25 percent is the optimistic read of a discouraging result, and the bar matters more than the number. Against a real money-weighted crowd, GPT-5.5 edged ahead of the Polymarket aggregate on the Super Bowl question but trailed it on the others. The failure the replay was built to catch is the one that lands: the hard part of forecasting isn't the first guess, it's noticing when the world has moved — and that is exactly what the models don't do.

The lenses

Novelty 3
Impact · breadth 2
Impact · depth 3
Actionable 4
Substance 5
Hype 1

The facts

What it is: A public benchmark and paper from an academic team (arXiv 2605.15188); methods and questions are open to inspect
The result: GPT-5.5 finished first at 25% accuracy and was the only model to beat the do-nothing baseline; open-weight models all scored worse than not predicting
The catch: Models 'anchor strongly to initial predictions' and don't revise as new events arrive, and the test omits some current frontier models
