Artificial Intelligence medium · first-party

Nobody checked the magic prompt

More than a year after a viral prompt supposedly made OpenAI's o3 superhuman at pinpointing where a photo was taken, someone finally A/B-tested it against a one-line default. The elaborate version lost.

paper Sean Goedecke · 2 min read

In April 2025 a long, lovingly iterated prompt went around claiming to turn OpenAI's o3 into a savant at GeoGuessr — guessing a location from a single photo. People marveled at it for thirteen months. Nobody ran the obvious test.

"Models will happily make up stories for you about their own reasoning processes, and will almost always say 'yes, that helped.'" — Sean Goedecke

Sean Goedecke did. He fed roughly 200 photos to o3 twice: once with the famous engineered prompt, once with a bare 'where was this taken?' He scored each guess by how far off it was. The default won on every measure — closer median guess, closer average, more hits inside 100 kilometers. The fancy prompt didn't just fail to help; it slightly hurt. The whole experiment cost about fifteen dollars and an afternoon.
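For readers who want to see what "scored by how far off it was" could look like in practice, here is a minimal sketch. It is not Goedecke's code: the haversine great-circle formula, the function names, and the inclusive 100 km cutoff are all assumptions made for illustration, since the summary only says that guesses were scored by distance and compared on median, mean, and hits within 100 kilometers.

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def summarize(errors_km, hit_radius_km=100.0):
    """Median error, mean error, and share of guesses within hit_radius_km."""
    return {
        "median_km": median(errors_km),
        "mean_km": mean(errors_km),
        # Whether the original cutoff was inclusive is not stated; <= is assumed.
        "hit_rate": sum(e <= hit_radius_km for e in errors_km) / len(errors_km),
    }


def score(truths, guesses):
    """Score one prompt: truths and guesses are parallel lists of (lat, lon)."""
    errors = [haversine_km(*t, *g) for t, g in zip(truths, guesses)]
    return summarize(errors)
```

Running score once per prompt on the same photos gives two small dictionaries to compare side by side, which is the whole A/B test in miniature.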

The mechanism is the quietly damning part. When a model is already good at something, extra instructions ride along without adding signal — and the usual way people 'refine' a prompt is to ask the model whether a change helped. It almost always says yes. A result everyone admired turned out to rest on that flattery, never on a measurement.

The case isn't airtight, and Goedecke says so: there is no significance test, the gaps are modest, and the photos are public-domain images that o3 may have partly memorized rather than reasoned about. But the real finding sits above the numbers. It took one person, one afternoon, and pocket change to check a celebrated AI claim that more than a year of confident writing had never bothered to test.
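The missing significance test is cheap to sketch too. Because both prompts see the same photos, one reasonable option is a paired sign-flip permutation test on the per-photo difference in distance error. This is an assumption about how such a check might be done, not something the original experiment ran, and the function name and defaults below are hypothetical.

```python
import random
from statistics import mean


def paired_permutation_pvalue(errors_a, errors_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on the mean paired difference.

    errors_a and errors_b are distance errors for the same photos, in the
    same order, under prompt A and prompt B.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both prompts must be scored on the same photos")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the prompt labels are exchangeable,
        # so each paired difference is equally likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_resamples + 1)
```

A small p-value would suggest the gap between prompts is unlikely to be noise; a large one would mean the fair reading is "no detectable difference" rather than "the default is better". Neither outcome addresses the memorization caveat.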

The lenses

Novelty 2

Impact · breadth 3

Impact · depth 2

Actionable 2

Substance 3

Hype 2

The facts

The test: ~200 photos, scored by distance error: famous prompt vs. a one-line default

Result: Default beat the engineered prompt on median, mean, and hits within 100 km

Cost to settle it: About $15 and one afternoon, over a year after the prompt went viral

The caveat: No significance test; public-domain photos may have been in the model's training data

Concepts

AI benchmarks Frontier models Prompt engineering

Open seangoedecke.com →
