Artificial Intelligence medium · first-party

The benchmark ceiling

The model topping both the headline coding test and the PhD-level science test is one no one outside fifty vetted partners can use — and it beat the next two models by less than half a point.

demo Anthropic · 2 min read

Anthropic says its Claude Mythos Preview correctly answered 94.6% of GPQA Diamond, a set of 198 graduate-level biology, chemistry and physics questions on which PhD experts in the relevant field score about 65%. It scored 93.9% on SWE-bench Verified, the headline software-engineering test. Both results are at or near the top of the public leaderboards.

On GPQA Diamond, all frontier models now sit in the 91-95% range.

The scores are less a record than a wall. On the science test, Mythos beats Google's Gemini and Anthropic's own prior model by a fraction of a point; every frontier model now lands between 91% and 95%, while the human experts the test was built to challenge stay at about 65%. What matters is not who leads by a few tenths but that the leaders are bunched against the ceiling, where the test no longer separates them. On the coding side, a confounder blurs even the wall: an OpenAI audit of the hardest unsolved coding problems found that most had flawed test cases, so a high score may partly reward gaming the grader.
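A rough back-of-envelope sketch shows why a lead of less than half a point carries so little information on a 198-question test. It assumes each question is an independent pass/fail trial and uses the simple binomial standard error; the published 94.6% is not a whole number of questions out of 198, so it may be an average over several runs, which this sketch does not model. The function names are illustrative, not taken from any benchmark tooling.

```python
import math


def one_question_in_points(n_questions: int) -> float:
    """Percentage points gained or lost by a single question."""
    return 100.0 / n_questions


def binomial_standard_error_points(score_pct: float, n_questions: int) -> float:
    """Standard error of a pass rate, in percentage points.

    Assumption: questions are independent trials with the same
    success probability (a simplification, not the benchmark's method).
    """
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


N_QUESTIONS = 198   # GPQA Diamond size quoted in the article
TOP_SCORE = 94.6    # reported top score, in percent

step = one_question_in_points(N_QUESTIONS)
se = binomial_standard_error_points(TOP_SCORE, N_QUESTIONS)

print(f"One question is worth {step:.2f} points")
print(f"Standard error at {TOP_SCORE}% is about {se:.2f} points")
```

Under these assumptions one question is worth roughly half a point and the standard error is about 1.6 points, so a gap of a few tenths is smaller than the noise from a single run.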

The sharper fact is who holds the top line. Mythos is not for sale. Anthropic withheld it from general release because its ability to find software vulnerabilities on its own is judged too dangerous to ship, handing it instead to roughly fifty vetted defensive-security partners. The most capable model anyone has measured is the one no one can buy — and by June a successor had already crept past these figures, making April's high-water mark a footnote within two months.

When the best models cluster at the top of every test a fraction of a point apart, the tests stop being a useful way to rank them. The scores keep climbing; what they measure gets thinner.

The lenses

Novelty 2

Impact · breadth 2

Impact · depth 3

Actionable 1

Substance 2

Hype 3

The facts

Top scores: 94.6% on the PhD-level science test (humans ~65%); 93.9% on the headline coding test

The cluster: Beats the next two models by less than half a point; every frontier model now scores 91-95%

Availability: Not for sale; restricted to ~50 vetted defensive-security partners

Source: Figures from Anthropic's launch materials, not an inspectable benchmark report

Concepts

Frontier models · Autonomous vulnerability research · AI benchmarks

Open anthropic.com →


