The benchmark ceiling
The model topping both the headline coding test and the PhD-level science test is one that no one outside roughly fifty vetted partners can use, and on the science test it beat the next two models by less than half a point.
Anthropic says its Claude Mythos Preview correctly answered 94.6% of the questions on GPQA Diamond, a set of 198 graduate-level biology, chemistry and physics questions on which PhD experts in the relevant field score about 65%. It scored 93.9% on SWE-bench Verified, the headline software-engineering test. Both results are at or near the top of the public leaderboards.
The scores are less a record than a wall. On the science test, Mythos beats Google's Gemini and Anthropic's own prior model by a fraction of a point; every frontier model now lands between 91% and 95%, while the human experts the test was built to stump stay at about 65%. What matters isn't who is on top by a few tenths. It's that the leaders are now bunched against the ceiling, and the test no longer separates them. A confounder makes even the wall blurry: an OpenAI audit of the hardest unsolved coding problems found most had flawed test cases, so a high score may partly reward gaming the grader.
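To see why a sub-point lead says so little, it helps to put the margin next to the size of the test. The sketch below is a back-of-the-envelope check, not anyone's official methodology: it uses only the 198-question count and the 94.6% score reported above, treats "less than half a point" as a 0.5-point upper bound, and assumes each question is an independent pass/fail trial scored in a single run (a simple binomial model, which real evaluations may not follow).

```python
import math

# Figures taken from the article.
N_QUESTIONS = 198      # size of GPQA Diamond
TOP_SCORE = 0.946      # reported top score, as a fraction
MARGIN_POINTS = 0.5    # "less than half a point", used as an upper bound

def points_per_question(n_questions: int) -> float:
    """Percentage points gained by answering one more question correctly."""
    return 100.0 / n_questions

def binomial_standard_error_points(p: float, n_questions: int) -> float:
    """Standard error of a score, in percentage points.

    Assumption: every question is an independent pass/fail trial with the
    same success probability p, scored once.
    """
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

one_question = points_per_question(N_QUESTIONS)
std_err = binomial_standard_error_points(TOP_SCORE, N_QUESTIONS)

print(f"One question is worth {one_question:.2f} points")
print(f"Standard error at the top score: {std_err:.2f} points")
print(f"Margin is within one standard error: {MARGIN_POINTS < std_err}")
```

Under those assumptions, a single question is worth roughly half a point and the standard error at the top is about 1.6 points, so a lead of less than half a point amounts to about one question and sits well inside the noise.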
The sharper fact is who holds the top line. Mythos is not for sale. Anthropic withheld it from general release because its ability to find software vulnerabilities on its own is judged too dangerous to ship, handing it instead to roughly fifty vetted defensive-security partners. The most capable model anyone has measured is the one no one can buy — and by June a successor had already crept past these figures, making April's high-water mark a footnote within two months.
When the best models cluster at the top of every test a fraction of a point apart, the tests stop being a useful way to rank them. The scores keep climbing; what they measure gets thinner.