Artificial Intelligence high · independent

Performance of a large language model on the reasoning tasks of a physician

When this study reached peer review, one word vanished from its own title — and the data explains why.

Paper · Peter Brodeur, Adam Rodman, Arjun Manrai et al. (Harvard Medical School / Beth Israel Deaconess)

The Harvard and Beth Israel team that has spent two years testing whether chatbots can out-reason doctors finally landed the capstone in Science: across six experiments, OpenAI's o1-preview matched or beat attending physicians at diagnostic reasoning, including on 76 real emergency-room cases pulled from raw hospital records rather than tidy textbook vignettes.

The preprint was titled 'Superhuman performance of a large language model on the reasoning tasks of a physician.' Peer review cut it to 'Performance.'

The headline result holds up. Given only what a nurse knows at the triage desk — the information-poorest moment in a visit — the model put the right diagnosis on the table 67% of the time versus 55% and 50% for the two human doctors. It was strongest exactly where the humans had the least to go on.

But the team's December 2024 preprint had called this 'superhuman performance.' The peer-reviewed version is titled, simply, 'Performance.' The walk-back is in the data: the model's edge clusters at cheap-to-be-wrong moments — early triage, rare puzzle cases — and evaporates on the 'cannot-miss' calls, the life-threatening diagnoses where being right is the whole job. There it beat neither ChatGPT nor the doctors. The real-ER test was also a retrospective chart review; the model never managed a live patient, and its human comparators were internal-medicine doctors, not ER specialists.

So the finding that survived review is narrower and more interesting than the one that didn't: a reasoning model is most useful as a second opinion precisely when a clinician is flying blind, and least trustworthy on the emergencies where a confident wrong answer does the most harm.

The lenses

Novelty 2

Impact · breadth 3

Impact · depth 3

Actionable 1

Substance 5

Hype 4

The facts

Peer-reviewed? Yes: Science, April 2026; the result first appeared as a preprint in December 2024

Tested on real cases? 76 real ER cases, but as after-the-fact chart review, with no live patient care

Ready for the clinic? No. The authors call for prospective trials and say it can't replace physicians

Concepts

Human–AI collaboration · Clinical Reasoning

