Nature Medicine study finds general-purpose LLMs outperform specialized clinical AI on medical benchmarks

Nabeel S. Qureshi@nabeelqu

Medicine discovers the bitter lesson: frontier LLMs (here GPT 5.2, Opus 4.6, Gemini 3.1) outperform specialized "clinical AI" (e.g. OpenEvidence) in a blind test.

Even funnier that hospital IT are more likely to approve the *specialized* versions despite them being worse.

Eric Topol@EricTopol

For medical information, general AI frontier models (Google, OpenAI, Anthropic) outperformed specialized @EvidenceOpen and @UpToDate as assessed by 12 US clinicians, randomized and blinded to which model and extensive testing/benchmarks. This was not anticipated. @NatureMedicine https://www.nature.com/articles/s41591-026-04431-5

REPLIES

Ethan Mollick@emollick

There has been a push to use OpenEvidence AI for doctors. But this paper suggests general models are much better: “Frontier LLMs outperformed clinical AI tools in all three evaluations. Clinical AI tools performed comparably to auto-enabled Google Search AI Overview on the RCQ.”

Eric Topol@EricTopol

>65% of US physicians use OpenEvidence, with 27 million prompts in April https://www.nbcnews.com/tech/tech-news/openevidence-ai-doctor-medical-physician-login-app-what-npi-uptodate-rcna341064

Eric Topol@EricTopol

Here is the performance breakdown for each model's blinded assessment for 4 major tasks: (1) clinical correctness, (2) completeness, (3) safety, and (4) clarity.

Eric Topol@EricTopol

This exemplifies the paradox of medical AI implementation https://erictopol.substack.com/p/the-paradox-of-medical-ai-implementation

Ethan Mollick@emollick

Why this is a big deal.

Eric Topol@EricTopol

>65% of US physicians use OpenEvidence, with 27 million prompts in April https://www.nbcnews.com/tech/tech-news/openevidence-ai-doctor-medical-physician-login-app-what-npi-uptodate-rcna341064

Ramez Naam@ramez

I'm not all that surprised by this. Sutton's bitter lesson tells us that generalist models trained on far more data outperform narrower models with less data.

Rohan Paul@rohanpaul_ai

A Nature Medicine study found general-purpose LLMs are now outperforming dedicated medical AI products on physician-reviewed clinical tasks.

The authors compared OpenEvidence and UpToDate Expert AI with GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 on medical exam questions, clinician-style answers, and real questions doctors asked during care.

In 100 de-identified physician questions from live clinical use, blinded clinicians again preferred the frontier models, especially on completeness and clarity,

Neal Khosla@nealkhosla

This was always going to be the case, and it continues to befuddle me that “sophisticated” investors think proprietary data will help.

That’s only true in very specific areas.

But the marketing value of domain-specific data still matters to end customers.

Nabeel S. Qureshi@nabeelqu

"Experts" really do not want to believe this (see Topol's "this was not anticipated", even though this is just Rich Sutton 101), nor do IT departments, but they'll learn eventually I guess

Blake Byers@byersblake

@ramez Actually if you look at the evals at the end, OE wins on accuracy and only loses on writing style (weird eval to include) and follow up prompt.

Kyunghyun Cho@kchonyc

if you're busy, all you need to look at in my opinion are fig. 2 (c-d)

Eric Topol@EricTopol

Yes. Widely relying on this for patient diagnosis and management w/o prospective, rigorous assessment for real world tasks until now. As the authors point out: "scale, alignment and cross-domain reasoning may outweigh domain-specific tuning as determinants of medical competency for particular tasks"

The Nurse Engineer🇳🇬@boochi_dot_dev

For the past 18 months, I’ve worked on LLMs in nursing, and I can say that generalist models (GPT-5, Gemini 3 Pro) always beat health-specialist LLMs on my benchmarks. There are two reasons for this:

1) Reinforcement learning: Because factual reasoning is distilled into general LLMs through on-policy RL with verifiable rewards, they often outperform specialised LLMs, which were usually adapted through off-policy LoRA supervised fine-tuning.

2) Parameter count: Most general LLMs have parameter counts on the order of several trillion, which will always make them better than custom-domain ones with fewer parameters.

Raj Movva@rajivmovva

But I'm not sure how much to trust these results. OE has lots of MDs in-house and I'd be surprised if they are shipping a product that their customers really think is worse. And ultimately this is a pretty non-verifiable domain, where what doctors think is the only real eval.

Raj Movva@rajivmovva

Last thought: is MedGemma actually frontier at anything? It seems like this paper is largely an indictment of specialist harnesses. Is the same thing true for specialist LLMs?

Farhad Nassiri Afshar, MD@DrNassiriAfshar

Nature Medicine just reported a remarkable result: general-purpose frontier AI models from Google, OpenAI, and Anthropic outperformed specialized medical AI tools, including OpenEvidence and UpToDate Expert AI, across MedQA, HealthBench, and blinded clinician-rated real clinical queries.

This is AI democratizing expertise in real time.

The old moat was access to specialized knowledge.

The new moat is judgment, validation, safety, and responsible deployment.

Not a replacement for physicians, a redistribution of reasoning power.

We are not watching a software update.

We are entering a technological revolution.

#AIinMedicine #MedicalAI #HealthTech

Rohan Paul@rohanpaul_ai

https://www.nature.com/articles/s41591-026-04431-5

Hannah Abrams, MD@HannahRAbrams

@nabeelqu This is 12 clinicians, of unreported specialties & training, rating the responses of an LLM to *someone else’s* question, potentially in a totally different area of medicine. Not sure this tells us much other than that frontier models are better at generating widely agreeable answers.
