For medical information, general AI frontier models (Google, OpenAI, Anthropic) outperformed specialized @EvidenceOpen and @UpToDate as assessed by 12 US clinicians, randomized and blinded to which model and extensive testing/benchmarks. This was not anticipated. @NatureMedicine https://www.nature.com/articles/s41591-026-04431-5
Nature Medicine study finds general-purpose LLMs outperform specialized clinical AI on medical benchmarks
Story Overview
An independent Nature Medicine evaluation put three frontier general-purpose LLMs against two dedicated clinical AI platforms on medical knowledge tests, clinician alignment tasks, and real de-identified physician queries, with the broad models coming out ahead in every category after randomized blinded review by twelve US clinicians.
Scaling keeps winning on narrow tasks
Gemini 3.1 Pro reached 97.4 percent on MedQA while the specialized tools trailed, echoing earlier patterns where general models trained on broad data outperform narrow fine-tunes when the evaluation stays within benchmark limits.
Real-world checks still needed before clinics
The study stresses that benchmark wins alone do not confirm deployment safety or patient outcomes, leaving open how these models would perform under live regulatory or liability scrutiny.
Supportive users celebrate the result as another instance of the bitter lesson, taking it as proof that scale and generality win, while critical users call the study dumb or attack the authors and the reliability of general models.
Most Activity
Medicine discovers the bitter lesson: frontier LLMs (here GPT 5.2, Opus 4.6, Gemini 3.1) outperform specialized "clinical AI" (e.g. OpenEvidence) in a blind test.
Even funnier that hospital IT are more likely to approve the *specialized* versions despite them being worse.
There has been a push to use OpenEvidence AI for doctors. But this paper suggests general models are much better: “Frontier LLMs outperformed clinical AI tools in all three evaluations. Clinical AI tools performed comparably to auto-enabled Google Search AI Overview on the RCQ.”
>65% of US physicians use OpenEvidence, with 27 million prompts in April https://www.nbcnews.com/tech/tech-news/openevidence-ai-doctor-medical-physician-login-app-what-npi-uptodate-rcna341064
Here is the performance breakdown for each model's blinded assessment for 4 major tasks: (1) clinical correctness, (2) completeness, (3) safety, and (4) clarity.
This exemplifies the paradox of medical AI implementation https://erictopol.substack.com/p/the-paradox-of-medical-ai-implementation
Why this is a big deal.
>65% of US physicians use OpenEvidence, with 27 million prompts in April https://www.nbcnews.com/tech/tech-news/openevidence-ai-doctor-medical-physician-login-app-what-npi-uptodate-rcna341064
I'm not all that surprised by this. Sutton's bitter lesson tells us that generalist models trained on far more data outperform narrower models with less data.
A Nature Medicine study found general-purpose LLMs are now outperforming dedicated medical AI products on physician-reviewed clinical tasks.
The authors compared OpenEvidence and UpToDate Expert AI with GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 on medical exam questions, clinician-style answers, and real questions doctors asked during care.
In 100 de-identified physician questions from live clinical use, blinded clinicians again preferred the frontier models, especially on completeness and clarity,
This was always going to be the case, and it continues to befuddle me that “sophisticated” investors think proprietary data will help.
That’s only true in very specific areas.
But the marketing value of domain specific data still matters to end customers.
"Experts" really do not want to believe this (see Topol's "this was not anticipated", even though this is just Rich Sutton 101), nor do IT departments, but they'll learn eventually I guess
@ramez Actually, if you look at the evals at the end, OE wins on accuracy and only loses on writing style (weird eval to include) and the follow-up prompt.
if you're busy, all you need to look at in my opinion are fig. 2 (c-d)
Yes. It has been widely relied on for patient diagnosis and management without prospective, rigorous assessment on real-world tasks until now. As the authors point out: "scale, alignment and cross-domain reasoning may outweigh domain-specific tuning as determinants of medical competency for particular tasks"

For the past 18 months, I’ve worked on LLMs in nursing, and I can say that generalist models (GPT-5, Gemini 3 Pro) always beat health specialist LLMs on my benchmarks. There are two reasons for this:
1) Reinforcement Learning: Because factual reasoning is distilled into general LLMs through on-policy RL with verifiable rewards, they often outperform specialised LLMs, which were typically adapted through off-policy supervised fine-tuning with LoRA.
2) Parameter Size: Most general LLMs have parameter counts on the order of several trillion, which will always make them better than custom-domain ones with fewer parameters.

But I'm not sure how much to trust these results. OE has lots of MDs in-house and I'd be surprised if they are shipping a product that their customers really think is worse. And ultimately this is a pretty non-verifiable domain, where what doctors think is the only real eval.

Last thought: is MedGemma actually frontier at anything? It seems like this paper is largely an indictment of specialist harnesses. Is the same thing true for specialist LLMs?

Nature Medicine just reported a remarkable result: general-purpose frontier AI models from Google, OpenAI, and Anthropic outperformed specialized medical AI tools, including OpenEvidence and UpToDate Expert AI, across MedQA, HealthBench, and blinded clinician-rated real clinical queries.
This is AI democratizing expertise in real time.
The old moat was access to specialized knowledge.
The new moat is judgment, validation, safety, and responsible deployment.
Not a replacement for physicians, a redistribution of reasoning power.
We are not watching a software update.
We are entering a technological revolution.
#AIinMedicine #MedicalAI #HealthTech
https://www.nature.com/articles/s41591-026-04431-5

@nabeelqu This is 12 clinicians, of unreported specialties and training, rating the responses of an LLM to *someone else’s* question, potentially in a totally different area of medicine. Not sure this tells us much except that frontier models are better at generating widely agreeable answers.