Analysis of 21 models shows gaps in early-stage decision-making, reinforcing the need for clinician oversight as AI tools enter care settings.
A new study from Mass General Brigham researchers finds that while large language models (LLMs) can often arrive at correct diagnoses when given complete patient information, they continue to fall short in clinical reasoning—particularly when working through cases with limited data.
The study, published in JAMA Network Open, evaluated 21 publicly available AI models across a series of clinical scenarios, assessing their ability to generate differential diagnoses, recommend testing, and arrive at final diagnoses. Results showed that although models achieved more than 90% accuracy in final diagnoses when provided with full case details, they struggled with earlier, reasoning-driven steps in the diagnostic process.
To better assess performance, researchers developed a new benchmarking method called PrIME-LLM, which evaluates how models perform across each stage of clinical reasoning rather than averaging results.
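The article does not detail PrIME-LLM's internals, but the core idea it describes, scoring each stage of the diagnostic workflow separately instead of pooling everything into one averaged accuracy, can be sketched in a few lines. The stage names, data, and function below are purely illustrative assumptions, not the study's actual implementation:

```python
# Hypothetical sketch of stage-wise benchmarking as the article describes it.
# All names and data are illustrative; this is not the actual PrIME-LLM code.
from collections import defaultdict

# Each record: (case_id, stage, model_answer_correct). Stages mirror the
# article: differential diagnosis, test recommendation, final diagnosis.
results = [
    ("case01", "differential", False),
    ("case01", "testing", True),
    ("case01", "final", True),
    ("case02", "differential", False),
    ("case02", "testing", False),
    ("case02", "final", True),
]

def stagewise_accuracy(records):
    """Accuracy per reasoning stage, rather than one pooled average."""
    hits, totals = defaultdict(int), defaultdict(int)
    for _, stage, correct in records:
        totals[stage] += 1
        hits[stage] += int(correct)
    return {stage: hits[stage] / totals[stage] for stage in totals}

print(stagewise_accuracy(results))
# {'differential': 0.0, 'testing': 0.5, 'final': 1.0}
# A pooled average (50% here) would hide that final-diagnosis accuracy is
# perfect while differential-diagnosis accuracy is zero: exactly the kind of
# early-stage weakness the study reports.
```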
“Despite continued improvements, off-the-shelf large language models are not ready for unsupervised clinical-grade deployment,” says corresponding author Marc Succi, MD, executive director of the MESH Incubator at Mass General Brigham, in a release. “Differential diagnoses are central to clinical reasoning and underlie the ‘art of medicine’ that AI cannot currently replicate. The promise of AI in clinical medicine continues to lie in its potential to augment, not replace, physician reasoning, provided all the relevant data is available—not always the case.”
Across the study, models failed to produce appropriate differential diagnoses more than 80% of the time. Researchers noted that while models improved as more clinical data—such as lab results and imaging—was introduced, they consistently struggled in the early stages of a case when information was limited.
“By evaluating LLMs in a stepwise fashion, we move past treating them like test-takers and put them in the position of a doctor,” says lead author Arya Rao, a researcher with the MESH Incubator and MD-PhD student at Harvard Medical School, in a release. “These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn’t much information.”
The researchers say the findings highlight the importance of maintaining clinician oversight as AI tools are introduced into clinical workflows, particularly in decision-making stages that rely on incomplete or evolving information.
As unsurprising as this is to me, I have to wonder how much of the public is aware of reports like this. Much of the public outcry focuses on people losing jobs to AI; relatively little brings up stories like this.
And consider the marketing of medical devices that somehow incorporate AI. Being retired, all I see is that the devices use it; it is never clear to what purpose.
I’m reminded of the first ICU ventilator that incorporated a microprocessor. At the time I was doing some non-job-related development with a Motorola 6800 microprocessor and was quite familiar with the emerging technology. The company marketed it as a “microprocessor ventilator.”
When the sales rep first showed up to attempt to sell it to us, I asked him why in the world anyone would want to ventilate a microprocessor.
A device incorporates AI? So what? What kind of AI? To what end? How does the manufacturer differentiate it from predicate devices? Was it cleared via 510(k), which would mean it was claimed to be substantially equivalent to a predicate device, originally one marketed before the 1976 Medical Device Amendments? (Or maybe the law has changed?)
I would love to know how a device that incorporates AI to a meaningful degree can be “substantially equivalent” to devices manufactured five years ago, let alone fifty.