Analysis of 21 models shows gaps in early-stage decision-making, reinforcing the need for clinician oversight as AI tools enter care settings.
A new study from Mass General Brigham researchers finds that while large language models (LLMs) can often arrive at correct diagnoses when given complete patient information, they continue to fall short in clinical reasoning—particularly when working through cases with limited data.
The study, published in JAMA Network Open, evaluated 21 publicly available AI models across a series of clinical scenarios, assessing their ability to generate differential diagnoses, recommend testing, and arrive at final diagnoses. Results showed that although models achieved more than 90% accuracy in final diagnoses when provided with full case details, they struggled with earlier, reasoning-driven steps in the diagnostic process.
To better assess performance, researchers developed a new benchmarking method called PrIME-LLM, which evaluates how models perform across each stage of clinical reasoning rather than averaging results.
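The core idea of stage-wise evaluation can be illustrated with a short sketch. The stage names, scoring scheme, and data below are illustrative assumptions, not details of the published PrIME-LLM benchmark; the point is only that reporting accuracy per reasoning stage can expose weaknesses that a single pooled score hides:

```python
from collections import defaultdict

# Hypothetical per-case, per-stage scores (1 = acceptable answer, 0 = not).
# Stage names and values are illustrative, not from the published benchmark.
STAGES = ["differential_diagnosis", "test_recommendation", "final_diagnosis"]

results = [
    {"differential_diagnosis": 0, "test_recommendation": 1, "final_diagnosis": 1},
    {"differential_diagnosis": 0, "test_recommendation": 0, "final_diagnosis": 1},
    {"differential_diagnosis": 1, "test_recommendation": 1, "final_diagnosis": 1},
]

def stagewise_accuracy(cases):
    """Accuracy at each reasoning stage, reported separately."""
    totals = defaultdict(int)
    for case in cases:
        for stage in STAGES:
            totals[stage] += case[stage]
    return {stage: totals[stage] / len(cases) for stage in STAGES}

def pooled_accuracy(cases):
    """One averaged score, which can mask stage-level weaknesses."""
    per_stage = stagewise_accuracy(cases)
    return sum(per_stage.values()) / len(per_stage)

print(stagewise_accuracy(results))
# e.g. the early differential stage scores far below the final-diagnosis stage
print(pooled_accuracy(results))
```

In this toy data, the pooled score (about 0.67) looks respectable, while the stage-wise view shows the model failing the open-ended differential step in two of three cases, mirroring the pattern the study describes.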
“Despite continued improvements, off-the-shelf large language models are not ready for unsupervised clinical-grade deployment,” says corresponding author Marc Succi, MD, executive director of the MESH Incubator at Mass General Brigham, in a release. “Differential diagnoses are central to clinical reasoning and underlie the ‘art of medicine’ that AI cannot currently replicate. The promise of AI in clinical medicine continues to lie in its potential to augment, not replace, physician reasoning, provided all the relevant data is available—not always the case.”
Across the study, models failed to produce appropriate differential diagnoses more than 80% of the time. Researchers noted that while models improved as more clinical data—such as lab results and imaging—was introduced, they consistently struggled in the early stages of a case when information was limited.
“By evaluating LLMs in a stepwise fashion, we move past treating them like test-takers and put them in the position of a doctor,” said lead author Arya Rao, a researcher with the MESH Incubator and MD-PhD student at Harvard Medical School, in a release. “These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn’t much information.”
The researchers say the findings highlight the importance of maintaining clinician oversight as AI tools are introduced into clinical workflows, particularly in decision-making stages that rely on incomplete or evolving information.