AI agents can beat doctors in clinical decision-making

Two AI large language models (LLMs) have shown they can match and even surpass doctors in virtual testing, although their developers say it is too early for them to be used in real-world settings.

The two agents are MIRA, which was developed by academic researchers in Germany, and Google's AMIE. They are described in two papers published in the journal Nature.

The studies provide evidence that LLMs have potential as broad toolkits that can, for example, come up with diagnoses, handle patient management, and devise care plans, extending beyond the typically narrow tasks for which they are currently used in medicine, such as diagnostic support.

Drawing on data from patient histories and laboratory, imaging, and microbiology tests, MIRA (Medical Intelligence for Reasoning and Action) was found to have a diagnostic accuracy across eight test conditions that was similar to that of two groups of mixed-experience and board-certified physicians, and was better on some, notably pancreatitis.

MIRA also proved to be better at some other tasks, such as correctly ordering surgical procedures and managing intravenous fluid and painkiller use, while 99.8% of its medicine recommendations were deemed to be correct and its therapeutic decisions were more closely aligned with clinical guidelines.

AMIE (Articulate Medical Intelligence Explorer), meanwhile, was compared with primary care physicians and was found to generate higher-rated and more precise treatment and investigation plans, which met the criteria for non-inferiority and were numerically superior across multiple measures.

Both research teams concluded that, while promising, the LLMs will need to be validated in prospective studies before their potential can be realised in real-world clinical practice, particularly as there were cases of divergence from recommended practice.

One commentator on the studies, health informatics and data science specialist Prof Julie Jacko of the University of Edinburgh in the UK, said both were rigorously conducted but demonstrate performance in a simulated environment that cannot fully capture "the complexity of real clinical decision-making."

That sentiment was echoed by Oxford sociologist Prof Catherine Pope, who said the studies are "some remove from the messy, complex, human world of everyday healthcare" where doctors must often contend with incomplete and sometimes conflicting data.

She added: "Use in the real world will need to be in partnership with clinicians: these technologies are unlikely to replace doctors, and many will contend that they crucially do not and cannot substitute for the essential human aspects of care."

Giving a clinician's perspective, cardiologist and scientist Eric Topol, director of the Scripps Research Translational Institute, said in a blog post that a crucial consideration is that both MIRA and AMIE are text-only AIs, "meaning all the other things that are part of medicine, from the patient's non-verbal communication and tone of voice to the review of actual medical images, were not included."

He added: "The [...] LLMs will keep getting better. In fact, the ones used in these [two] reports are already obsolete. You can think of MIRA and AMIE as a major step forward within the constraints of a simulation, not real medicine. But the improvements in AI's capabilities are coming fast, and it would not be surprising to see some of the benefits here extended to the actual practice of medicine."