Google chatbot tops docs on accuracy, bedside manner
An experimental medical chatbot built on a large language model (LLM) developed by Google proved more accurate than primary-care doctors at diagnosing some illnesses and scored higher on empathy in a new study.
The LLM, called AMIE (Articulate Medical Intelligence Explorer), was trained specifically to carry out patient consultations and so was optimised for “diagnostic dialogue”, according to the scientists behind the study, which has been published as a preprint on arXiv.org.
The team – which includes scientists from Google and DeepMind – stresses, however, that the conversational AI remains experimental and is in no way ready to replace the usual patient-doctor interaction.
AMIE has been designed to handle medical history-taking and diagnosis, with an emphasis on management reasoning, communication skills, and empathy, according to the team. Part of the training involved having the AI play the roles of a patient with a medical condition, an empathetic doctor, and an evaluating mentor, so that the interaction could be analysed from all sides.
The blinded study compared the AI with board-certified primary-care physicians in text-based consultations across 149 healthcare scenarios played out by patient actors. The actors were not told whether they were chatting with an AI or a doctor.
The consultations were then assessed by specialist doctors for diagnostic accuracy and by the patient actors on a range of conversation-quality measures, including politeness, making the patient feel at ease, listening to and understanding their concerns, explaining the condition and treatment effectively, appearing honest and trustworthy, and instilling confidence.
The differential diagnoses provided by AMIE were more accurate and more complete than those of the board-certified doctors across case studies spanning six medical specialties, including cardiovascular and respiratory conditions. The AI scored better on 28 of 32 measures rated by the specialists and gathered a comparable amount of information from patients.
AMIE also surpassed the physicians on conversation quality according to both the specialists and the patient actors, who rated it as better on 24 of 26 measures and non-inferior on the remainder.
One key differentiator was the length and detail of the AI’s responses. The researchers note that “this could potentially suggest to an observer that more time was spent preparing the response, analogous to known findings that patient satisfaction increases with time spent with their physicians.”
The authors note that this simulated test is a long way from how the system might perform with genuine cases in a real-world setting – for example, most of the physicians were unfamiliar with using text messaging for consultations. The results are, however, an early sign of how LLMs might be deployed at scale for initial interactions with patients.
“The utility of medical AI systems could be greatly improved if they are better able to interact conversationally, anchoring on large-scale medical knowledge, while communicating with appropriate levels of empathy and trust,” they conclude.
“This research demonstrates the significant potential capabilities of LLM-based AI systems for settings involving clinical history-taking and diagnostic dialogue.”