Analysis warns of potentially biased research into AI diagnostic tools


New research has backed the potential of artificial intelligence (AI) as a diagnostic tool, but warned that the patchy nature of available literature may have led to exaggerated claims about its performance so far.

Findings came from a systematic review and meta-analysis of all available evidence by researchers from Birmingham Health Partners, an alliance between the University of Birmingham and two of the city’s hospitals that aims to bring healthcare innovations to the clinic.

Writing in The Lancet Digital Health, the researchers said there were only a small number of high-quality studies to draw on, meaning the true power of AI as a diagnostic tool remains unclear.

The authors called for higher standards of research reporting, arguing that only with better studies will the true diagnostic power of deep learning (the use of algorithms, big data, and computing power to emulate human learning and intelligence) be revealed.

They also noted concerns that study designs so far may have been biased in favour of machine learning, and that their findings may not be applicable to real-world clinical practice.

The authors included 82 articles in the systematic review and analysed data from the 69 articles that contained enough information to calculate test performance accurately.

Pooled estimates from 25 articles that validated results in an independent subset of images were included in the meta-analysis.

Analysis of data from 14 studies that compared the performance of deep learning with that of humans on the same samples found that, at best, deep learning algorithms correctly detected disease in 87% of cases, compared with 86% for healthcare professionals.

The ability to correctly exclude patients who do not have the disease was also similar: 93% specificity for deep learning algorithms compared with 91% for healthcare professionals.
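As an illustration of how such figures are typically derived (using made-up counts rather than the study's own data), sensitivity and specificity can be calculated from a diagnostic confusion matrix along these lines:

# Illustrative sketch with hypothetical counts, not data from the review itself.
def sensitivity_specificity(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)  # share of diseased patients correctly detected
    specificity = tn / (tn + fp)  # share of healthy patients correctly excluded
    return sensitivity, specificity

# Hypothetical example: 100 diseased and 100 healthy patients.
sens, spec = sensitivity_specificity(tp=87, fn=13, tn=93, fp=7)
print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}")  # 87%, 93%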

But the authors noted limitations in the methodology and reporting of AI diagnostic studies: for instance, deep learning was frequently assessed in isolation, in a way that does not reflect clinical practice.

The authors said poor reporting was common, with most studies not accounting for missing data, which limits the conclusions that can be drawn.

Dr Xiaoxuan Liu, of the University of Birmingham, said: “There is an inherent tension between the desire to use new, potentially life-saving diagnostics and the imperative to develop high-quality evidence in a way that can benefit patients and health systems in clinical practice.

“A key lesson from our work is that in AI – as with any other part of healthcare – good study design matters. Without it, you can easily introduce bias which skews your results.

“These biases can lead to exaggerated claims of good performance for AI tools which do not translate into the real world. Good design and reporting of these studies is a key part of ensuring that the AI interventions that come through to patients are safe and effective.”

25 September 2019