Tech & Trials: Can AI save us from drowning in data?


Data – it’s everywhere! Especially in clinical trials, where an average trial of about 2,000 patients can expect to produce 3,000,000 data points across labs, vitals, medications, and adverse events.1 (And that’s from an article that’s 25 years old.) While more data often means richer insights – emerging adverse events, sub-group response analyses, and so on – it also introduces a host of challenges.

The most common issue is the most intuitive: the more data we collect, the higher the chance of errors. Maybe the automated biomarker reading from the patient’s watch didn’t trigger, resulting in a missing data point. Or the person entering data into the electronic data capture (EDC) system was typing so fast that a decimal point landed in the wrong place. Or lab values, transferred from the lab information system to the EDC via software, hit a bug that ended with a reported white blood cell count of 2,000,000. In fact, one (hardly surprising) study states that about 90% of spreadsheets with 150 or more rows contain an error – more data equals more errors.2


Value errors and logical errors

Thankfully, AI can help.

It’s useful to break the most common data errors into two types: value errors and logical errors. Value errors occur in a single data point in a clinical trial – the value could be missing, implausibly large or small, or even the wrong type of data entirely (someone writes the word “date” in a “Date of Birth” field).
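These single-point checks are straightforward to mechanise. As a minimal sketch (the field names and ranges below are purely illustrative, not from any real system), each field gets an expected type and, for numbers, a plausible range:

```python
from datetime import date

# Hypothetical field specs for illustration: each field has an expected
# type and, for numeric fields, a plausible range.
FIELD_SPECS = {
    "age": {"type": int, "range": (0, 140)},
    "date_of_birth": {"type": date},
}

def check_value(field, value):
    """Return a list of value-error descriptions for a single data point."""
    spec = FIELD_SPECS.get(field)
    errors = []
    if value is None:
        errors.append("missing value")
        return errors
    if spec is None:
        return errors  # no spec for this field: nothing to check against
    if not isinstance(value, spec["type"]):
        errors.append(f"wrong type: expected {spec['type'].__name__}")
        return errors
    lo, hi = spec.get("range", (None, None))
    if lo is not None and not (lo <= value <= hi):
        errors.append(f"out of range [{lo}, {hi}]")
    return errors
```

So `check_value("age", "date")` catches the wrong-type case from the example above, `check_value("age", None)` catches a missing value, and `check_value("age", 250)` catches an out-of-range one.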

In contrast, logical errors in clinical trial data occur when the individual data points are fine, but in combination they don’t make sense. For example, maybe someone’s recorded systolic blood pressure is 71 and the diastolic blood pressure was reported as 112. Individually, the data points look fine, but together they indicate an error, because the diastolic value is much higher than the systolic, which shouldn’t happen. This is a logical error, since the “formula” of combined values indicates something is wrong.

There are multiple ways we can leverage AI to solve both value and logical errors. The original column in this series introduced the concepts of machine learning and “classic” AI, and both techniques are useful here.

One way to find potential value errors is to use machine learning to predict whether a value is much too large or much too small, given all the other values you’ve seen before.

Essentially, you learn a “model” for each data point that narrows in on a range of acceptable values, given all the values it has been fed in training. Then, when a new value comes in, the model can predict whether it falls within a reasonable range.
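A deliberately simple stand-in for that idea: treat the “model” as mean ± 3 standard deviations learned from historical values (a production system would likely use a proper trained anomaly detector; the white-blood-cell numbers below are made up for illustration):

```python
import statistics

class RangeModel:
    """Learns a plausible range for one field from historical values.

    The "model" here is just mean +/- k standard deviations -- a toy
    version of the machine-learning approach described in the text.
    """
    def __init__(self, k=3.0):
        self.k = k
        self.low = self.high = None

    def fit(self, values):
        mean = statistics.fmean(values)
        std = statistics.stdev(values)
        self.low, self.high = mean - self.k * std, mean + self.k * std
        return self

    def is_plausible(self, value):
        return self.low <= value <= self.high

# "Warming up": the model needs enough historical values to learn a range.
wbc = RangeModel().fit([4500, 6200, 7100, 5800, 9000, 5100, 6800])
wbc.is_plausible(6000)       # a typical value passes
wbc.is_plausible(2_000_000)  # the buggy transfer from earlier is flagged
```

Note the trade-off the text describes: this works for any numeric field you can feed it (even earlobe sizes), but only after you have gathered enough values to fit the range.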


The importance of a Knowledge Base

One benefit of this approach is that it is purely data driven – I can provide a huge number of values for earlobe-size measurements, even if that’s something super rare across clinical trials, and this technique will figure out whether a value is so small that it’s suspicious. However, it therefore requires a large number of training values to effectively learn which values are “in range” and which are not. It has a cost associated with “warming up” – you need to gather enough data in your trial (or have it historically) to figure out the ranges. So, the benefit is flexibility (it should work on many types of data), but the challenge is getting enough of that data to learn the ranges.

On the other hand, we can leverage classical AI to create a “Knowledge Base.” In the previous column, logic and constraints were noted as key concepts in classic AI, and they fit perfectly for cleaning data as well.

A Knowledge Base is a data set of known types of data, complete with logic and constraints. For instance, in a Knowledge Base of human health, you might define a concept of “Age” as a number between 0 and 140. The Knowledge Base should cover most types of data seen in trials – different types of labs, different types of vitals, and even demographic information. The benefit is that it can be built once and reused across many types of trials (“build and reuse”), but the challenge is doing it with enough fidelity and scale to cover many, many concepts in health (versus, say, “EDC edit checks”, which are programmed, per trial, on a limited set of data for just that trial). But again, we can use AI to automatically mine the medical literature, previous trials, etc., to build the Knowledge Base.

One interesting aspect of a Knowledge Base is that we can then introduce constraints (just like in Sudoku, as per the previous column) to identify logical errors. For instance, if the Knowledge Base knows about systolic and diastolic blood pressure as individual data types, we can introduce a constraint that the diastolic value must be lower than the systolic. In other words, we can encode logical errors as constraints. Again, the challenge is building up this set of constraints, at scale, without placing a huge burden on developers. So, the Knowledge Base is harder to build up in the first place, to cover many data types (versus the machine-learning mechanism), but it’s easier (and more intuitive) for finding both value and logical errors, and it can be reused again and again.
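To make the idea concrete, here is a minimal sketch of a Knowledge Base (the structure and all names are invented for illustration, not any real product): per-concept ranges catch value errors, and cross-field rules catch logical errors like the blood-pressure example.

```python
# Toy Knowledge Base: ranges for individual concepts, plus constraints
# that relate concepts to each other.
KNOWLEDGE_BASE = {
    "ranges": {
        "systolic_bp": (50, 250),
        "diastolic_bp": (30, 150),
    },
    "constraints": [
        # Diastolic pressure must be lower than systolic.
        ("diastolic below systolic",
         lambda r: r["diastolic_bp"] < r["systolic_bp"]),
    ],
}

def check_record(record):
    """Check one record against the Knowledge Base; return error messages."""
    errors = []
    # Value errors: each field against its own plausible range.
    for field, (lo, hi) in KNOWLEDGE_BASE["ranges"].items():
        if field in record and not lo <= record[field] <= hi:
            errors.append(f"{field} outside [{lo}, {hi}]")
    # Logical errors: rules over combinations of fields.
    for name, rule in KNOWLEDGE_BASE["constraints"]:
        if not rule(record):
            errors.append(f"constraint violated: {name}")
    return errors

# The article's example: both values pass their individual ranges,
# but the combination is flagged.
check_record({"systolic_bp": 71, "diastolic_bp": 112})
# -> ["constraint violated: diastolic below systolic"]
```

Because the ranges and constraints live in data rather than in per-trial code, the same Knowledge Base can be reused trial after trial – the “build and reuse” benefit described above.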


Trials continue to collect more and more data, which is great for insights. But the first (and maybe most important) part of dealing with the deluge of data is making sure that it’s clean and ready for analysis. If we can use AI to do this, we can dramatically speed up our cycle times for tasks ranging from database lock (currently a very long and very manual task of data checking), to analytic insights (what adverse events are emerging in our older population?), to risk-based quality management (which sites are producing more data errors than others?).

In other words, the first step to handling the crush of data is making sure you can wrangle it into something workable – going from a huge pile of laundry to a nicely folded closet.

References

  1. https://www.ncbi.nlm.nih.gov/books/NBK224576/
  2. https://www.igi-global.com/article/know-spreadsheet-errors/55750
Dr Matthew Michelson