Solving the Big Data problem in pharma innovation

Gunjan Bhardwaj

The effectiveness of AI applications can be undermined by the volumes of unstructured data prevalent in the pharma industry. What can be done to overcome this issue? 

We live in an exciting time for the pharmaceutical industry. Cutting-edge technologies like artificial intelligence (AI) and Blockchain are making headlines and revolutionising everything from drug discovery to clinical trials. Many of these innovations are built upon the same foundation: Big Data. But a longstanding challenge within Big Data must be overcome in order for technologies like AI to achieve their full potential. That challenge is unstructured data.

Unstructured data and pharmaceutical AI  

The need to overcome this challenge can be illustrated by examining the consequences of unstructured data for the effectiveness of AI applications within the pharmaceutical and life science industries.  

As I’ve written in the past, the history of AI can be seen through the lens of three distinct waves. The first wave brought ‘knowledge engineering’ software that enabled efficient solutions to practical challenges. The second wave brought machine learning programs that enabled automated pattern recognition and advanced statistical analysis. We’ve now entered the third wave of AI, which has the power to generate novel hypotheses by analysing massive sets of data.

Third-wave AI has the potential to significantly accelerate the research and development process for new drugs, as companies like Merck & Co and Sanofi have begun to discover. Applications of third-wave AI programs have powered medical discoveries such as the connection between fish oil and Raynaud’s disease. 

But third-wave AI applications have also suffered a series of failures in healthcare and pharmaceutical contexts. MD Anderson’s problems with IBM Watson serve as a notable example. In that instance, the problems all started when MD Anderson changed its electronic medical record (EMR) provider, preventing Watson from accessing the data that it needed. This example illustrates the challenge posed by unstructured data and the corresponding need for greater data integrity within life science industries.  

Data integrity in life sciences 

Many of today’s AI programs depend on good, clean data in order to operate effectively. If access to such data is compromised, the AI program’s ability to conduct analysis and generate hypotheses is undermined.  

Data sets within the pharmaceutical and life science industries pose a particular challenge for AI programs because of the unusual density, depth, and diversity of biological data. Because the complexity of biological data renders it incomprehensible to many AI programs, the majority of pharmaceutical research today is carried out manually. Human researchers curate data, generate hypotheses, and perform experiments in much the same way that they have for decades. Lacking automation, the drug discovery, development, and testing process is inefficient, expensive, and often inaccurate.  

The inefficiency of this process causes prolonged delays between the completion of an experiment and the publication of its results in scientific journals or databases. This delay has resulted in a significant problem with publication bias and inaccuracy in the industry. Even the open-science movement, which is attempting to increase access to not-yet-published clinical research results, depends on manually curated data sets that are usually created by companies with proprietary interests.

Even heavily curated data sets are often too inconsistent to be meaningfully analysed by AI. Take, for example, the challenge posed by abbreviations and acronyms within the pharmaceutical industry. The same abbreviation may carry different meanings depending on its context. ‘Ca’, for instance, could mean ‘cancer’ in one context and ‘calcium’ in another. Most AI depends on accurate and nuanced contextual information, and manually curated data sets often fall short of this mark.
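
To make the ‘Ca’ problem concrete, here is a minimal sketch of how a program might guess the meaning of an abbreviation from the words around it. It is an illustration only, not a description of any particular product: the function name `disambiguate_ca` and the cue-word lists are hypothetical, and a real system would learn such cues from large annotated data sets rather than from a hand-written dictionary.

```python
import re

# Hypothetical cue words for each sense of 'Ca'. These lists are assumptions
# made for illustration; they are not drawn from any real vocabulary.
CONTEXT_CUES = {
    "cancer": {"tumour", "tumor", "metastasis", "biopsy", "oncology", "carcinoma"},
    "calcium": {"serum", "mmol", "channel", "bone", "supplement", "ion"},
}


def disambiguate_ca(sentence: str) -> str:
    """Guess whether 'Ca' means cancer or calcium from surrounding words."""
    # Lower-case the sentence and keep alphabetic tokens only.
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))

    # Score each sense by how many of its cue words appear in the sentence.
    scores = {sense: len(tokens & cues) for sense, cues in CONTEXT_CUES.items()}

    best_sense = max(scores, key=scores.get)
    best_score = scores[best_sense]

    # With no cues, or a tie between senses, refuse to guess.
    if best_score == 0 or list(scores.values()).count(best_score) > 1:
        return "ambiguous"
    return best_sense


print(disambiguate_ca("Pt with breast Ca, biopsy confirmed metastasis"))
print(disambiguate_ca("Serum Ca was 2.4 mmol/L"))
print(disambiguate_ca("Ca noted in chart"))
```

The third call shows the limitation: where the surrounding text carries no cues, even a careful rule can only flag the record as ambiguous, which is exactly where inconsistently curated data sets cause trouble.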

Overcoming the unstructured data challenge 

Fortunately, some of the world’s leading firms have begun to explore two possible ways to overcome these challenges. One approach is to simply improve the state of available data sets. 2009’s HITECH Act modelled this approach by standardising EMR systems to create richer, more comprehensive, and more up-to-date biological data sets. As a result, diverse data from biological patents, clinical trials, academic theses, and other sources can increasingly be analysed by advanced AI programs.

The second way to overcome the unstructured data challenge is simply to build better AI. Recent innovations have brought ‘context normalisation’ AI technology that can process and analyse unstructured, heterogeneous data points using a combination of natural language processing, machine learning, and cutting-edge text analytics. The most advanced of these AI programs are able to utilise disparate, incongruous data to generate novel hypotheses without the need for costly human curation.

Innovations like these are allowing researchers to analyse data, generate hypotheses, and conduct conclusive clinical trials at unprecedented levels of speed and accuracy. This is good news for pharmaceutical companies, medical professionals, and consumers alike.  

About the author: 

Gunjan Bhardwaj is the founder and CEO of Innoplexus, a leader in AI and analytics as a service for life science industries. With a background at Boston Consulting Group and Ernst & Young, he bridges the worlds of AI, consulting, and life science to drive innovation. 

Read more: 

The third wave of AI in pharma R&D