The value of unstructured data for drug companies
The success of pharmaceutical companies in developing drugs for treating and curing medical conditions is largely dependent on the quality and completeness of data they gather from clinical trials and real-world use. In particular, unstructured data can be of significant value to pharma companies.
Unstructured data makes up most of the data in electronic health records (EHRs), but unlike structured data (e.g., medical codes and discrete measurement values), it doesn’t reside in neat, standardised fields. Instead, it’s found in clinician notes, diagnostic summaries, referral letters, and patient communications – often in free-text or image format. This data provides deep insights into a patient’s medical history, comorbidities, disease progression, and treatment responses that pharmaceutical researchers can use to accelerate the development of drugs that improve and save lives.
Historically, this type of information has been difficult and costly to extract at scale. Manual chart reviews are labour-intensive and error-prone, and early-generation software tools such as generic natural language processing (NLP) fall short when applied to the domain-specific language of healthcare. Without the ability to integrate and cross-reference unstructured content across disparate systems, pharmaceutical researchers have struggled to identify the subtle or longitudinal patterns that could inform drug development.
Fortunately, advancements in medical AI now allow pharmaceutical companies to leverage unstructured clinical data at scale.
Untapped potential
Roughly 80% of data stored in EHRs is in unstructured form, which means it is stored in formats that are not readily searchable or analysable by traditional data tools. Examples include scanned faxes, PDF documents, dictated discharge summaries, and clinician notes (many of which include specialty-specific acronyms and other shorthand). This unstructured data could be invaluable both during the clinical trial recruitment process and in post-marketing studies that rely on real-world data (RWD).
For pharmaceutical companies, that value is greatest during critical phases such as clinical trial recruitment, safety monitoring, and post-market research. For instance, subtle clinical indicators recorded in free text could identify candidates for trials earlier. Similarly, RWD on side effects, comorbidities, or medication adherence – often buried in free-text form – can provide more complete safety profiles during post-marketing surveillance.
The ability to extract, analyse, and correlate unstructured data across diverse sources unlocks the true power of real-world evidence by providing pharma researchers with insights about disease progression (among individuals and across patient populations) and patient response to treatments. With the right tools, pharma can move beyond surface-level analytics and gain deeper, more personalised insights into how therapies perform across diverse patient populations and care settings.
NLP and LLMs: The difference
Unstructured data has been underutilised in healthcare because traditional NLP algorithms have lacked the contextual intelligence needed to accurately understand and interpret medical data. Similarly, general-purpose large language models (LLMs) trained on high volumes of generic content struggle with the specialised vocabulary and syntactic patterns of medical language. These models are also prone to producing hallucinations (fabricated outputs), which is unacceptable in a precise field like pharmaceuticals.
While NLP has long been used in healthcare for tasks like concept extraction and codification, it is most effective when working with structured or templated content. Because it cannot interpret nuanced clinical information, NLP has limited value when applied at scale. And in pharmaceutical research, scale is the name of the game.
Conversely, LLMs are all about scale. They are proficient at processing and summarising large amounts of data and responding to open-ended questions. However, training and fine-tuning LLMs to operate reliably on clinical content is costly, requires terabytes of data, and often still falls short in terms of accuracy and reliability.
Real-world pharma use cases
Fortunately, medical AI models purpose-built for specialised use cases are now available for pharmaceutical companies and other healthcare organisations. Instead of being trained on large volumes of generic data, most of which is clinically irrelevant, medical AI models are trained on specific data to perform targeted tasks, such as extracting structured outputs from EKGs, blood tests, ejection fraction tests, angiograms, and other information.
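To make "extracting structured outputs" concrete, here is a minimal sketch in Python of turning a free-text ejection fraction mention into a discrete value. It is an illustration only: a simple regular expression stands in for a trained medical AI model, and the names `EF_PATTERN`, `EjectionFraction`, and `extract_ejection_fraction` are hypothetical rather than part of any vendor's product.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern (an assumption for illustration): matches phrases such as
# "LVEF 35%", "EF: 55 %", or "ejection fraction of 40-45%".
EF_PATTERN = re.compile(
    r"\b(?:LVEF|EF|ejection fraction)\b\D{0,15}?(\d{1,2})"
    r"(?:\s*(?:-|to)\s*(\d{1,2}))?\s*%",
    re.IGNORECASE,
)


@dataclass
class EjectionFraction:
    low: int          # lower bound of the reported value, in percent
    high: int         # upper bound (equal to low when a single value is given)
    source_text: str  # the exact snippet the value came from, kept for traceability


def extract_ejection_fraction(note: str) -> Optional[EjectionFraction]:
    """Return the first ejection fraction mentioned in a note, or None."""
    match = EF_PATTERN.search(note)
    if match is None:
        return None
    low = int(match.group(1))
    high = int(match.group(2)) if match.group(2) else low
    return EjectionFraction(low=low, high=high, source_text=match.group(0))


if __name__ == "__main__":
    print(extract_ejection_fraction("Echo today: LVEF 35%, mild MR."))
    print(extract_ejection_fraction("Ejection fraction of 40-45% on repeat study."))
```

Keeping the source snippet alongside the extracted value is a deliberate choice in this sketch, since it lets a reviewer trace each structured value back to the text it came from. A pattern like this would miss many real-world phrasings, which is the gap purpose-built models are meant to close.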
Medical AI’s capabilities allow pharmaceutical companies to identify adverse events that are not captured in controlled clinical trials but emerge during real-world use, such as rare side effects, interactions with comorbid conditions, or variations in response based on demographics or lifestyle factors. Using this real-world evidence (RWE), drug companies can modify dosage guidelines and update risk profiles. RWE is also useful for pharmaceutical companies seeking regulatory approval of a new indication for a previously approved drug or meeting post-approval study requirements.
In addition, medical AI’s ability to extract data from multiple sources affords drug researchers insights into the impact of social determinants of health (SDoH) on patient compliance, clinical outcomes, and possible side effects.
Barriers to adoption in pharma
Disappointing experiences with predecessors such as generic NLP and general-purpose LLMs have made many pharmaceutical companies resistant to medical AI. This reluctance is typically exacerbated by a lack of in-house AI expertise. And while many pharma organisations are hiring AI teams, they may still be undecided about whether to build capabilities in-house or partner with external vendors.
Other barriers to adoption of medical AI include data access and integration challenges (much unstructured clinical data is siloed, fragmented, or firewalled) and concerns about data integrity and regulatory scrutiny around data auditability and traceability.
Keeping humans in the loop
To ensure data quality, it is essential that pharmaceutical companies deploying medical AI use clinical experts to verify the accuracy of data. A “human in the loop” can guard against hallucinations by using their experience and expertise to flag suspect data or misinterpretations.
Trained clinicians understand medical terminology within the context of a specific specialty practice. Thus, if an AI output for a pharmaceutical company collecting RWD for a blood-thinning medication refers to “PT”, a human in the loop would know this stands for a prothrombin time blood test, rather than physiotherapy.
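One way to route cases like “PT” to a clinician can be sketched in a few lines of Python. This is a hedged illustration, not a description of any specific product: the abbreviation table, the `ExtractedTerm` structure, the confidence score, and the 0.9 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical lookup of abbreviations with more than one clinical meaning.
AMBIGUOUS_ABBREVIATIONS = {
    "PT": ["prothrombin time", "physiotherapy"],
}


@dataclass
class ExtractedTerm:
    abbreviation: str      # shorthand as it appeared in the source document
    proposed_meaning: str  # expansion suggested by the AI model
    confidence: float      # hypothetical model score between 0 and 1


def needs_clinician_review(term: ExtractedTerm, threshold: float = 0.9) -> bool:
    """Flag a term when it is ambiguous and the model's score is below the threshold."""
    is_ambiguous = term.abbreviation.upper() in AMBIGUOUS_ABBREVIATIONS
    return is_ambiguous and term.confidence < threshold


if __name__ == "__main__":
    uncertain = ExtractedTerm("PT", "physiotherapy", confidence=0.62)
    confident = ExtractedTerm("PT", "prothrombin time", confidence=0.97)
    print(needs_clinician_review(uncertain))
    print(needs_clinician_review(confident))
```

In this sketch only ambiguous, low-confidence terms reach the reviewer, which keeps the clinician's time focused on the outputs most likely to be misread. A real deployment would set the threshold and the abbreviation list with clinical input.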
Another role human clinical experts can play for pharmaceutical companies deploying medical AI is ensuring alignment of outputs with therapeutic goals and outcomes.
When leveraged effectively, unstructured data offers pharmaceutical companies a powerful competitive advantage across the drug development lifecycle – from clinical trial optimisation to regulatory approval and real-world performance monitoring.
Pharmaceutical companies that invest now in purpose-built medical AI platforms can accelerate clinical trials and drug discovery, allowing them to innovate and stay ahead of competitors while delivering life-changing treatments to patients.
About the author
Dr Tim O’Connell is the founder and CEO of emtelligent, a Vancouver-based medical NLP technology solution. He is also a practising radiologist and the vice-chair of clinical informatics at the University of British Columbia.
