How and why navigating unstructured data in the biopharma industry is gaining attention

Will Zerhouni

i3 Analytics

In today’s environment, navigating biopharma data is fast becoming necessary to meet competitive demands: the push to innovate remains powerful even as the dollars to do so are being pulled away. i3 Analytics CEO Will Zerhouni details the whys and the hows here.

The biopharma industry and academic research centers together produce hundreds of thousands of data points annually, through reports to the FDA and other agencies and through hundreds of academic journals. Sorting through all the data points, the investigators leading the research and the news accounts of their progress and accomplishments can be overwhelming.

"Mining biopharma data is fast becoming more and more necessary to meet competitive demands..."

For example, there are more than 30,000 trials related to cancer listed on clinicaltrials.gov, and 10,000 of them are active. Of these, 168 are studies of pancreatic cancer in open enrollment. There are also 61 trials studying tegafur. When you whittle it down even more, there are three open trials examining both tegafur and pancreatic cancer.

American and European governments, philanthropic foundations and the pharmaceutical industry are pouring billions of dollars into bench and clinical research, while different branches of the same governments, health insurers and consumer groups are demanding greater transparency, reduced costs and improved care. Mining biopharma data is fast becoming necessary to meet competitive demands, as the push to innovate remains powerful even while the dollars to do so are being pulled away.

We’re just starting to see the value of big data in the biopharma industry, with the development of research data standards, multi-company collaboratives and investment dollars fostering growth in the space.

Big data brings big investment

In September 2012, ten of the world’s leading biopharma companies launched TransCelerate Biopharma Inc., a non-profit designed to accelerate the development of new medicines. Among the top priorities for this industry collaborative, whose members include AstraZeneca, Bristol-Myers Squibb, GlaxoSmithKline, Pfizer, Roche and Sanofi, is the development of industry-wide data standards to “support the exchange and submission of clinical research and meta-data, improving patient safety and outcomes.” TransCelerate also intends to develop a shared investigator portal for distributing common content and expertise.

Several multi-million-dollar deals in the biopharma big data space have also occurred recently. Elsevier bought Paris-based Aureus Sciences, Ariadne Genomics and ExitCare. Cloudera of California last year secured $65 million in venture capital funding to push its big data offering in Europe. And, this year, Stanford start-up Ayasdi raised $10.25 million in Series A venture funding.

"The data needs to be presented in a way that’s easy to use and understand, and it needs to be powerful enough to provide insight and intelligence..."

Making sense of unstructured data

Poring over these data, published papers and media coverage of a compound or mechanism of action, and putting that information in an easy-to-navigate map or snapshot, will foster innovation and discovery, ultimately improving patient outcomes and reducing costs. We need to ensure, however, that we are able to apply structure and accurately analyze data for the sake of the investigators, researchers and R&D decision makers. The data needs to be presented in a way that’s easy to use and understand, and it needs to be powerful enough to provide insight and intelligence when presenting the landscape of a certain disease, compound or mechanism of action.

There are two important elements necessary for navigating through unstructured data: deep linking and natural language processing. The data varies widely in how it is recorded. Clinical trial records, for example, may have a field that contains the medical conditions they are studying, and the same condition can be entered in different ways. For pancreatic cancer, we often see differences in spelling and word order such as “cancer of pancreas” versus “pancreas cancer”. A search also needs to pick up closely related terminology, such as pancreatic neoplasms or bile duct cancer, to be as complete as possible.
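To make the word-order and spelling problem concrete, here is a minimal sketch in Python of one way such variants might be collapsed into a single search key. It is an illustration only, not a description of any particular platform: the stopword list and the token map are hypothetical, hand-written tables, and treating “neoplasms” as equivalent to “cancer” is a deliberate simplification for search purposes.

```python
import re

# Hypothetical, hand-written lookup tables for illustration only.
# A production system would draw on a curated medical vocabulary.
STOPWORDS = {"of", "the"}
TOKEN_MAP = {
    "pancreatic": "pancreas",
    "neoplasm": "cancer",   # simplification: treated as a search synonym
    "neoplasms": "cancer",
}

def normalize_condition(text):
    """Return an order-independent key for a free-text condition name."""
    tokens = re.findall(r"[a-z]+", text.lower())
    mapped = [TOKEN_MAP.get(t, t) for t in tokens if t not in STOPWORDS]
    # Sorting makes "cancer of pancreas" and "pancreas cancer" identical.
    return tuple(sorted(set(mapped)))

variants = ["cancer of pancreas", "pancreas cancer", "Pancreatic Neoplasms"]
keys = {normalize_condition(v) for v in variants}
```

All three variants produce the same key, so a search on one finds records entered under the others. A related but distinct term such as “bile duct cancer” gets its own key, so including it in a search would need an explicit “related terms” list rather than a synonym rule.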

That’s why the key for successful data analytics lies in deep linking and natural language processing. Platforms must examine abstracts, manuscripts and published articles and extract entities, terms and conditions. There is also a need for an algorithmic ontology that groups synonyms together. In other words, HIV is the same as human immunodeficiency virus and GSK is the same as GlaxoSmithKline. So, if you want to know about GSK’s work in HIV or human immunodeficiency virus, you need to access the data quickly across multiple documents.
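As a rough sketch of the synonym idea, the Python below maps each surface form to one canonical entity and then finds documents that mention both a company and a disease, whichever name is used. The synonym table is a hypothetical four-entry stand-in for a real ontology, and the three sample sentences are invented for the example.

```python
import re

# Hypothetical synonym table: each surface form maps to one canonical entity.
SYNONYMS = {
    "hiv": "HIV",
    "human immunodeficiency virus": "HIV",
    "gsk": "GlaxoSmithKline",
    "glaxosmithkline": "GlaxoSmithKline",
}

def find_entities(text):
    """Return the canonical entities mentioned in a piece of text."""
    lowered = text.lower()
    found = set()
    for form, canonical in SYNONYMS.items():
        # Word boundaries keep short forms from matching inside other words.
        if re.search(r"\b" + re.escape(form) + r"\b", lowered):
            found.add(canonical)
    return found

# Invented sample sentences, not real headlines.
docs = [
    "GSK reports new data on human immunodeficiency virus.",
    "GlaxoSmithKline starts an HIV trial.",
    "A study of pancreatic cancer.",
]

wanted = {"GlaxoSmithKline", "HIV"}
matches = [d for d in docs if wanted <= find_entities(d)]
```

The first two sentences share no company or disease wording, yet both are returned, which is the behavior a plain keyword search would miss.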

"...the key for successful data analytics lies in deep linking and natural language processing."

Successful natural language processing techniques and algorithms go through the unstructured text of millions of documents and pull out various entities related to companies, people, diseases, molecules and mechanisms of action. That allows linkage that isn’t possible with a full document retrieval system. If a news story, a clinical trial and a publication are all talking about the same thing, they get linked together to provide a complete landscape of your search.
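One simple way to picture that linkage is an inverted index from entities to the documents that mention them. The Python sketch below assumes the entities have already been extracted by an earlier step; the document identifiers and their entity sets are invented, and a real platform would likely use something more sophisticated than exact entity matches.

```python
from collections import defaultdict
from itertools import combinations

# Invented sample records; entities are assumed to be extracted upstream.
documents = {
    "news-1": {"GlaxoSmithKline", "HIV"},
    "trial-1": {"HIV"},
    "paper-1": {"pancreatic cancer", "tegafur"},
    "trial-2": {"tegafur"},
}

def build_index(docs):
    """Map each entity to the set of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, entities in docs.items():
        for entity in entities:
            index[entity].add(doc_id)
    return index

def linked_pairs(index):
    """Return every pair of documents that share at least one entity."""
    pairs = set()
    for doc_ids in index.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs

index = build_index(documents)
links = linked_pairs(index)
```

Here the news story and the first trial are linked through HIV, and the paper and the second trial through tegafur, even though they come from different kinds of sources.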

The biopharma industry is now demanding platforms that can navigate all of its unstructured data. In June, when Ernst & Young released its Beyond Borders: Global Biotechnology Report 2012, Glen Giovannetti, Ernst & Young’s Global Life Sciences Leader, declared, “More than ever, the industry needs to remove duplication, encourage pre-competitive collaboration, pool data and allow researchers to learn in real time.”

To achieve that, these researchers need fast, easy-to-navigate platforms that bring structure to these pools of data. Without effective and efficient deep linking and natural language processing, unstructured data remains just that, unstructured.

About the author:

Will Zerhouni, President and Chief Executive Officer, i3 Analytics

Will founded i3 Analytics on the belief that data analytics should be for everyone, not just the few. With a BA in Economics from Harvard University and a JD from Harvard Law School, Will has worked in private practice on patent litigation for innovator pharmaceutical companies and advised biopharmaceutical and health care companies on patent portfolios and intellectual property.

How can pharma successfully organise its clinical trial data?