The irony of data in precision medicine: Igniting progress while creating challenges


As I regularly engage with pharma executives and biotech founders, a common theme inevitably arises: why is precision medicine taking so long to become a reality, and is technology to blame?

There’s a critical problem that’s impeding scientific discovery for life sciences firms – the limitations of technology in handling the complex multi-modal data that is the cornerstone of drug and target discovery. Put simply, life sciences firms often find themselves using technology designed for totally different applications.

Here are the three main challenges with continuing this approach:

Challenge #1 - Life sciences data is exceedingly diverse

Today’s life sciences data spans traditional, ‘everyday’ data like text files, tables, and PDFs, as well as what we call ‘frontier’ data: population genomics, single-cell bioimaging, and other multi-omics data. This data is further subdivided into ‘structured’ data, which most typically means tabular data, and ‘unstructured’ data, which subsumes everything else, from text files and images to multimodal, multi-omics data and ‘real-world’ data like electronic health records and disease registries.

Current estimates indicate that up to 80%1 of data in the life sciences industry is unstructured and comes from a wide array of sources, making it hard to analyse and utilise. Despite its vast potential, only 12% of this unstructured data is included in analysis efforts, leaving huge amounts untapped and essentially dormant.2

In aggregate, life sciences data is a precious, irreplaceable asset that serves as a foundation for innovation and often brings a clear competitive advantage to organisations. However, current systems are not adept at supporting it, and this is especially true for frontier data. As a result, organisations are left to implement disparate solutions for each data type, resulting in disjointed, hard-to-manage data silos and spiralling licensing and operational costs. To address these challenges, organisations typically adopt technologies like databases, file managers, data catalogues, and specialised scientific platforms. But, standalone, these technologies cannot fulfil the requirements of organisations focused on discovery.

For example:

  • Database technology has the power to model and analyse data securely and extremely efficiently. However, the vast majority of databases can handle only structured, mainly tabular data. Despite its importance, tabular data accounts for only a small fraction of the scientific data within most organisations. Force-fitting non-tabular data into a tabular database most often results in poor performance.
  • File managers have the ability to store any type of data in binary files. However, they do not provide any context, semantics, or specialised metadata about the underlying data modalities, which makes it very difficult to search and locate data relevant to a specific scientific workflow.
  • Data catalogues add more meaningful information about the data they are cataloguing, exposing important relationships across the different data types and making searches much more effective. However, data catalogues do not have the computational power and scalability of databases.
  • Scientific platforms are specialised for the scientific domain, offering deep understanding and functionality around the data modalities used in the life sciences industry (which may include some database, file management, and data catalogue functionality). However, these solutions are not architected to be powerful database systems, a limitation that becomes apparent with complex and demanding data modalities such as genomics and single-cell data.

With data stratified across these various solution types, it becomes nearly impossible to catalogue and search over all the different types of data being generated across teams and roles. This is further complicated by inconsistent technical ability across the sciences: a bench scientist may or may not know how to code in R or Python, and even those who do, such as geneticists or cell biologists, are unlikely to want to spend their time that way.

Organisations desperately need a way to catalogue centrally – storing and managing all of their data and code modalities, such as tables, text files, multi-omics, metadata, ML models, Jupyter notebooks, user-defined functions, and more. When organisations make it easier to search over all modalities, their teams can become much more productive in their everyday work.
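As a purely illustrative sketch (in Python, with hypothetical names rather than any particular product’s API), a central catalogue of this kind boils down to registering every asset once, whatever its modality, with searchable metadata:

```python
from dataclasses import dataclass, field

# Hypothetical, minimal illustration of a central catalogue: every asset,
# regardless of modality, is registered once with searchable metadata.

@dataclass
class Asset:
    name: str       # e.g. "cohort_A_rnaseq"
    modality: str   # "table", "vcf", "single-cell", "notebook", "ml-model", ...
    uri: str        # where the data actually lives (object store, file system, ...)
    tags: dict = field(default_factory=dict)   # free-form, searchable metadata

class Catalogue:
    def __init__(self):
        self._assets = []

    def register(self, asset):
        self._assets.append(asset)

    def search(self, modality=None, **tags):
        """Return assets matching a modality and/or metadata tags."""
        hits = []
        for a in self._assets:
            if modality and a.modality != modality:
                continue
            if all(a.tags.get(k) == v for k, v in tags.items()):
                hits.append(a)
        return hits

# Usage: register heterogeneous assets, then search across all of them at once.
cat = Catalogue()
cat.register(Asset("patient_visits", "table", "s3://bucket/visits.parquet", {"study": "NSCLC-01"}))
cat.register(Asset("tumour_scrna", "single-cell", "s3://bucket/scrna.h5ad", {"study": "NSCLC-01"}))
cat.register(Asset("qc_notebook", "notebook", "s3://bucket/qc.ipynb", {"owner": "bioinformatics"}))

print([a.name for a in cat.search(study="NSCLC-01")])  # assets from one study, any modality
```

The point is not the code itself but the design choice it represents: a single registry that treats a notebook, a table, and a single-cell dataset as peers, so that one search spans all of them.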

Challenge #2 - Life sciences data is often highly sensitive

Data and science teams must be able to collaborate, and to do so over privileged and proprietary data (patient data, which is subject to HIPAA, often falls within this category) without expensive and laborious data movement. This is very important, because an organisation with multiple data solutions for its various data types is likely to need a large team to harmonise the use of all these systems and ensure secure and compliant collaboration. Combined with exploding licensing costs from having multiple systems providing the same functionality, this can quickly lead to major cost overruns and lost productivity.

On the other hand, when all data is catalogued centrally, it can be strictly access-controlled, such that it can be safely shared among colleagues and even with other organisations. A new, single, centralised source of truth must be built with security, governance, and compliance top-of-mind, ensuring a trusted shared research environment. Such a platform should also be SOC 2 Type 2 and HIPAA compliant, and undergo regular penetration testing by third-party auditors.

In addition to access control, the platform should include detailed logging capabilities to record all the activity on all assets by all parties, as well as provide simple yet thorough auditing capabilities to validate security. Finally, organisations must have the ability to self-host the platform in their own private environments, delivering yet another layer of security to protect sensitive research and data.
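To make the idea concrete, here is a minimal sketch (hypothetical names and policies, not any product’s API) of what per-asset access control paired with an audit trail can look like: every access attempt, permitted or denied, is recorded.

```python
from datetime import datetime, timezone

# Hypothetical illustration: per-asset access control with an audit trail.
class AccessControlledAsset:
    def __init__(self, name, allowed_groups):
        self.name = name
        self.allowed_groups = set(allowed_groups)  # e.g. {"oncology-research"}
        self.audit_log = []                        # every access attempt is recorded

    def read(self, user, group):
        allowed = group in self.allowed_groups
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "group": group,
            "action": "read",
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{user} ({group}) may not read {self.name}")
        return f"<contents of {self.name}>"

asset = AccessControlledAsset("trial_registry", allowed_groups=["oncology-research"])
asset.read("alice", "oncology-research")   # permitted, and logged
try:
    asset.read("bob", "marketing")         # denied, but still logged for auditing
except PermissionError as err:
    print(err)
print(asset.audit_log)                     # the complete, queryable audit trail
```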

Challenge #3 - Frontier data is by nature large and computationally intensive

Consider here, as an example, single-cell data. As the number of single-cell datasets continues to grow, workflows that map new data to well-curated reference databases offer immense potential for the life sciences community. This is very fertile ground for precision medicine discoveries and breakthroughs.

However, single-cell experiments produce enormous volumes of unstructured data, which often end up in bespoke, complex file formats. This can severely impact performance and slow down analysis, delaying discovery.

Simply cataloguing complex data without the ability to rapidly analyse and probe it creates a big architectural gap that hinders discovery. It is possible to impose structure on complex modalities to boost performance, even on a single machine. This can be achieved by adopting multi-dimensional arrays that can be shaped to fit any data type, thus bringing ‘structure’ to unstructured data. The benefits compound when heavy computations are scaled out from a few CPUs to hundreds or even thousands of machines. In a new approach, scalable computing can be paired with the database, eliminating the need to specify and spin up large compute clusters, which can slow down time-critical computations and discovery if not handled properly, and can also drive up total cost of ownership if not efficiently utilised.
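As a rough illustration of the array idea (using generic scientific Python libraries rather than any particular platform, with made-up sizes), a single-cell expression matrix can be treated as a sparse 2-D array indexed by cell and gene, so slicing and reductions along either dimension stay fast even on a single machine:

```python
import numpy as np
from scipy import sparse

# Illustrative only: model a single-cell gene-expression matrix as a sparse
# 2-D array (cells x genes), so that slices along either dimension are cheap.
rng = np.random.default_rng(0)

n_cells, n_genes = 100_000, 20_000
# ~0.05% non-zero entries, mimicking the sparsity of real single-cell counts
counts = sparse.random(n_cells, n_genes, density=0.0005,
                       random_state=rng, format="csr")

# Slice a block of cells: cheap because the array is stored row-wise (CSR).
cells_0_to_999 = counts[:1000]

# Per-gene totals across all cells: a single vectorised reduction.
gene_totals = np.asarray(counts.sum(axis=0)).ravel()

print(cells_0_to_999.shape, gene_totals.shape)
```

The same array model extends naturally to a distributed setting, where the heavy reductions are partitioned across many machines instead of one.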

To summarise, a new approach to orchestrating life sciences data is required, one that can bring numerous benefits to organisations, including:

  • Faster time to insights – democratising complex data for scientists and facilitating collaboration leads to faster breakthroughs.
  • Simpler infrastructure – a unified, scalable, and collaborative platform for all modalities eliminates the need for numerous software and convoluted infrastructures.
  • Less engineering – data modelling, ingestion, fast analysis, and governance are all offered by a single platform, reducing the engineering effort.
  • Unprecedented speed – architected around cutting-edge database technology with unmatched speed and scale.
  • Greater economies of scale – immense cost savings due to performance, omni-modality, and lower engineering effort.

Life sciences leaders working in pharma and biotech must overcome the challenges associated with scientific data – its diversity, sensitivity and need for easier governance, and its heavy computational analysis demands. The life sciences require a new system of record, and the future of precision medicine is counting on it.

Dr Stavros Papadopoulos