Legacy data into living scientific memory

The pharmaceutical industry is racing to turn its data into scientific intelligence with AI. Now, it’s possible to connect your data and tool ecosystem.

There is a gold rush happening in pharmaceutical R&D centred around making drug discovery and development faster and cheaper, and the industry is looking to artificial intelligence to provide the picks and shovels of that revolution. Large pharma companies are pouring capital into AI partnerships, acquisitions, and internal platforms, each racing to claim first-mover advantage in a technological arms race that many believe will redefine drug discovery. The promise is well-founded: AI has demonstrated the capacity to cut preclinical development timelines by as much as 40%.1 In an industry where a single programme can cost billions and take more than a decade to reach patients, that kind of compression is transformative.

And yet, for all the excitement, most large pharma organisations share one big challenge: they are trying to build the future on shaky foundations. The data problem in pharma is fundamentally a structural one, decades in the making, and no amount of model sophistication will solve it if the underlying data remains fragmented, siloed, and scientifically decontextualised.

"Scientists, and more importantly patients, don’t have time to wait for R&D data perfection. We need ways to start capitalising on AI to accelerate research TODAY."

Large pharma companies have a legacy problem. Unlike AI-native technology companies that were born in the era of structured, cloud-native data, pharmaceutical organisations carry the weight of their history. R&D data has accumulated across decades, scattered across functions and systems that were never designed to speak to each other. Omics data lives in one silo and imaging in another, each disparate and misaligned. Clinical trial results, real-world evidence, and biomarker findings are each locked behind separate platforms, separate teams, and in many cases, separate cultures. The informatics teams and the scientists who generate the data have long operated in parallel worlds, each with their own tools, vocabularies, and incentives. The other challenge is abundance without integration: R&D teams are struggling to implement over 840 different tools and are bogged down in “data wrangling”, cleaning, transferring, and formatting data in order to make it useful.

Biotech has different woes. While many biotech organisations are AI-native, designed from the ground up with artificial intelligence as the core, foundational component rather than adding AI as an afterthought, they struggle with limited internal infrastructure and bandwidth, and their teams are under intense pressure to demonstrate validated targets in order to secure funding. Unlike big pharma, they don’t have decades of historical lab data they can unlock to make AI agents smarter.

The paradox facing the industry is that the organisations with the greatest potential to benefit from AI - those with the richest and most varied datasets accumulated over decades of research - are often the least equipped to exploit them. Their data is vast, but not usable; broad, but not harmonised; historically rich, but scientifically decontextualised. And that is precisely why AI readiness in pharma is less about acquiring new tools and more about making legacy data usable and useful.

The solution, increasingly embraced by leading organisations, is not simply to build a larger data lake. Dumping more data into a central repository without scientific context does not solve the underlying problem - it merely relocates it. What is required is the continuous ingestion, mapping, and harmonisation of heterogeneous internal and external data into an operating system that brings it all together in a continuous cycle of innovation. This is a fundamentally different construct: not a larger data lake, but a living, curated scientific memory that preserves the context in which data was generated, connects wet lab and dry lab workflows, and creates a seamless bridge between the experiment bench and the computational model.

“What is required is not a larger data lake. It is a living, curated scientific memory.”

The most forward-thinking pharma organisations are now going a step further: orchestrating AI agents that automatically retrieve, reconcile, and enrich data in response to specific R&D questions. This is highlighted by recent deals such as the NVIDIA and Lilly AI Labs co-innovation agreement, announced earlier this year. Rather than every team rebuilding analytical views from scratch for target assessment, indication expansion, or biomarker strategy, these agents do the heavy lifting, pulling from harmonised knowledge layers that have already done the hard work of integration. This shifts the role of the scientist from data curator to decision-maker, which is where their expertise most belongs.

But there is a deeper strategic dimension here that is often overlooked. The argument for AI readiness is not solely about efficiency. It is about differentiation. Every pharma company has access to the same publicly available models, the same foundational AI architectures, the same general-purpose tools. What they do not share is their proprietary, accumulated R&D knowledge - the decades of experimental data, clinical learnings, and scientific insight that are unique to each organisation. Companies that successfully embed this institutional knowledge into their AI models will not just be faster. They will be genuinely differentiated. Their models will reflect competitive intelligence that cannot be replicated, because it is derived from the specific, irreplaceable legacy of their own scientific journey.

This is why the concept of the “lab-in-the-loop” represents such a pivotal development. The ability to continuously train and refresh AI models with new experimental data - closing the feedback loop between laboratory discovery and computational inference - is the tipping point at which AI moves from a productivity tool to a unique discovery capability that in many ways reflects the data legacy of that company. When agents are updated in real time as new data flows in from ongoing programmes, each agent is continuously upskilled at its task - whether predicting toxicity, ADME properties, or other endpoints - moving from an “OK” predictor to a world-class one. This increases confidence in in-silico agents and their pipelines: they simply get smarter over time, and that improvement becomes demonstrably true as lab work confirms their predictions. It unlocks years of incredibly valuable wet lab work and research data into AI agent brains, eventually limiting and, for some use cases, eliminating wet lab work, thereby accelerating discovery and dramatically reducing costs.
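The feedback loop described above can be sketched in a few lines of code. The sketch below is a toy illustration, not any vendor’s implementation: a simple linear model stands in for a property-prediction agent and is refreshed with each new batch of simulated assay results, so its error on held-out data falls as experimental evidence accumulates. Every name here (`run_assay`, the descriptors, the learning rate) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: a measured property (e.g. a toxicity score)
# that depends linearly on three molecular descriptors.
true_w = np.array([0.8, -1.2, 0.5])

def run_assay(batch_size):
    """Stand-in for a wet-lab assay: descriptors plus noisy measurements."""
    X = rng.normal(size=(batch_size, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=batch_size)
    return X, y

# Held-out "historical" data used only to track predictive error.
X_val, y_val = run_assay(200)

def val_error(w):
    return float(np.mean((X_val @ w - y_val) ** 2))

# An online model updated by gradient descent each time new lab
# results arrive - refreshed incrementally, never retrained from scratch.
w = np.zeros(3)
lr = 0.05
errors = [val_error(w)]
for cycle in range(20):        # each cycle = one lab-in-the-loop iteration
    X, y = run_assay(32)       # new experimental results flow in
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= lr * grad             # the agent's model absorbs the new evidence
    errors.append(val_error(w))

print(f"validation error before: {errors[0]:.3f}, after: {errors[-1]:.3f}")
```

The point of the sketch is the shape of the loop, not the model: each pass through the laboratory produces data that immediately sharpens the in-silico predictor, which in turn guides what is worth measuring next.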

The biopharma companies that will win the AI race are not necessarily those that acquire the most advanced models or spend the most on technology. They are the ones that invest, first and seriously, in making their existing knowledge usable and their insights traceable. The data already exists. The scientific heritage is already there. The competitive advantage is locked inside decades of research that, in most organisations, remains frustratingly inaccessible. Unlocking it is not glamorous work. It does not generate headlines. But it is the only foundation on which genuinely differentiated, genuinely sustainable AI in pharma can be built. The arms race is real. But the winners will be decided long before the models are ever trained.

Reference

  1. BCG BioPharma trends, January 2025

About the author

Kevin Cramer is Founder and CEO of Sigmatic Sciences. He is driven by his dedication to accelerating the drug R&D process. He brings over 20 years of leadership experience in the life science bioinformatics industry, building machine-learning-ready systems for pharma R&D, which gives him an invaluable frontline perspective on the industry’s needs. His vision encompasses a unified platform designed to streamline the creation of large and small molecules, assess their efficacy, manage data capture at scale, and harness the power of data visualisation. The platform also incorporates advanced analytical and AI strategies to augment the discovery process.

About Sigmatic Sciences

Sigmatic Sciences, formerly HelixAI and a proud Sapio Sciences Company, is dedicated to transforming R&D by unifying tools and autonomous processes in a single platform, enabling scientists to deliver better medicines, faster.

We have created the Sigmatic Operating System (SigmaticOS), the only complete lab-in-the-loop platform for all scientists, allowing research to move from copilot to autopilot. SigmaticOS orchestrates 100+ agents across in silico workflows, lab automation, and model training with full traceability. By integrating scientific data from public, private and client sources with unlimited autonomous computation, SigmaticOS takes the pain out of drug discovery, closing the loop between computation and the bench, and putting human scientists into the driving seat.

Sigmatic Sciences is led by a team with decades of experience in science, AI and bioinformatics and backed by one of Europe’s leading healthcare investors, GHO Capital.

Find out more online at: www.sigmaticsciences.com

Follow us on LinkedIn
