Garbage in, garbage out: The hidden data crisis in pharma
Artificial intelligence (AI) is reshaping industries across the board, but nowhere are the stakes higher than in life sciences. Pharmaceutical companies see the potential of AI to accelerate drug discovery, streamline clinical trials, and reduce development costs that often exceed $2 billion per drug.
Yet, despite the enthusiasm, most pilots struggle to deliver results. According to MIT, 95% of enterprise AI projects fail, often because the models are fed poor-quality or irrelevant data. While the exact figure for pharma is debatable, the pattern of unmet expectations is undeniable.
The common thread running through these failures is not the sophistication of the algorithms, but the quality of the data they consume. In pharma, “garbage in, garbage out” can mean not just wasted time and money, but misleading outputs that carry regulatory, ethical, and even patient safety consequences. This hidden data crisis is the fundamental barrier to scaling AI in the sector.
What the hidden data crisis looks like
When pharma leaders talk about data problems, they are not only referring to poor-quality information. The crisis is broader:
- Irrelevant or uncurated data. Much of the data used in AI pilots is drawn from public sources or scattered across internal silos. Without curation and context, the outputs lack clinical relevance.
- Dark data. Vast troves of clinical trial results, patient histories, and imaging data remain locked in inaccessible formats or legacy systems.
- Bias baked into inputs. Even well-intentioned data collection can inadvertently encode spurious signals. A widely cited example is the “ruler problem”, where an AI model trained to identify malignant melanomas learned to treat the presence of a ruler in diagnostic photos, rather than the characteristics of the tumour itself, as a sign of malignancy. The likely cause is that lesions clinicians already found concerning were more often photographed with a ruler for scale.
These pitfalls illustrate how AI, when deprived of a clinical lens, does exactly what it is designed to do, which is to find correlations. But, without contextual guidance, those correlations can be dangerously misleading.
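One way to catch a shortcut like the ruler problem before training is to check whether an incidental attribute of the data tracks the label too closely. The sketch below is illustrative only: the field names `has_ruler` and `malignant`, the toy records, and the 0.3 threshold are all assumptions, not details from any real dataset or study.

```python
from collections import Counter


def artefact_label_rates(records, artefact_key="has_ruler", label_key="malignant"):
    """Return the positive-label rate for records with and without the artefact."""
    counts = Counter()
    positives = Counter()
    for rec in records:
        group = bool(rec[artefact_key])
        counts[group] += 1
        positives[group] += int(bool(rec[label_key]))
    return {group: positives[group] / counts[group] for group in counts}


def flag_shortcut(records, threshold=0.3, **keys):
    """Flag the dataset when the label rate differs sharply between the two groups.

    The threshold is an arbitrary illustrative value, not a clinical standard.
    """
    rates = artefact_label_rates(records, **keys)
    if len(rates) < 2:
        return False  # the artefact never varies, so there is nothing to compare
    return abs(rates[True] - rates[False]) >= threshold


# Hypothetical image metadata: rulers appear mostly on malignant cases.
sample = (
    [{"has_ruler": True, "malignant": True}] * 8
    + [{"has_ruler": True, "malignant": False}] * 2
    + [{"has_ruler": False, "malignant": True}] * 2
    + [{"has_ruler": False, "malignant": False}] * 8
)

print(artefact_label_rates(sample))
print(flag_shortcut(sample))
```

A flag here does not prove the model will learn the shortcut. It simply tells a clinical reviewer where to look before the correlation reaches the model.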
Why pharma has less room for error
In consumer industries, a failed AI pilot might mean a misdirected marketing campaign or an underperforming chatbot. In pharma, the consequences of failure ripple much further. Clinical trials are already among the most expensive and time-consuming undertakings in the sector. Delays of even six months can cost hundreds of millions of dollars in lost revenue.
Moreover, the industry operates under stringent regulatory oversight. Outputs that are not clinically valid are not just unhelpful; they can be non-compliant. Unlike in retail or manufacturing, there is little tolerance for “move fast and break things”. Pharma companies must instead move carefully and prove things.
Enterprise-ready AI requires clinically curated data
The term “enterprise-ready” has become a buzzword in AI, but in regulated markets the meaning goes beyond scalability and cloud integration. In pharma, enterprise-ready AI means meeting three standards. The first is working with clinically curated inputs, where data is cleaned, contextualised, and structured with a clinical mindset. This requires moving away from scraping public datasets and focusing instead on creating proprietary, high-quality corpora built from trial histories and real-world evidence. The second is applying robust governance. AI projects need to achieve technical benchmarks while meeting compliance frameworks that include FDA and EMA requirements, with attention to data lineage, auditability, and reproducibility.
The final standard is alignment with clinical expertise. The most promising models reflect the reasoning of an experienced researcher, rather than functioning as statistical black boxes. Models guided by clinical oversight, large language models included, are better placed to avoid misleading shortcuts such as the melanoma “ruler problem”. Without clinically curated data, AI in pharma becomes a liability rather than a strength.
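The governance standard above names data lineage, auditability, and reproducibility. As a minimal sketch of what that can mean in practice, each curation step can record a content hash of its output along with the hash of its input, so a reviewer can later confirm exactly which data a model saw. The file name, step names, and record fields below are hypothetical, and this is an illustration rather than a description of any FDA or EMA requirement.

```python
import hashlib
import json


def fingerprint(rows):
    """Deterministic SHA-256 over a list of dict rows (key order within a row is ignored)."""
    digest = hashlib.sha256()
    for row in rows:
        encoded = json.dumps(row, sort_keys=True, separators=(",", ":"))
        digest.update(encoded.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def lineage_entry(source, step, rows, parent=None):
    """Describe one curation step: what was done, to how many rows, from which parent."""
    return {
        "source": source,
        "step": step,
        "n_rows": len(rows),
        "sha256": fingerprint(rows),
        "parent_sha256": parent,
    }


# Hypothetical records standing in for a trial export.
raw = [
    {"patient_id": "P001", "age": 54},
    {"patient_id": "P002", "age": None},
]
curated = [row for row in raw if row["age"] is not None]

raw_entry = lineage_entry("trial_export.csv", "ingest", raw)
curated_entry = lineage_entry(
    "trial_export.csv", "drop_missing_age", curated, parent=raw_entry["sha256"]
)

print(json.dumps([raw_entry, curated_entry], indent=2))
```

Because the hash depends only on content, re-running the same curation on the same export reproduces the same fingerprint, and any silent change to the data shows up as a mismatch.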
How data issues surface in real-world pilots
Pharma companies often recognise data problems only once pilots are underway. Models may produce superficially compelling outputs that, upon clinical review, prove irrelevant or misleading. In imaging analysis, for instance, models may pick up on lighting or annotation artefacts instead of true biological features. In patient selection, biases in historical datasets can skew trial recruitment toward unrepresentative populations, undermining trial validity.
This is why so many pilots stall at the proof-of-concept stage. The underlying algorithms may work, but the data curation is insufficient to generate results that are both clinically actionable and regulatorily defensible.
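The recruitment skew described above can be screened for with a simple comparison between a candidate cohort and the population the trial is meant to represent. The sketch below assumes hypothetical age bands, made-up reference shares, and an arbitrary 15-percentage-point tolerance. A real analysis would use the trial's own target population and a statistically justified test.

```python
from collections import Counter


def representation_gaps(cohort, reference_shares, key="age_band"):
    """Return cohort share minus reference share for each reference group."""
    counts = Counter(person[key] for person in cohort)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cohort is empty")
    return {
        group: counts.get(group, 0) / total - share
        for group, share in reference_shares.items()
    }


def under_represented(cohort, reference_shares, tolerance=0.15, key="age_band"):
    """List groups whose cohort share falls short of the reference by more than the tolerance."""
    gaps = representation_gaps(cohort, reference_shares, key)
    return sorted(group for group, gap in gaps.items() if gap < -tolerance)


# Hypothetical cohort drawn from historical data that skews young.
cohort = (
    [{"age_band": "18-39"}] * 6
    + [{"age_band": "40-64"}] * 3
    + [{"age_band": "65+"}] * 1
)
# Illustrative reference shares, not real epidemiological figures.
reference = {"18-39": 0.3, "40-64": 0.4, "65+": 0.3}

print(representation_gaps(cohort, reference))
print(under_represented(cohort, reference))
```

A check like this is cheap to run before a pilot starts, which is earlier than most teams discover the problem.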
A hybrid model: Start-ups + in-house curation
Given these challenges, many pharma leaders face a build-versus-partner dilemma. Should they develop AI capabilities internally or look to start-ups for innovation? In practice, the answer is often hybrid. Start-ups bring cutting-edge techniques and attract AI engineering talent that big pharma struggles to recruit. Pharma incumbents bring proprietary datasets, regulatory expertise, and the infrastructure to scale.
The most successful partnerships combine these strengths. Start-ups that work with pharma companies on internal datasets, rather than relying solely on public data, tend to deliver more relevant outputs. Conversely, pharma companies that expect start-ups to solve everything with limited resources risk disappointment.
De-risking partnerships in a crowded start-up landscape
The AI-in-pharma start-up ecosystem has exploded, with more than 100 companies launched in the last five years targeting everything from molecule discovery to trial recruitment. This creates both opportunity and noise. Pharma leaders can de-risk partnerships by:
- Assessing scalability. Promising “two-person garage start-ups” may have innovative ideas, but lack the resources to handle sensitive data responsibly. Partnering requires confidence in both technology and organisational maturity.
- Validating investors. Start-ups backed by credible, well-capitalised investors are better positioned to survive long timelines and regulatory hurdles.
- Piloting with clear boundaries. Structuring engagements around specific, well-defined use cases reduces exposure while testing viability.
Building internal capability: The rise of the chief AI officer
Another notable trend is the emergence of chief AI officers within large pharma companies. These leaders, often recruited from outside traditional life sciences, bring data science expertise and serve as internal champions for AI adoption. Their role is not only to evaluate partnerships, but also to ensure internal datasets are curated, governed, and usable. This institutional investment signals that AI is no longer viewed as an experimental side project. It is becoming a core capability.
What success looks like
When pharma manages its data effectively, the benefits reach every stage of development and patient care. Shortening timelines in clinical trials by even six months can save hundreds of millions of dollars per drug and accelerate patient access to lifesaving therapies. Better patient stratification can improve trial outcomes, reduce attrition, and make results more generalisable.
Beyond trials, curated data can enable AI to predict adverse events earlier in development, optimise manufacturing processes with greater compliance, and identify new indications or patient subgroups for existing drugs.
Curated, clinically relevant data is the bridge from AI promise to AI impact in pharma. The companies that invest today in solving the data challenge will be those that deliver tomorrow’s breakthroughs faster, more safely, and more efficiently.
About the author
Erik Terjesen is managing director of Silicon Foundry, a Kearney Company. He has spent his career turning clean technology research into commercial products that make our world cleaner and more efficient. At Silicon Foundry, he advises organisations on their cleantech adoption, commercialisation, and investment strategies. Before that, he worked at Ionic Materials, where he negotiated partnerships to bring the company’s novel solid polymer technology to market in battery applications. Earlier in his career he gained experience in venture investing at HarbourVest Partners and investment banking at Robertson Stephens. Terjesen holds a BA from Harvard and an MBA from Wharton. When he’s not working, you’ll find him spending time with his family in San Diego and working on his electronic music production hobby.
