AI in Drug Discovery: Day One

There is no denying that artificial intelligence (AI) has captured the imagination of innovators working across the pharmaceutical sector. The drug discovery space is no different, with myriad applications promising to revolutionise the way that researchers and sponsors approach identifying and progressing new drug candidates.

At this year’s AI in Drug Discovery conference in London on 11th March, that same sense of enthusiasm was palpable among attendees who had travelled from across the globe to hear how AI is being used to transform preclinical drug discovery into a market forecast to reach $50 billion over the next decade.

Digital chemistry and automated target discovery

Day One began with a welcome from conference chair Dr Darren Green, head of cheminformatics & data science and senior fellow at GSK. Acknowledging the packed schedule of presentations ahead, Green kept his welcoming remarks brief before inviting his colleague Dr James Lumley, associate director of cheminformatics & data science at GSK, to the stage. Delivering the first presentation proper, Lumley set the tone for what was to be a day fuelled by complex discussions of computational chemistry.

Lumley discussed applying automation and solving what he called “the machine learning operations [ML Ops] problem” to build and scale models in minimal time. A key element, he noted, is the importance of thinking about scale in terms of deployment. Notably, he emphasised the importance, but difficulty, of getting people to understand the value of data collection, noting that “behaviour change is a beast in itself”.

Some minor technical issues meant a last-minute switch of speakers, with Karl Leswing, executive director of machine learning at Schrödinger, called up to deliver his presentation on the latest advancements in ML-enhanced in silico design and the impact on the drug discovery pipeline.

Leswing began his talk with a theme that would echo throughout many of the day’s presentations: finding that one molecule with the desired properties in a sea of millions of potential candidates is an enormous task – one that comes with a hefty price tag attached. This is where AI and ML can be highly valuable tools. To illustrate how valuable, Leswing spotlighted the recent rise of fragment-based drug discovery (FBDD), which has seen more than 50 FDA approvals.

Schrödinger, he explained, has leveraged digital chemistry to create a library of over 200 million fragments that meet the rule of three (molecular weight <300, cLogP ≤3, no more than three hydrogen bond donors, and no more than three hydrogen bond acceptors). This allows researchers to optimise multiple properties simultaneously and predict protein-ligand binding with a high degree of confidence across broad chemical space. Leswing further explained that this approach is being actively used in the Schrödinger Drug Discovery Group to identify hits for a wide range of target product profiles (TPPs), with nine screens recorded to date – all yielding multiple potent, ligand-efficient hits – and more on the way.
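
To make the criteria concrete, a minimal Python sketch of a rule-of-three filter, using the open-source RDKit toolkit, might look like the following – the fragments shown are purely illustrative and are not drawn from Schrödinger’s library:

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski


def passes_rule_of_three(smiles: str) -> bool:
    """Check the 'rule of three' as described in the talk:
    MW < 300, cLogP <= 3, H-bond donors <= 3, H-bond acceptors <= 3."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:          # unparseable SMILES fails the filter
        return False
    return (
        Descriptors.MolWt(mol) < 300
        and Crippen.MolLogP(mol) <= 3
        and Lipinski.NumHDonors(mol) <= 3
        and Lipinski.NumHAcceptors(mol) <= 3
    )


# Illustrative fragments: phenol, paracetamol, and palmitic acid
fragments = ["c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCCCCCCCCCCCCCCC(=O)O"]
print([f for f in fragments if passes_rule_of_three(f)])
```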

For generative molecular design in hit-to-lead and lead optimisation, Leswing spotlighted Schrödinger’s Autodesigner tool, including its recently added core design capability. He described Autodesigner as making “moves” in chemical space analogous to chess moves. To illustrate the tool’s potential, Leswing described a study in which Autodesigner’s performance was compared with other computational and crowdsourced design methods. Over two weeks, the crowdsourcing group produced 317 ideas; Autodesigner generated 118,000 ideas in a single day. Two compounds were progressed: one solely from Autodesigner, and one from both Autodesigner and the crowdsourced set.
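
As a rough illustration of what a single “move” in chemical space can look like (and only an illustration – this is not how Autodesigner itself works), one could enumerate simple substituent swaps on a hypothetical scaffold and keep only the chemically valid results:

```python
from rdkit import Chem

# Hypothetical scaffold with one open attachment point, plus a handful of
# illustrative R-groups; real generative designers search vastly larger
# move sets under synthetic and property constraints.
scaffold = "c1ccc(cc1)[*]"
substituents = ["C", "O", "N", "F", "C(F)(F)F", "OC"]

moves = []
for r in substituents:
    smiles = scaffold.replace("[*]", r)   # one "move": swap in an R-group
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None:                   # keep only chemically valid results
        moves.append(Chem.MolToSmiles(mol))

print(moves)   # e.g. toluene, phenol, aniline, fluorobenzene, ...
```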

Technical hiccups resolved, it was time for Dr Petrina Kamya, president of Insilico Medicine Canada Inc. and global head of AI platforms and VP at Insilico Medicine, to take the stage. Rather than sticking to theoretical examples, Kamya made an upfront commitment to enrich her talk with tangible case studies, answering the two questions she encounters most often when talking about Insilico’s work in AI drug discovery: “How do you do it, and how much is actually driven by AI?” Over eight years, she explained, the company has adeptly learned from Deep Learning strategies used in varied sectors, tailoring them to meet the intricate demands of drug discovery.

The result – Pharma.AI. Described as a fully integrated, end-to-end generative AI and robotics drug discovery engine, the system, Kamya explained, comprises three verticals: Biology 42, led by the PandaOmics app; Chemistry 42; and Medicine 42, led by the inClinico app. Highlighting AI’s applications in PandaOmics – an AI-driven automated target discovery engine – as an example, she explained that the technology is used to identify genes, diseases, compounds, and biological processes, as well as the associations between them. Moreover, AI is employed to capture trends and analyse the potential for an attention spike or the initiation of a phase 1 trial for a given target-disease association.

With the aid of AI and ML, in just four short years, the company’s pipeline has expanded to more than 30 programmes, with more than 20 peer-reviewed papers published in 2023 alone.

Enhancing the Design-Make-Test-Analyse cycle

Following a brief sojourn, attendees filed back into the presentation room. This time it was Daniel Cohen, president of Valence Labs and VP of Recursion Pharmaceuticals, taking to the stage to discuss the impact of large-scale models in drug discovery.

Created in mid-2023, when Valence Discovery joined forces with the biotech company Recursion, Valence Labs, Cohen explained, has an ambitious vision, in which “the future of scientific discovery will be driven by autonomous agents capable of formulating and testing hypotheses at unprecedented scale.”

The strategy for achieving this goal comprises three main pillars: foundational models of biology and chemistry, an inference engine coupled with active learning techniques, and orchestration and reasoning systems designed for autonomous hypothesis testing with minimal human intervention. Cohen pointed out that a critical challenge in the current approach to molecular design is its ineffectiveness at systematically prioritising molecular representations – and suggested that adopting trends from the broader machine learning community could enhance the prediction of molecular properties.

Delving deeper into the concept of active learning, Cohen emphasised its importance in drug discovery, particularly in the context of pairwise comparisons, where it can drastically improve efficiency and outcomes over random search methods by focusing resources on more promising candidates.
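
A toy sketch helps show why this matters. In the hedged example below (synthetic numbers, and not Valence Labs’ actual method), rather than labelling random pairs, we pick the pair of candidates whose head-to-head comparison the surrogate model is least sure about – similar predicted scores and high combined uncertainty:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy surrogate-model outputs for 1,000 candidate molecules:
# a predicted score and an uncertainty for each (all values synthetic).
pred_mean = rng.normal(size=1000)
pred_std = rng.uniform(0.1, 1.0, size=1000)


def most_informative_pair(mean, std, n_candidates=5000):
    """Return the candidate pair whose comparison outcome is most uncertain,
    rather than a randomly chosen pair."""
    i = rng.integers(0, len(mean), size=n_candidates)
    j = rng.integers(0, len(mean), size=n_candidates)
    valid = i != j
    i, j = i[valid], j[valid]
    # Probability that i beats j under a simple logistic comparison model
    z = (mean[i] - mean[j]) / np.sqrt(std[i] ** 2 + std[j] ** 2)
    p = np.clip(1 / (1 + np.exp(-1.7 * z)), 1e-9, 1 - 1e-9)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    best = int(np.argmax(entropy))        # least predictable comparison
    return int(i[best]), int(j[best])


print(most_informative_pair(pred_mean, pred_std))
```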

Moving on to the subject of large language models (LLMs) in drug discovery, Cohen noted their potential as reasoning and organisation systems. To illustrate his point, Cohen demonstrated how Lowe, Valence Labs’ LLM-based software, can, when prompted by a human-generated query, autonomously navigate through various steps, including data retrieval and analysis, to generate actionable insights. He tasked Lowe with identifying targets involved in non-small cell lung cancer (NSCLC), showcasing the model’s ability to provide a comprehensive summary of the query, employ appropriate tools, and offer valuable information drawn from both public and proprietary data sources.

This process, he explained, not only yields a list of potential compounds visualised in a chemical space, but also enables the identification of similar compounds through natural language prompts, facilitating further exploration and validation by human researchers. “The sky is the limit with these types of approaches,” he concluded.

Next up was a presentation by Philippe Moingeon, Professor at Paris-Saclay University & head of immuno-inflammation at Servier Pharmaceuticals, on virtual patients and causal disease models to predict drug efficacy in silico.

Moingeon discussed new forms of disease representation to simulate drug candidate efficacy, including digital twins – such as the Living Heart model developed by the FDA and Dassault Systèmes to certify pacemakers and stents – as well as causal disease models representing disease genetics and virtual patients.

For diseases like lupus and Sjögren’s syndrome, Moingeon highlighted that Servier has partnered to profile patients at a deep molecular level and to build causal disease models that capture patient heterogeneity. This allows the right targets to be identified using dedicated models as the basis for precision medicine.

As an example, Moingeon explained that Servier’s lupus model identified four distinct patient subsets, each defined by different master regulator genes relative to healthy individuals. Virtual patient models combine biological and clinical data to simulate how a drug would perform in a given patient.

Using data from a recent AstraZeneca clinical trial, Servier created up to 20,000 virtual patients, but focused on 241 that mirrored the real patient data. Moingeon described an industry shift towards “mixed reality”, combining predictive modelling and empirical studies to predict drug efficacy and safety across billions of virtual compounds and patients.

Following Moingeon was the penultimate session for the morning, led by Dr Grégori Gerebtzoff, director of pharmacokinetic sciences (PKS) at Novartis Biomedical Research. A pivotal aspect of his talk revolved around the rationale for developing proprietary models at Novartis. Gerebtzoff highlighted that the quality of models significantly depends on the experimental data used for their training. In-house data, according to him, can, in some instances, provide a richer, more relevant foundation for model building compared to externally sourced data. This has led to the creation of models of varying sizes within Novartis, each tailored to the specific nuances of their proprietary data sets.

Delving into the comparison between local and global models, he addressed a common misconception that local models, which are built on a narrower set of data, are inherently superior to global models that utilise a broader data spectrum. He argued against this notion, presenting evidence that local models often suffer from abrupt changes in performance due to their limited data scope. In contrast, global models benefit from a time-split approach that can provide a more stable and reliable performance across a wider range of compounds.
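
The point about time splits is worth making concrete. A small, hedged sketch on synthetic data (scikit-learn here, not Novartis’s models) shows how a random split can flatter a model when the underlying chemistry drifts over time, whereas a time split better mimics prospective use:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an assay dataset: descriptors X and an endpoint y
# whose structure-activity relationship drifts as the project evolves.
rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 20))
drift = np.linspace(0, 2, n)                       # rows ordered by date
y = X[:, 0] + 0.5 * X[:, 1] * drift + rng.normal(scale=0.3, size=n)

# Random split: past and future compounds are mixed into training.
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
rf = RandomForestRegressor(random_state=0).fit(Xtr, ytr)
r2_random = r2_score(yte, rf.predict(Xte))

# Time split: train on the earliest 80%, test on the newest 20%,
# which is closer to how a global model is actually used prospectively.
cut = int(0.8 * n)
rf = RandomForestRegressor(random_state=0).fit(X[:cut], y[:cut])
r2_time = r2_score(y[cut:], rf.predict(X[cut:]))

print(f"random-split R2: {r2_random:.2f}  time-split R2: {r2_time:.2f}")
```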

It was then time for our final pre-lunch presentation of the day. Dr Yogesh Sabnis, director and lead design at UCB Biopharma srl, began his talk with a thoughtful gesture – setting a timer to ensure that hungry attendees could enjoy their lunch on time. Turning his focus to the topic at hand, Sabnis set about exploring the intricate relationship between humans and technology, posing reflective questions on our readiness to make decisions and how technology can empower us to make better choices more consistently. He highlighted how UCB Biopharma has successfully harnessed technology to improve decision-making processes in small molecule research and development (R&D), thereby bridging the gap between technological advances and practical applications.

The D2P2 platform at UCB, designed to enhance the Design-Make-Test-Analyse (DMTA) cycle, embodies this digital ambition. Sabnis outlined the platform’s dual goals and its implementation, which is geared towards minimal “babysitting” or maintenance, echoing his earlier statements. He then elaborated on the four critical aspects of D2P2 – data science, algorithms, integrated development, and analytics and reporting – underscoring the platform’s collaborative nature and its comprehensive coverage of small molecule profiles at UCB.

An essential tenet of D2P2, according to Sabnis, is the principle that “anything that can be automated, should be automated,” ensuring the platform’s accessibility and practical utility.

Small molecules, operationalisation, and the value of open-source

Kicking off the post-lunch session with fresh enthusiasm, Dr Gerhard Hessler from Sanofi delved into the cutting-edge realm of batch learning for small molecules, building on the morning’s discussions on drug discovery optimisation. He presented an innovative approach focused on leveraging probability properties and machine learning across chemotypes to address the complex, multi-property challenges in drug discovery.

Highlighting the dynamic nature of compound modelling, Hessler noted the necessity of frequent model updates, shifting from monthly to bi-weekly, to enhance predictivity in global ADME models.

A significant portion of his talk concentrated on uncertainty predictions, particularly employing Monte Carlo dropout techniques for reducing error rates in novel compound predictions. This method, he explained, has led to more accurate and reliable models by excluding the most uncertain predictions.
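
For readers unfamiliar with the technique, a minimal sketch of Monte Carlo dropout in PyTorch (toy network and data, not Sanofi’s actual models) looks roughly like this – dropout is left switched on at inference time, and the spread across repeated stochastic passes is read as a per-compound uncertainty:

```python
import torch
import torch.nn as nn

# Toy property-prediction network; architecture and data are illustrative.
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

x = torch.randn(32, 128)            # e.g. 32 compounds x 128 descriptors

model.train()                       # keep dropout active during inference
with torch.no_grad():
    # 50 stochastic forward passes through the same network
    samples = torch.stack([model(x) for _ in range(50)])

mean = samples.mean(dim=0)          # predicted property value per compound
std = samples.std(dim=0).squeeze()  # spread across passes ~ uncertainty

# Exclude the most uncertain predictions, as described in the talk
confident = mean[std < std.quantile(0.8)]
print(confident.shape)              # roughly the 80% most confident compounds
```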

Notably, Hessler introduced Active Learning In data ExploratioN (ALIEN), an uncertainty-based selection process for new samples that optimises the efficiency of labelling costly data. He concluded that active learning, especially in batch mode, significantly improves predictive models, providing a key takeaway for enhancing drug discovery processes.
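
Batch-mode selection adds one more wrinkle: picking the ten most uncertain compounds individually often yields near-duplicates, so selections are usually spread out. The sketch below is a generic, illustrative greedy strategy (it is not the ALIEN implementation) that trades off uncertainty against similarity to compounds already in the batch:

```python
import numpy as np


def select_batch(features, uncertainty, batch_size=10, diversity_weight=0.5):
    """Greedily pick a batch for labelling: highest remaining uncertainty,
    down-weighted for candidates close to those already selected."""
    selected = []
    score = uncertainty.astype(float).copy()
    for _ in range(batch_size):
        idx = int(np.argmax(score))
        selected.append(idx)
        # Penalise candidates that sit close to the newly chosen compound
        dist = np.linalg.norm(features - features[idx], axis=1)
        score = score - diversity_weight * uncertainty * np.exp(-dist)
        score[selected] = -np.inf      # never pick the same compound twice
    return selected


rng = np.random.default_rng(2)
feats = rng.normal(size=(500, 16))     # toy descriptor vectors
unc = rng.uniform(size=500)            # e.g. MC-dropout standard deviations
print(select_batch(feats, unc))
```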

Continuing the momentum set by Hessler, Dr Christophe Chabbert from Roche took to the stage to discuss the operationalisation of machine learning at speed and scale within the Roche drug discovery pipeline. AI, he explained, can play a vital role in fostering a more agile framework for discovery and delivery, a notion highlighted by a visual depiction of how strategic deployment of ML across research workflows and the prioritisation of a portfolio of ML models can be used to tackle operational challenges.

Addressing the gap between end-users, such as medicinal chemists, and model owners, Chabbert shared how Roche’s creation of an MLOps team has bridged this divide, proving essential for scientists integrating AI into drug discovery. Leveraging open-source tools, he explained, has simplified the process for model owners, removing the need for researchers to have deep software expertise in AI tooling and allowing ambitious scientists to take a more generative approach to their models. Chabbert concluded by emphasising the characteristics of successful teams and model owners in this space: a clear understanding of the problems ML and AI can solve in drug discovery, a focus on operationalisation, diligent model monitoring, effective use of the data landscape, and the ability to connect people and break down silos.

Next up to present was AstraZeneca’s associate director of medicinal chemistry, Dr Anders Hogner. Highlighting three major challenges in drug design – target identification, optimising properties, and exploring the chemical space – Hogner introduced AstraZeneca’s “Augmented Drug Design” programme. This initiative, he explained, significantly reduces the time computational chemists spend on benchmarking and model building, with general AI taking 20% less time to complete similar tasks.

Addressing the challenge of property optimisation, Hogner spotlighted AstraZeneca’s “Predictive Insights Platform” (PIP) – fully scalable and cloud-based – as an example of how even good models and data can be hampered by inconsistent performance across different chemical series. “We have good models and good data,” he explained, “but this is not always the case.” A proposed solution involves tools that help project teams better utilise these models by selecting appropriate experimental thresholds, thus improving the chances of identifying the right compounds.

In exploring chemical space, Hogner noted the limited utility of merely generating virtual compounds. He discussed how the consumption of information has evolved, advocating for a more curated approach to scientific data. To this end, he explained, AstraZeneca developed ASERIA, an automated series analyser designed in collaboration with chemists to provide a comprehensive project overview, including new compounds, predictive models, and notable matched molecular pairs. Although still under refinement, ASERIA represents a significant step towards integrating AI more effectively into drug design.

The final presentation of the session was left to Matt Armstrong-Barnes, chief technology officer of artificial intelligence for Hewlett Packard Enterprise (HPE), who began his discussion with an unusual choice of image – a penguin. This may have appeared quite random to those not in the know, but, as Armstrong-Barnes quickly explained, the penguin symbolised Tux, the official brand character of the free and open-source Linux kernel. He then surveyed the audience on their awareness and use of open-source systems. While hands rose across the room when asked about awareness, when it came to the question of active participation, the contrast was striking. This significant drop in contributions perfectly complemented Armstrong-Barnes’ first statement: driving open-source development forward is a collective responsibility.

Having set the tone for what was to come, Armstrong-Barnes drew attention to the essential nature of data for AI, which he paralleled with the necessity of open-source for AI’s existence, citing historical milestones like Google’s extensive open-sourcing in 2015. Layering open-source initiatives to construct a comprehensive AI infrastructure is particularly important, he explained, from foundational services to deployment platforms, ensuring that open-source components are integral at every stage.

Highlighting the sustainability aspect, he pointed out the environmental impact of training large language models and the challenges of scaling AI projects. Armstrong-Barnes emphasised the distinction between data scientists and programmers, advocating for the right tools and environments to enable specialists to excel in their respective fields.

Furthermore, he discussed the operational aspects of AI projects, noting that the bulk of the work lies beyond writing ML code – areas where the open-source community excels. Yet, navigating the rapidly evolving AI and ML landscape is complex, with a small percentage of the community contributing the majority of open-source code.

Concluding his talk, Armstrong-Barnes stressed the importance of a data-first approach in AI native architecture and the need for open-source solutions to offer flexibility, mitigate skill gaps, and provide cost-effective alternatives to monolithic, vendor-locked systems. He urged for fast failure in experimentation, leveraging existing open-source solutions, and keeping abreast of emerging regulations and alliances within the AI infrastructure space.

Biodiversity, FBDD, and AI vs the spliceosome

“A lot of biotech innovation comes from outside of humans, but we have studied only a tiny fraction of life on earth,” said Philipp Lorenz, CTO of Basecamp Research, as he took to the stage to discuss how next-generation life sciences AI models and gene therapies are being powered by a global biodiversity data supply chain, spanning five continents. Beginning with a brief history of the unusual company, he explained that Basecamp Research has forged partnerships with Nature Parks worldwide, covering 60% of global biomes to create a vast biological data supply chain.

This wealth of data feeds into the company’s knowledge graph, BaseGraph, which boasts over six billion relationships connecting hundreds of millions of unique protein and genome sequences with their evolutionary context. Lorenz highlighted BaseGraph’s significant sequence diversity – five times greater than what’s available in public databases – allowing them to push the boundaries of Deep Learning models in areas like functional annotation, structure prediction, and protein design using LLMs. This is particularly important, as “using environmental sequences from public databases for commercial purposes without benefit sharing is illegal in at least 16 countries,” Lorenz noted.

Basecamp Research’s approach, he explained, not only seeks to map uncharted biological diversity, but also to address challenges such as biopiracy by navigating the complex regulatory landscape across different countries. With over 70 expeditions in 24 countries, their effort underscores the indispensable role of hands-on exploration in understanding Earth’s biodiversity. “There is no shortcut – you actually have to go to these places,” he laughed.

Lorenz shared insights into how Basecamp Research utilises this diverse dataset to refine Deep Learning models, particularly in structure prediction. By supplementing AlphaFold2 with their sequences, Basecamp achieved superior performance on CASP15 targets, with notable improvements observed in the C-terminus of protein structures. Moreover, their contrastive Deep Learning approach in BaseAnnotate has set new standards in functional annotation, identifying 20 previously unannotated kinases in the human proteome.

The following presentation, delivered by Dr Carl Poelking, senior researcher in computational chemistry & informatics at Astex Pharmaceuticals, focused on the transformative role of AI in FBDD. This strategy, he explained, synergises AI-driven technologies with human expertise to enhance the design of preclinical candidates by leveraging the rich structural context inherent to FBDD.

He discussed specialised predictive and generative technologies developed at Astex, which incorporate prior knowledge along with structural, synthetic, and directional constraints into the design process. These technologies are pivotal in improving the efficiency and effectiveness of drug discovery, enabling better structures and designs faster than traditional methods.

One significant area of focus was the use of cryo-electron microscopy (cryo-EM) for targeting new entities. Despite being an experimental technique with inherent challenges, such as latent variable problems and time-consuming data collection, Poelking highlighted computational advancements that are beginning to expedite the “noise to reconstruction” process, thus speeding up research.

Co-founder & CTO of Envisagenics, Dr Martin Akerman, concluded the day’s presentations with an overview of SpliceCore, an innovative AI-powered target discovery platform focused on identifying novel and tumour-specific epitopes for immunotherapeutic development.

He began with an overview of the SpliceCore platform, emphasising its precision in pinpointing tumours characterised by splicing errors. The spliceosome, he explained, is one of the largest molecular structures in the cell, comprising more than 300 proteins. It is highly dynamic; however, it is also very vulnerable and highly error-prone. In fact, Akerman explained, splicing deregulation is a common characteristic of cancerous cells.

Akerman delved into the methodology behind SpliceCore, which involves analysing exons – the smaller gene building blocks sufficient for drug discovery. By focusing on trio sets of exons from the data, SpliceCore simplifies the intricate puzzle of full-length transcripts into manageable pieces, minimising information loss and enhancing target identification accuracy.
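
As a simplified illustration of the idea (the exon labels and windowing below are hypothetical, not SpliceCore’s actual pipeline), breaking an ordered transcript into overlapping exon trios can be expressed in a few lines:

```python
def exon_trios(exons):
    """Enumerate consecutive (upstream, middle, downstream) exon trios from
    an ordered exon list - a toy version of reducing full-length transcripts
    to smaller, analysable units."""
    return [tuple(exons[i:i + 3]) for i in range(len(exons) - 2)]


# Hypothetical six-exon transcript
transcript = ["E1", "E2", "E3", "E4", "E5", "E6"]
for trio in exon_trios(transcript):
    print(trio)   # ('E1', 'E2', 'E3'), ('E2', 'E3', 'E4'), ...
```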

With SpliceCore capable of analysing more than 14 million splicing events, Akerman highlighted the platform’s efficiency in navigating the vast search space for reliable exon trios, employing a hierarchical mapping approach to ensure comprehensive coverage without overlooking potential targets. This process, which he likened to managing an investment portfolio, involves diversifying criteria to mitigate risks and optimise the identification of novel, prevalent, and safe targets for therapeutic intervention. “We need to diversify,” he said, “because we don’t have a crystal ball. We want to spread the risk a bit.”

Returning to the stage, Dr Darren Green closed the day’s events with a thank you to those in attendance, and a promise that Day Two would offer an equally thought-provoking agenda.

Stay tuned for our coverage of Day Two, coming soon.