Data-driven disease research harnessing the genome

Ben Hargreaves speaks to the UK Biobank to learn more about their work building a dataset of 500,000 individuals. The project is expected to deliver breakthroughs in the understanding of disease, and its success has allowed the organisation to secure funding for future projects.

The pharmaceutical industry’s speed in adopting the technology driving the digital revolution has often been questioned. However, the COVID-19 pandemic kicked this transition into a higher gear. Data is now being discussed more than ever, particularly in relation to understanding disease and accelerating drug discovery.

The reality across the healthcare space is that data has been leveraged for a number of years to better understand the human body in general, and disease progression specifically. More than a decade ago, the 100,000 Genomes Project was launched to sequence genomic data from 100,000 participants, in order to better understand the role of genes in health and disease.

At the end of November 2023, the UK Biobank went further than this earlier project, releasing whole-genome sequencing data from 500,000 volunteers. The data has been made freely available to scientists worldwide, provided they have a suitable research project for its use. It is accompanied by 15 years of information on many facets of the participants’ health and lifestyle, creating a wealth of information that has already led to breakthroughs in the understanding of disease progression.

Data and disease discoveries

The work carried out by the UK Biobank was funded through £200 million ($253 million) of investment, and the organisation states that the dataset is twice as large as any comparable project. Logging the entire genetic code of 500,000 people created 27.5 petabytes of data (one petabyte is equal to 1,000 terabytes). This is coupled with 15 years of information from the participant cohort, including lifestyle information, whole-body imaging scans, biological samples, and other health information relevant to research.

Naomi Allen, chief scientist at the UK Biobank, spoke with pharmaphorum, explaining how important the gathered data could be. Allen broke down the significance into three potential breakthrough areas: discovering new medicines, creating a deeper understanding of how variants in the genome can influence health, and generating more knowledge on specific diseases.

On the first point, Allen stated that medicines developed with supporting evidence from genetics are more than twice as likely to progress from phase 1 to approval as drug candidates without this evidence. Selecting a suitable therapeutic target is an important decision for pharma companies, and having the dataset to work from could allow them to better predict how a medicine might affect different patient groups prior to trials.

On the second point, Allen highlighted how little is known about the function of most of the human genome, noting that research has focused on protein-coding genes and their impact on disease risk, even though these make up only approximately 2% of the genome. According to Allen, a study using previously released data from the UK Biobank has found examples where rare variants are associated with specific genetically determined characteristics. In the future, there are expectations that thousands or hundreds of thousands of genetic variants that contribute to disease will be found in the dataset.

Finally, Allen told pharmaphorum how, for many illnesses, such as Parkinson’s, Alzheimer’s, and autoimmune diseases, the underlying origins are poorly understood: “This amazing new dataset allows scientists to explore how genetics affect levels of proteins, metabolites, and other physiological factors, more closely than ever before, which will accelerate our understanding of the genetic underpinnings of disease.”

When it announced the release of the data, the UK Biobank revealed that over 30,000 researchers from more than 90 countries had registered to use the database for research. More than 9,000 peer-reviewed papers have already been published based on early access to the data.

Collaboration stokes breakthrough

The work carried out by the UK Biobank was funded by the Wellcome Trust, UK Research and Innovation, and four pharma companies: Amgen, AstraZeneca, GSK, and Johnson & Johnson. Together, these funders provided the £200 million required to finance the project through to completion.

“This data wouldn’t have been possible without pharma’s input […] This collaboration has not only cost £200 million, but it has also taken a huge amount of expertise, focus, and drive to pull this together. The project has taken five years and over 350,000 hours of genome sequencing and required a huge amount of collaboration to be successful,” Allen said of the role that pharma played in the project.

In addition, Amgen’s subsidiary, deCODE Genetics, was responsible for the DNA sequencing and additional informatics processing support, in collaboration with the Wellcome Sanger Institute. According to the UK Biobank, the industry consortium also led efforts to process and ‘joint call’ the genomes using the ‘Dragen’ pipeline on AWS infrastructure, enabling the data to be transformed into a single combined genetic dataset.

In return for their investment in the UK Biobank’s research, the pharma companies were provided with nine months of exclusive data access. The UK Biobank noted that the four pharma companies involved plan to publicly share their summary statistical analyses, including genome-wide association results, with the wider research community.

What next

Following the success of building the world’s largest genetic dataset, Allen was optimistic about future plans. The organisation has already secured philanthropic funding from Schmidt Futures and Ken Griffin, which has been matched by the UK government’s Department for Science, Innovation, and Technology. According to Allen, this will allow the organisation to pilot its planned projects.

Allen outlined what these will be: “One of these builds on one of the greatest strengths of UK Biobank, which is having repeated measures over time for our half a million volunteers. We are investigating repeating the entire breadth of measurements that we took when they were first recruited between 2006 and 2010.”

“Another project being considered is in response to the huge demand from the scientific community for us to collect data that will enable research into the determinants of subtypes of neurodegenerative disease (like Alzheimer’s and Parkinson’s disease) and cancer,” Allen continued. “And technology now offers really exciting ways that we might be able to capture important data, such as with wearable technology.”