UK Biobank and NIH's 'All of Us' release large-scale datasets

Each data release is an invitation to the research community to ‘dig in.’
“Happy birthday to you, happy birthday to you, happy birthday, dear All of Us, happy birthday to you,” sang Francis Collins, the director of the National Institutes of Health. He chose a musical ending to his presentation at an event dedicated to the one-year anniversary of 'All of Us,' an NIH project that has set out to enroll one million people to share their health and medical data for research purposes.

This week, the beta version of the ‘All of Us’ data browser has ‘gone live.’ The browser is part of the project’s cloud-based curated data resource and offers summary statistics of de-identified information about 116,000 ‘All of Us’ participants, Eric Dishman, director of the 'All of Us' research program, said in his presentation.

The data will grow over time and the portal will keep offering such ‘data snapshots.’ It now contains electronic health data, responses to survey questions about lifestyle habits such as smoking, and some of the physical measurements of the participants, such as their blood pressure. One can currently see, for example, that one of the top ten conditions is ‘pain,’ he said.

As Nora Volkow, director of the National Institute on Drug Abuse, pointed out in her presentation, the opiate crisis in the US is responsible for, among other things, a decrease in the country's life expectancy. Patients suffering from pain are highly vulnerable to the risks of addiction. Approximately 20% of people in the US suffer some form of chronic pain. ‘All of Us’ has a chance to help explain, on an individual level, which genetic and other factors contribute to vulnerability and heightened addiction risk. Epidemiological data yield group-level findings about who might be more likely to become addicted to opiates. “As an individual, how does that translate to me?” she asked. Understanding this would permit a tailored intervention.

More data and a ‘Researcher Workbench’ with data analysis tools for 'All of Us' are in development and will continue to be refined, said Collins.

The workbench launch is slated for this winter, and it will include data from over 200,000 participants, including their electronic health record data and physical measurements. Genomics data and data from wearables will be added sometime in 2020, according to the project’s timetable, Collins said. Along the way, the ‘All of Us’ team is performing usability, security and privacy testing.

Separately, in March, the UK Biobank released exome sequence data from 50,000 people with the goal of improving prevention, diagnosis and treatment of many conditions and diseases. That project aims to collect data about 500,000 people and follow their health. These data are linked to the individuals’ National Health Service records. Projects of this type are being pursued in a number of countries including Iceland, Japan and Canada.

NIH's 'All of Us' is seeking to enroll one million participants. (gilaxia)

A separate, older large-scale data-collection project is run by the World Health Organization's International Agency for Research on Cancer.

In the European Prospective Investigation into Cancer and Nutrition (EPIC), more than half a million participants in ten European countries were enrolled between 1992 and 1997 to track information about diet, lifestyle, weight and medical history. Participants were contacted every few years to update the data about them. The data are linked to cancer registries in a number of the participating countries. There are blood samples from nearly 500,000 people. Information about access to these data and how to submit an application is available on the project web site.

Data deep-dives

Collecting and analyzing more data and more diverse data in ‘All of Us’ and the UK Biobank will let researchers do data deep-dives to find patterns associated with health and disease. They can explore the variety of genomic factors that might confer risk, or look at the effectiveness of various preventive measures.

Perhaps, said Collins, these data might reveal that Type-2 diabetes is not one but rather several conditions, each shaped by factors that affect vulnerability, resilience and different treatment responses. The diverse data, which include diet diaries and microbiome sample analyses, might help to pinpoint factors that indicate increased risk for Alzheimer’s disease. And the community of participants is positioned to be enrolled quickly in programs that can focus on their health.

‘All of Us’ aims to enroll one million people across the United States, collect health data and help speed up research related to ‘precision medicine,’ in which prevention and treatment “is no longer one size fits all but tailored to the individual,” said Francis Collins in his presentation. (Credit: V.Marx)

The project harkens back to 1948, when the Framingham Heart Study was launched. This longitudinal study started with 5,209 volunteers in the town of Framingham, Massachusetts, and has grown to 15,000 people. The study is devoted to factors that influence and shape cardiovascular disease. The project celebrated its 70th birthday last year. Researchers can apply to access the data here.

In 'All of Us,' over 230,000 people have agreed to take part and more than 142,000 have completed the enrollment protocol. For example, they have agreed to share their health records and donate samples. Nearly 80% of the participants thus far are from communities that have typically been underrepresented in research, and half of the participants are racial and ethnic minorities. In the past, these communities have been left out of research and therefore left behind when cures have been discovered, Collins said.

The UK Biobank plans to finish sequencing the exomes of 500,000 people by early next year. (Getty Images/iStockphoto/helovi)

UK Biobank

The UK Biobank is a project devoted to collecting health information on a large scale, and the team is releasing data in tranches. The sequence data released in March from 50,000 people were generated in the US at the Regeneron Genetics Center, in a collaboration among the UK Biobank and the companies Regeneron and GlaxoSmithKline (GSK).

The companies had a period of exclusive access to the data, which are now being released to the research community. Regeneron is also directing a consortium of pharmaceutical companies to finish sequencing the exomes of the other 450,000 Biobank participants by next year.

A total of 500,000 people in the UK between the ages of 40 and 69 have given consent to be part of the UK Biobank. The participants answered questions about their lifestyle and habits. They underwent eye and ear tests as well as an electrocardiogram. And they gave samples, such as blood, saliva and urine.

Since this baseline assessment, which took place between 2007 and 2010, the health of participants can be followed, given that these data are linked to individual medical records and national registries such as those that track cancer incidence. Independent of their healthcare in the UK’s National Health Service, some of these people will be assessed again for this study. Imaging of the participants' bodies and brains is ongoing.

As a resource, the UK Biobank is, as its access guidelines state, “available to all bona fide researchers for all types of health-related research that is in the public interest.” Scientists in the UK and around the world, whether they work in labs in universities, companies or non-profit research organizations, can request access to the data and the samples.

Datasets from such large-scale resources are the “future of medicine” in that they offer “wisdom from crowds,” my colleagues write in a Nature editorial here. Such data leverage large numbers of participants and can be used to tease out, for example, the role of genetics in health and disease. 

A Nature collection of the first papers from the UK Biobank can be found here.

Analyses of the UK Biobank’s data include a study of genome-wide data by Bycroft et al. Among the findings are loss-of-function genetic associations that are connected to various health conditions. In Elliott et al., the authors performed a genome-wide association study of brain phenotypes based on imaging of 10,000 individuals to better characterize variation across many individuals and to look at brain changes in terms of structure and vasculature to help with research into diseases such as Alzheimer’s. (Both studies were published in the same issue of Nature.)

As Clare Bycroft and her colleagues, the multi-center authors of the Bycroft et al. study, point out, the UK Biobank is “...a powerful example of the immense value that can be achieved from large population scale studies that combine genetics with extensive and deep phenotyping and linkage to health records coupled with a strong data sharing policy. It is likely to herald a new era in which these and related resources drive and enhance understanding of human biology and disease.”

Nancy Cox, a researcher at Vanderbilt University Medical Center, commented in her accompanying News & Views article that the work “is a vivid reminder that data generation is perhaps the least challenging aspect of big-data science.”

The researchers used arrays to determine nucleotide variation at more than 800,000 genomic sites, and then imputed variation at millions more sites. “But the scale of the data meant that both the design of this ‘genotyping’ and the subsequent quality-control analysis needed to be wholly reconceived from methods used for smaller studies. Moreover, much of the software used needed to be substantially revised to achieve reasonable computing times.”

For example, the authors recoded the software tool IMPUTE2 so they could impute genotypes faster. That led to IMPUTE4, which can be found here.
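
To give a feel for what imputation does, here is a deliberately simplified sketch in Python. It is not the method used in IMPUTE2 or IMPUTE4, which rely on statistical haplotype models; this toy version simply copies the untyped sites from whichever reference haplotype agrees best with the sites that were typed on the array. The reference panel, the sample and the function name `impute_nearest` are all made up for illustration.

```python
from typing import List, Optional

# Hypothetical reference panel: fully sequenced haplotypes,
# one allele (0 or 1) per genomic site.
REFERENCE_PANEL: List[List[int]] = [
    [0, 1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0],
]


def impute_nearest(sample: List[Optional[int]],
                   panel: List[List[int]]) -> List[int]:
    """Fill untyped sites (None) from the panel haplotype that agrees with
    the sample at the most typed sites. Ties go to the first such haplotype."""
    typed = [i for i, allele in enumerate(sample) if allele is not None]

    def matches(haplotype: List[int]) -> int:
        # Count typed sites where the haplotype carries the sample's allele.
        return sum(haplotype[i] == sample[i] for i in typed)

    best = max(panel, key=matches)
    # Keep observed alleles; borrow the rest from the best-matching haplotype.
    return [best[i] if allele is None else allele
            for i, allele in enumerate(sample)]


# The array typed only sites 0, 2 and 4; the other sites are unknown.
sample = [1, None, 0, None, 0, None]
print(impute_nearest(sample, REFERENCE_PANEL))
```

Real imputation tools weigh many reference haplotypes probabilistically and report uncertainty for each imputed genotype, which is part of why scaling them to hundreds of thousands of people takes so much engineering.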

Many studies are being published based on analyses of UK Biobank data. The data allow different types of genome-wide and phenome-wide research.

To add to the exome sequencing data, whole-genome sequencing of these 50,000 participants in the UK Biobank is underway in what is termed a ‘Vanguard’ phase. That will be followed by whole-genome sequencing of the remaining 450,000 participants.

This sequencing and the sequence analysis are taking place at the Wellcome Trust Sanger Institute and began in August 2018. The announced plan is to generate around 4.5 petabases of sequence in 18 months, that is, by early 2020.
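
As a rough back-of-the-envelope check, the announced target of about 4.5 petabases in 18 months can be turned into a monthly rate and a per-genome figure. The sketch below assumes that the target covers the 50,000 Vanguard genomes and that a human genome is about 3.1 gigabases long; both are assumptions made here for illustration, not figures from the announcement.

```python
PETA = 10**15
GIGA = 10**9

total_bases = 4.5 * PETA   # announced sequencing target
months = 18                # announced time frame
genomes = 50_000           # ASSUMPTION: target refers to the Vanguard cohort
genome_size = 3.1 * GIGA   # ASSUMPTION: approximate human genome length


def summarize(total_bases, months, genomes, genome_size):
    """Return (bases per month, bases per genome, average sequencing depth)."""
    per_month = total_bases / months
    per_genome = total_bases / genomes
    depth = per_genome / genome_size
    return per_month, per_genome, depth


per_month, per_genome, depth = summarize(total_bases, months, genomes, genome_size)
print(f"{per_month / PETA:.2f} petabases per month")
print(f"{per_genome / GIGA:.0f} gigabases per genome")
print(f"about {depth:.0f}-fold average depth")
```

Under these assumptions the plan works out to roughly a quarter of a petabase per month and about 90 gigabases per genome, or close to 30-fold average depth.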

The project will also involve developing, validating and using tools to explore genome structure, for example to characterize structural variation such as copy number variation, mobile element insertions and mitochondrial insertions, as well as telomere length and mitochondrial DNA. Another aim is to look at the role of rare sequence variants in complex traits.

The UK Biobank’s funders include UK Research and Innovation, the Wellcome Trust and Medical Research Council, Cancer Research UK, British Heart Foundation and Diabetes UK.


Data analysis power will come from tools. For example, the lab of Benjamin Neale at Massachusetts General Hospital develops tools to analyze large-scale genetic data. The lab has posted GWAS results from its analysis of phenotypes in the UK Biobank data.

Commercial services are also being launched for analyzing these large datasets. One such suite of cloud-based tools is Apollo from the company DNAnexus.

As a spokesperson for DNAnexus explains, the Apollo platform was set up in October 2018 to help labs identify drug targets and find biomarkers in a secure computational environment that lets teams work on projects jointly and share results.

The environment lets researchers analyze data with the company’s proprietary tools as well as open-source tools such as the Genome Analysis Toolkit (GATK) developed at the Broad Institute of Harvard and MIT and Google’s DeepVariant, a variant-calling analysis pipeline.

According to the company, Apollo can help scientists analyze the 50,000 exomes and phenotype records rapidly and share data. JupyterLab, an open-source interactive notebook environment for code and data, is built into this platform. “The platform enables biologists to play a role without the constant support of bioinformaticians,” says the spokesperson.

'All of Us' is personal

Eric Dishman, who directs the NIH ‘All of Us’ research program, was diagnosed with a rare form of kidney cancer at 19 years of age. At the time, doctors said he had nine months to live. He has lived longer than that. His treatment included radiation, chemotherapy and immunotherapy. He eventually met a scientist who proposed sequencing Dishman’s genome.

The sequence revealed that the drugs he was taking were less likely to help, said Dishman. His treatment was switched to a drug that proved more effective, and he became eligible for a kidney transplant. 

His first-hand experience with ‘precision medicine’ has shaped his motivation to leave his Silicon Valley role and join the NIH team. Dishman was a vice president in the health and life sciences group at Intel.

In his presentation at the 'All of Us' event, Dishman pointed out that IT-based advances such as online shopping have involved technology built in an innovation process of: “get it out there, get community feedback, make it better.” He is taking that approach to this project to engage people and to explain what the research is and how it might benefit people individually. “We do need the science to have the diversity of people behind it, so the diversity of cures can exist,” he said.

The participant portal has been improved based on community feedback, and the team plans to keep improving it and also the way information is given back to participants, he said. “Put something out, engage, get the feedback, iterate and improve.” The team touches base with the NIH institutes as they work out a scientific roadmap over the longer term.

The plan for the portal is to add ways for users to sort data according to race and ethnicity, also by sexual identity and gender. The team is working with communities on how best to do so. This interaction is about striking the balance between offering an open resource and protecting participants’ identity and data from “awful, stigmatizing, unethical research that often gets done with crude analysis of aggregate statistics,” said Dishman. “We need community feedback.” 

Comment from Ben Johnson (over 4 years ago):

Really interesting, Vivien! It's great to see that patient and public involvement, as well as diversity, are seen as key to success.