Not only has Genome Reference Consortium build 38 (GRCh38) eliminated some pesky previous gaps, it will be the first human reference assembly to have sequence information for centromeres. Up until now, centromeres, which are specialized structural components of chromosomes, have been represented in the reference by gaps of 3 million base pairs. The news about centromere sequence will be of interest to cell biologists and genomics researchers alike.
“This will be a major boon to evolutionary studies of human populations and to the many groups doing mechanistic work on human centromeres and kinetochores,” says Stanford University researcher Aaron Straight, whose work focuses on cell division and chromosome segregation. “Finally, now we can stop saying ‘mind the gap’.”
The reference genome finishers are the members of the Genome Reference Consortium (GRC) at the European Bioinformatics Institute, the US National Center for Biotechnology Information, The Wellcome Trust Sanger Institute and The Genome Institute at Washington University.
Scientists may not have physically camped like concert-goers in front of the buildings where genome finishers scurry to get the sequence out the door. But the throngs have been virtually present. The GRC, which works on human, mouse and zebrafish reference genomes, is “having to field a lot of questions from folks who want to know the minute they can have the assembly,” says Deanna Church, a genomicist formerly at the US National Center for Biotechnology Information who has, since this interview, moved to Personalis, a genetic testing and analysis company.
The din has faded from the 2001 celebration marking the end of the Human Genome Project. But the sequence was not complete then, nor is it complete now. As colleagues at Nature Methods have pointed out, the sequence originally had around 150,000 gaps.
The most recent reference genome, Genome Reference Consortium build 37 (GRCh37), has 357 gaps and is missing sequence around the centromeres. No longer.
Come here, centromere
The structure and repetitive nature of centromeric regions have made them largely inaccessible to methods used to create the reference assembly, says Church. The concept and the methods to produce the centromere sequences for this reference build were developed by a research team at the University of California, Santa Cruz (UCSC). They constructed sequences using the Sanger technique and the data helped the team behind GRCh38 to fill in these important gaps.
In a paper, the UCSC team, led by Karen Miga and Jim Kent, a member of the GRC’s scientific advisory board, noted that centromeric regions are replete with near-identical tandem repeats, known as satellite DNA. The difficulty of assembling these regions has frequently led to their exclusion from genomic studies. In the new reference genome, the scientists used reads generated during the Venter genome assembly and created models for the centromeres, says Church.
“These models don’t exactly represent the centromere sequences in the Venter assembly, but they are a good approximation of the ‘average’ centromere in this genome,” she says. And these sequence models are not exact representations of any one centromere, either. But including these sequences in the reference assembly “will likely improve genome analysis using current methods, and allow for some further study of population variation in centromere sequences,” says Church.
Yes, gaps remain
The new human reference genome still has gaps that the GRC is working to close, says Valerie Schneider, a staff scientist at the National Center for Biotechnology Information. As ghastly as it may sound, assembly improvements can also add gaps. For example, new sequence may be added into a gap. But if it does not bridge the gap, the added sequence is an improvement that turns one gap into two, she says. “Likewise, the identification of tiling path errors or intra-component deletions may result in the introduction of gaps into what was previously contiguous sequence.”
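The gap arithmetic Schneider describes can be sketched in a few lines of Python. This is only an illustration, not how the GRC tracks its assembly: the function name `place_sequence` and the coordinates are hypothetical, and gaps are modeled simply as half-open intervals.

```python
def place_sequence(gap, seq_start, seq_end):
    """Return the gaps that remain after new sequence is placed inside a gap.

    `gap` is a (start, end) half-open interval in made-up assembly
    coordinates; the new sequence covers [seq_start, seq_end).
    """
    gap_start, gap_end = gap
    if not (gap_start <= seq_start < seq_end <= gap_end):
        raise ValueError("sequence must lie within the gap")

    remaining = []
    if seq_start > gap_start:
        # Uncovered bases remain to the left of the new sequence.
        remaining.append((gap_start, seq_start))
    if seq_end < gap_end:
        # Uncovered bases remain to the right of the new sequence.
        remaining.append((seq_end, gap_end))
    return remaining


# New sequence that bridges the gap closes it entirely.
bridged = place_sequence((100, 200), 100, 200)   # []

# New sequence that sits in the middle leaves a gap on each side.
split = place_sequence((100, 200), 130, 160)     # [(100, 130), (160, 200)]
```

In the second call the assembly has gained 30 bases of sequence, yet the gap count has risen from one to two, which is the situation Schneider describes.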
Structural variation, often associated with segmental duplication, can cause gaps in the human reference assembly. “When components representing two structurally variant haplotypes are both included in the reference assembly, their differences may introduce a ‘de facto’ gap,” she says.
The new reference has done away with several of these types of gaps, including one on chromosome 10 associated with the mannose receptor C-type 1 (MRC1) locus and one on chromosome 17 associated with the chemokine (C-C motif) ligand 3-like 1 and ligand 4-like 1 (CCL3L1/CCL4L1) genes. Single haplotype resources have helped with these efforts, she says.
Of the 357 gaps in GRCh37, more than 230 were deemed “recalcitrant to subcloning,” as Schneider puts it, which is a polite way to say they were such pains in the rear that they were not amenable to amplification in bacteria. These gaps are also often associated with segmental duplications or repeat structures. Researchers whittle away at these gaps by using publicly available whole genome sequencing data, Schneider says.
Of the existing gaps in build 37, 271 are ‘spanned’, which means that there is some evidence that a clone covers the gap, but there is no sequence for that clone, says Church. The other 86 gaps are not spanned: there is no evidence that any known clone crosses them.
Because genomes sequenced with second-generation technologies typically do not rely on cloning vectors, sequences that extend into or span these gaps may be present, says Schneider. There have been additions from whole-genome shotgun sequencing at nearly 100 of GRCh37’s assembly gaps.
The reference assembly has also contained gaps that represent biological structures such as telomeres, heterochromatin, the short arms of the acrocentric chromosomes and centromeres. While the telomeres continue to be represented by default 10-kilobase gaps, GRCh38 includes new sequences on the short arms of chromosomes 21 and 22, in addition to model sequence representation for centromeres and some heterochromatin.
Happy fans, happy curators
As the finishers give the new reference genome its last sheen, they sense its release will make their fan base happy. Too true: finishing work is not considered hot, nor does it lead to a high volume of scientific publications, but genome finishing has “undeniable” impact, says Schneider.
Its impact is what gets GRC curators charged about their work. Clinical tests may be improved or analyses rendered more robust all because a gap was closed or sequencing errors corrected, she says. Of course, the pressure is on to “get things right,” since assembly errors can be far-reaching.
Although it may sound unlikely, patching the reference sequence is fun. “One of my favorite assembly updates is actually one that the GRC released as the first fix patch to GRCh37, and involves the ABO locus,” says Schneider. In GRCh37, the ABO blood group locus was split over two assembly components. Both components represented a type O allele, but not the same type O allele, which resulted in an ABO sequence not found in any known human.
The GRC replaced the assembly components so that ABO is represented by a single component and is a valid type A allele. “There are also several genomic regions that were misassembled in GRCh37 that have been massively retiled that I am particularly excited to see in GRCh38,” she says. These include the pericentromeric region of chromosome 9, 10q11, and the Slit-Robo GTPase-activating protein genes (SRGAP) region at 1q21.
All of these path corrections will improve the assembly and the analyses that depend on it. But these latest updates “were among the most challenging” thus far, she says. They required collaborations between the GRC and various researchers who provided access to data and techniques ahead of publication. This activity shows how both the research and the clinical communities help to improve this shared resource, Schneider says. It is also a personal achievement for the curators, because “knowing that your work will have a wide impact, is a big motivator.”