Computational Integration of Heterogeneous Single-Cell Datasets

Step-by-step instructions for using the LIGER software package to integrate single-cell RNA-seq and ATAC-seq datasets.

This protocol paper began with a conversation at the inaugural Next Generation Genomics conference in 2019, an amazing meeting organized by Rahul Satija and several other young genomics researchers. The conference featured breakout sessions in which researchers, funding agency program officers, and journal editors talked to each other in small groups. During one of these breakout sessions, the conversation turned to strategies for effective dissemination of computational approaches. Ivanka Kamenova, an editor from Nature Protocols, mentioned that the journal publishes a growing number of computational protocols and suggested that this is a great way to promote adoption of new computational methods. We followed up after the conference, and this paper was the eventual result.

Since the publication of our LIGER algorithm in Cell in 2019, we have received many emails, as well as questions on our GitHub page. Users often asked for clarification about specific functions in our package, or requested step-by-step instructions for performing a typical single-cell data analysis using LIGER. Meanwhile, we worked closely with several biological collaborators who were interested in using the software to answer specific biological questions across a range of systems, including differentiating bone marrow stem cells, tumor cells, and neurons. As we continued to answer questions from our users and collaborate with other labs, we realized the importance of developing a comprehensive tutorial that gives researchers on the biological side a quick start and helps them understand the strategy underlying our method.

When we first published LIGER, single-cell ATAC-seq was a new protocol, and very few datasets were publicly available, so we did not analyze any ATAC-seq data in the original paper. Shortly after publication, however, the 10X Genomics company released a commercial kit for performing the experiment. This kit was rapidly adopted, and several of our collaborators used the 10X Genomics platform to generate ATAC-seq data. Thus, a key focus in writing our protocol paper was working out the details of how to integrate single-cell RNA-seq and single-nucleus ATAC-seq data.

One critical question we had to address was, “How can we make the data from different modalities comparable?” In our case, we needed to transform the snATAC-seq data, a genome-wide epigenomic measurement, into gene-level counts that are comparable to gene expression data from scRNA-seq. The traditional analysis of bulk ATAC-seq data centers on hotspots of frequent Tn5 transposase insertion, called chromatin accessibility peaks (Figure 1, top panel). The default analysis pipeline provided with the commercial 10X Genomics platform employs this strategy by aggregating all single-cell profiles into a single bulk profile and then performing peak calling. However, unlike traditional bulk ATAC-seq data, snATAC-seq data are very sparse (about 90% of the data are zeros), meaning that most individual regulatory elements are not detected in any given cell. In addition, there are biologically important differences among cell subtypes that could be masked by calling peaks using all cells. Thus, we explored several additional strategies for summarizing the chromatin state of each cell. Instead of calling peaks and summing the ones that overlap each gene, as the bulk ATAC-seq analysis does, we decided on a very simple strategy: counting the total number of ATAC-seq reads within the gene body and promoter region of each gene in each cell (Figure 1, bottom panel). We found that this simple approach works quite well and provides two key advantages. First, it uses all reads that fall within these regions, rather than discarding reads that do not occur in a defined accessibility peak. Second, it does not rely on peaks called from the aggregate signal of all cells, which could bias against accessibility signals from rare cell types.

Figure 1: Comparison of two read-counting strategies, based on either peak locations or gene body and promoter locations. The strategy shown in the bottom panel counts reads that overlap any gene body or promoter region, rather than only reads falling within accessibility peaks.
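
The gene-centric counting strategy described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in the LIGER package: the function names, the input layout (fragments as tuples of chromosome, start, end, and cell barcode), and the 3 kb promoter width are all assumptions made for the example.

```python
from collections import defaultdict

PROMOTER_BP = 3000  # assumed promoter width upstream of the TSS; illustrative only


def gene_window(start, end, strand, promoter_bp=PROMOTER_BP):
    """Return (start, end) of the gene body extended by its upstream promoter."""
    if strand == "+":
        return max(0, start - promoter_bp), end
    # On the minus strand, "upstream" lies beyond the larger coordinate.
    return start, end + promoter_bp


def count_reads_per_gene(fragments, genes, promoter_bp=PROMOTER_BP):
    """Count, per gene and per cell, the fragments overlapping gene body + promoter.

    fragments: iterable of (chrom, start, end, cell_barcode), half-open intervals
    genes:     dict mapping gene name -> (chrom, start, end, strand)
    returns:   counts[gene][barcode] -> number of overlapping fragments
    """
    windows = {
        name: (chrom, *gene_window(start, end, strand, promoter_bp))
        for name, (chrom, start, end, strand) in genes.items()
    }
    counts = defaultdict(lambda: defaultdict(int))
    for chrom, f_start, f_end, barcode in fragments:
        for name, (g_chrom, w_start, w_end) in windows.items():
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == g_chrom and f_start < w_end and f_end > w_start:
                counts[name][barcode] += 1
    return counts


# Toy data (hypothetical genes and fragments).
demo_genes = {
    "GeneA": ("chr1", 10000, 12000, "+"),
    "GeneB": ("chr1", 50000, 52000, "-"),
}
demo_fragments = [
    ("chr1", 7500, 7700, "cell1"),    # GeneA promoter
    ("chr1", 11000, 11200, "cell1"),  # GeneA gene body
    ("chr1", 54000, 54100, "cell2"),  # GeneB promoter (minus strand)
    ("chr1", 30000, 30100, "cell2"),  # intergenic, not counted
    ("chr2", 11000, 11100, "cell1"),  # different chromosome, not counted
]
demo_counts = count_reads_per_gene(demo_fragments, demo_genes)
```

No peak calling is involved: every fragment that touches a gene window contributes to that gene's count for its cell. The nested loop is written for clarity; a real dataset would call for an interval index or a dedicated genomics tool.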

To fully utilize the rich and complementary information from different modalities, we also developed tools for cross-modality differential expression and imputation analyses. To identify differentially expressed genes or differentially accessible peaks between datasets or joint clusters, we incorporated an efficient implementation of the Wilcoxon rank-sum test. We also incorporated a KNN-based imputation analysis that can infer pseudo-multi-omic profiles, which allows us to perform downstream analyses as if we had paired ATAC-seq and RNA-seq from the same cells.
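
To make the idea of KNN-based imputation concrete, here is a minimal Python sketch. It assumes that cells from both modalities have already been placed in a shared low-dimensional space (the "factors" below), and that a query cell's missing modality is imputed as the plain average over its k nearest reference cells. The function name, the Euclidean distance, and the unweighted average are illustrative assumptions, not a description of the LIGER implementation.

```python
import math


def knn_impute(query_factors, ref_factors, ref_values, k=2):
    """Impute values for query cells from their k nearest reference cells.

    query_factors: shared-space coordinates of the cells to impute (e.g. ATAC cells)
    ref_factors:   shared-space coordinates of the reference cells (e.g. RNA cells)
    ref_values:    measured profiles of the reference cells, one row per cell
    returns:       one imputed profile per query cell
    """
    n_features = len(ref_values[0])
    imputed = []
    for q in query_factors:
        # Rank reference cells by Euclidean distance in the shared space.
        ranked = sorted((math.dist(q, r), i) for i, r in enumerate(ref_factors))
        nearest = [i for _, i in ranked[:k]]
        # Average the neighbors' measured profiles feature by feature.
        profile = [
            sum(ref_values[i][j] for i in nearest) / len(nearest)
            for j in range(n_features)
        ]
        imputed.append(profile)
    return imputed


# Toy data: two well-separated groups of reference cells, two query cells.
demo_ref_factors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
demo_ref_expression = [[10.0, 0.0], [12.0, 0.0], [0.0, 8.0], [0.0, 6.0]]
demo_query_factors = [[0.05, 0.0], [5.0, 5.1]]
demo_imputed = knn_impute(demo_query_factors, demo_ref_factors, demo_ref_expression, k=2)
```

Each query cell ends up with a profile in the modality it was not measured in, which is what makes downstream analyses on "paired" data possible.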

Single-cell measurements of chromatin accessibility also provide an unprecedented opportunity to investigate epigenetic regulation of gene expression. Therefore, following the imputation analysis, we evaluated the relationships between pairs of genes and peaks, linking genes to putative regulatory elements. We thought it would be great if users could visualize these linkages in a straightforward and informative way, so we developed functions to display predicted relationships between peaks and genes in the UCSC Genome Browser.
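
One simple way to picture gene-peak linking is to correlate a gene's expression with the accessibility of nearby peaks across cells that have both profiles (measured or imputed). The sketch below assumes Pearson correlation, a 500 kb distance window around the transcription start site, and a correlation cutoff of 0.5. All three choices, as well as the function names, are illustrative assumptions rather than the settings used in our package.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def link_genes_to_peaks(gene_expr, gene_tss, peak_acc, peak_pos,
                        max_dist=500_000, min_corr=0.5):
    """Return (gene, peak, correlation) for nearby, sufficiently correlated pairs.

    gene_expr: gene -> expression values across cells
    gene_tss:  gene -> (chrom, TSS position)
    peak_acc:  peak -> accessibility values across the same cells
    peak_pos:  peak -> (chrom, peak midpoint)
    """
    links = []
    for gene, expr in gene_expr.items():
        g_chrom, tss = gene_tss[gene]
        for peak, acc in peak_acc.items():
            p_chrom, midpoint = peak_pos[peak]
            # Only consider peaks on the same chromosome within the window.
            if p_chrom != g_chrom or abs(midpoint - tss) > max_dist:
                continue
            r = pearson(expr, acc)
            if abs(r) >= min_corr:
                links.append((gene, peak, round(r, 3)))
    return links


# Toy data (hypothetical gene and peaks, four cells).
demo_expr = {"GeneA": [1.0, 2.0, 3.0, 4.0]}
demo_tss = {"GeneA": ("chr1", 100_000)}
demo_acc = {
    "peak1": [2.0, 4.0, 6.0, 8.0],  # nearby and strongly correlated
    "peak2": [1.0, 0.0, 1.0, 0.0],  # nearby but weakly correlated
    "peak3": [2.0, 4.0, 6.0, 8.0],  # correlated but too far away
}
demo_pos = {
    "peak1": ("chr1", 150_000),
    "peak2": ("chr1", 90_000),
    "peak3": ("chr1", 2_000_000),
}
demo_links = link_genes_to_peaks(demo_expr, demo_tss, demo_acc, demo_pos)
```

Only the pair that is both nearby and strongly correlated survives the filters. A list of links like this is the kind of information that can then be exported for display in a genome browser.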

In addition to developing strategies for integrating RNA and ATAC, we also made numerous changes to the liger R package to improve the user experience and add new functionality. For instance, we wrote new functions to process and load ATAC-seq data. To increase the computational efficiency of our code, we incorporated C++ implementations of clustering and differential expression functions. We also worked with the developers of the Seurat package to facilitate easy interoperation between our packages (see the SeuratWrappers package; many thanks to the Seurat team for their help!).

High-throughput single-cell sequencing technologies are developing rapidly and producing exponentially growing experimental data, with cell numbers now reaching into the millions. In the future, wouldn’t it be exciting if we could perform integrated analysis in an iterative manner that is more efficient and able to handle continually arriving data? We are developing an “online learning” algorithm that scales to such large datasets and will be adding it to the LIGER package soon. There are several other questions like this that one can still ask, and we plan to continue to improve LIGER over time, making it powerful, efficient, and flexible while still keeping the tool simple and user-friendly.
