Gene expression regulation is essential for cells to fulfill their physiological and pathological roles. Systematically understanding gene regulation mechanisms through the lens of informatics has long been a central theme in our lab.
Recent advances in single-cell technologies provide valuable tools for probing cellular gene regulatory circuits at unprecedented resolution. A variety of omics modalities can now be profiled in single cells, including the transcriptome, chromatin accessibility, and DNA methylation, each of which portrays one specific aspect of the regulatory circuits. While experimental protocols capable of simultaneous multimodal profiling are under active development, most widely used single-cell technologies detect only one modality per cell, producing unpaired multi-omics data that require computational integration to obtain a more holistic view.
We started looking into the unpaired multi-omics integration problem back in 2019, after concluding a project on transcriptomics data integration [1]. Multi-omics integration was particularly interesting to us because (1) joint analysis of multiple omics types offers more insight into gene regulation, and (2) such integration requires overcoming the discrepancy between omics-specific feature spaces, which poses a significant computational challenge.
A number of computational methods had already been developed for this purpose. A recurrent solution was to first “project” different feature spaces into a common one, usually the gene space, via a set of handcrafted conversion rules. For example, chromatin accessibility data is often converted to gene activity by adding up read counts in open chromatin “peaks” that overlap gene bodies and promoters. During our own analyses, however, we noticed that the inferred “gene activity scores” typically show substantially lower power in discerning cell types and states than the original “peak” space, an observation also confirmed by a benchmarking paper around that time [2]. This is partly because such conversion hardly retains the original data variance faithfully.
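The handcrafted conversion described above can be illustrated with a minimal sketch. It is not the rule used by any particular tool: the function name `gene_activity`, the 2 kb promoter window, and the toy coordinates are all assumptions made for illustration.

```python
import numpy as np

def gene_activity(counts, peaks, genes, promoter_len=2000):
    """Sum peak counts over each gene body plus an upstream promoter window.

    counts: (n_cells, n_peaks) integer array of read counts per peak
    peaks:  list of (chrom, start, end), half-open intervals
    genes:  list of (chrom, start, end, strand)
    promoter_len: assumed promoter size in bp (a hypothetical default)
    """
    overlap = np.zeros((len(peaks), len(genes)), dtype=counts.dtype)
    for j, (g_chrom, g_start, g_end, strand) in enumerate(genes):
        # Extend the gene body upstream of the transcription start site.
        if strand == "+":
            lo, hi = max(0, g_start - promoter_len), g_end
        else:
            lo, hi = g_start, g_end + promoter_len
        for i, (p_chrom, p_start, p_end) in enumerate(peaks):
            if p_chrom == g_chrom and p_start < hi and p_end > lo:
                overlap[i, j] = 1
    # Each gene's score is the sum of counts in its overlapping peaks.
    return counts @ overlap

# Toy data: 2 cells, 4 peaks, 2 genes (all coordinates are made up).
peaks = [("chr1", 100, 300), ("chr1", 5000, 5200),
         ("chr1", 9000, 9100), ("chr2", 100, 300)]
genes = [("chr1", 1000, 6000, "+"), ("chr1", 7000, 8000, "-")]
counts = np.array([[1, 2, 3, 4],
                   [0, 1, 0, 5]])

activity = gene_activity(counts, peaks, genes)
print(activity.tolist())  # [[3, 3], [1, 0]]
```

In this toy example the peak on chr2 overlaps no gene, so its counts vanish from the gene space entirely, and the two peaks assigned to the first gene become indistinguishable after summation. Both effects hint at why such a projection can lose part of the variance present in the peak space.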
Parallel autoencoders were an obvious multimodal extension of the autoencoder models that we and others had been working on for single-cell transcriptomics [1, 3, 4]. They enable high-resolution approximation when projecting different modalities into low-dimensional latent spaces. However, ensuring a consistent “semantic meaning” across multiple omics-specific latent spaces is challenging in itself. We therefore formulated the linking problem as a form of prior knowledge-based model regularization. Such “regularization” can take various forms. Our initial trials included regularizing the Jacobian matrix of the non-linear neural networks, which turned out to be impractical because of the prohibitive computational cost. After many more rounds of trial and error, we arrived at the current graph-linking strategy, which we termed “GLUE”:
- The prior knowledge is modeled as a guidance graph, and a graph autoencoder is employed to learn feature embeddings from the guidance graph.
- These feature embeddings in turn serve as linear decoder weights for different omics modalities, essentially coupling prior regulatory interactions with observed data correlations.
- Finally, adversarial learning between the encoders and a modality discriminator is applied to ensure proper inter-modality alignment.
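The three components listed above can be sketched as a single untrained forward pass. This is a deliberately simplified illustration rather than the published implementation: the toy dimensions, the one-layer graph convolution, the plain inner-product decoders, and the logistic-regression discriminator are all assumptions, and every variable name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3 genes (RNA features) and 4 peaks (ATAC features).
n_genes, n_peaks, latent_dim = 3, 4, 2
n_features = n_genes + n_peaks

# Guidance graph: vertices are features, edges encode prior gene-peak links.
edges = [(0, 3), (0, 4), (1, 5), (2, 6)]  # (gene vertex, peak vertex)
adj = np.eye(n_features)                  # self-loops
for g, p in edges:
    adj[g, p] = adj[p, g] = 1.0

def normalize(a):
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    d = a.sum(axis=1)
    return a / np.sqrt(np.outer(d, d))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 1. Graph autoencoder: one graph-convolution layer yields feature embeddings,
#    and an inner-product decoder reconstructs edge probabilities.
w_graph = rng.normal(size=(n_features, latent_dim))
v = normalize(adj) @ w_graph              # (n_features, latent_dim)
adj_hat = sigmoid(v @ v.T)

# 2. Non-linear encoders per modality; linear decoders whose weights are
#    the feature embeddings learned from the guidance graph.
def encode(x, w1, w2):
    return np.tanh(x @ w1) @ w2

x_rna = rng.poisson(2.0, size=(5, n_genes)).astype(float)   # 5 RNA cells
x_atac = rng.poisson(1.0, size=(6, n_peaks)).astype(float)  # 6 ATAC cells
enc_rna = (rng.normal(size=(n_genes, 8)), rng.normal(size=(8, latent_dim)))
enc_atac = (rng.normal(size=(n_peaks, 8)), rng.normal(size=(8, latent_dim)))

u_rna = encode(x_rna, *enc_rna)           # cell embeddings, shared latent space
u_atac = encode(x_atac, *enc_atac)
x_rna_hat = u_rna @ v[:n_genes].T         # decode with gene embeddings
x_atac_hat = u_atac @ v[n_genes:].T       # decode with peak embeddings

# 3. Modality discriminator: predicts which modality an embedding came from.
#    Encoders would be trained to increase this loss, the discriminator to
#    decrease it.
w_disc = rng.normal(size=latent_dim)
u_all = np.vstack([u_rna, u_atac])
labels = np.concatenate([np.zeros(len(u_rna)), np.ones(len(u_atac))])
logits = u_all @ w_disc
# Numerically stable binary cross-entropy: softplus(z) - y * z.
disc_loss = np.mean(np.logaddexp(0.0, logits) - labels * logits)

print(v.shape, x_rna_hat.shape, x_atac_hat.shape, round(float(disc_loss), 3))
```

Because the gene and peak embeddings in `v` come from the same graph encoder, a gene and a peak connected in the guidance graph are pushed toward similar decoder weights, which is what couples the prior regulatory interactions with the correlations observed in the data.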
In essence, GLUE follows a task-oriented model design that combines the best of several seemingly distinct worlds in AI:
- Combines knowledge-oriented and data-oriented components in a mutually informative manner. Prior regulatory knowledge encoded in the guidance graph helps orient modalities in the embedding space, while correlations in the aligned multi-omics data provide feedback for integrative regulatory inference via a quasi-Bayesian approach.
- Bridges non-linear and linear components to exploit the strength of each in different areas. We implemented the model with an asymmetric encoder-decoder architecture, where the non-linear data encoders are able to fully utilize the capacity of neural networks to properly align modalities, while the linear data decoders can confer interpretability to the learned embedding spaces.
- Keeps things modular. Improvements in knowledge graph modeling, omics data modeling, and domain adaptation can be readily incorporated into the corresponding “modules” of the overall model, further improving integration quality.
Last but not least, we are excited to watch emerging single-cell multimodal profiling technologies such as SHARE-seq and 10x Multiome, as well as the information-rich data they generate. By combining cutting-edge omics technologies with computational methodologies, we are approaching a truly systematic understanding of gene regulatory circuits in heterogeneous cellular systems.
- Full text: https://www.nature.com/articles/s41587-022-01284-4
- Source code of GLUE: https://github.com/gao-lab/GLUE
1. Cao, Z.-J., Wei, L., Lu, S., Yang, D.-C. & Gao, G. Searching large-scale scRNA-seq databases via unbiased cell embedding with Cell BLAST. Nat. Commun. 11, 3458 (2020).
2. Chen, H. et al. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol. 20, 241 (2019).
3. Lopez, R., Regier, J., Cole, M.B., Jordan, M.I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
4. Eraslan, G., Simon, L.M., Mircea, M., Mueller, N.S. & Theis, F.J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).