A scalable sparse neural network framework for rare cell type annotation of single-cell transcriptome data

ARTICLE | OPEN | https://doi.org/10.1038/s42003-023-04928-6

Yuqi Cheng¹,², Xingyu Fan³, Jianing Zhang¹ & Yu Li¹,⁴ ✉

Automatic cell type annotation methods are increasingly used in single-cell RNA sequencing (scRNA-seq) analysis because they are fast and precise. However, current methods often fail to account for the imbalance of scRNA-seq datasets and ignore information from smaller populations, leading to significant biological analysis errors. Here, we introduce scBalance, an integrated sparse neural network framework that incorporates adaptive weight sampling and dropout techniques for auto-annotation tasks. Using 20 scRNA-seq datasets with varying scales and degrees of imbalance, we demonstrate that scBalance outperforms current methods in both intra- and inter-dataset annotation tasks. Additionally, scBalance displays impressive scalability in identifying rare cell types in million-level datasets, as shown in the bronchoalveolar cell landscape. scBalance is also significantly faster than commonly used tools and comes in a user-friendly format, making it a superior tool for scRNA-seq analysis on the Python-based platform.

¹ Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong SAR, China.
² School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA.
³ School of Information and Software Engineering, University of Electronic Science and Technology of China, 610054 Chengdu, China.
⁴ The CUHK Shenzhen Research Institute, Hi-Tech Park, Nanshan, 518057 Shenzhen, China.

COMMUNICATIONS BIOLOGY | (2023) 6:545 | https://doi.org/10.1038/s42003-023-04928-6 | www.nature.com/commsbio

Since the first establishment of single-cell RNA sequencing (scRNA-seq) by Tang et al.
in 2009¹, this technology has rapidly become popular among scientists in various biological research fields. Compared with traditional bulk RNA sequencing, which measures only the average gene expression level of a sample, scRNA-seq profiles transcriptomes at the level of individual cells. It therefore enables the analysis of individual cells and gives more informative insight into cell heterogeneity. scRNA-seq has been widely used in several areas of biological research, such as cancer research²,³, COVID analysis⁴,⁵, and developmental biology⁶. In these studies, uncovering and identifying cellular populations is one of the most critical tasks.

Typically, cell-type annotation involves two steps: (1) clustering cells into different subgroups and (2) manually labeling each group with a specific type based on previously known marker genes. A number of unsupervised machine-learning algorithms have been developed, including classical machine-learning-based methods such as Seurat⁷ and Scanpy⁸ and newly published deep-learning-based methods such as scDHA⁹ and CLEAR¹⁰. However, these methods can be time-consuming and burdensome. For users with limited knowledge of the marker genes, this approach can take far more time than expected.

Automatic cell-type annotation methods, in contrast, do not require manual labeling. Unlike the unsupervised methods, automatic cell-type identification tools are mainly built on supervised learning frameworks. Because they are fast and precise, they are becoming the predominant tools for identifying cell types in single-cell experiments. With the unprecedented boom in well-annotated scRNA-seq atlases and the rapid progress of the Human Cell Atlas project¹¹,¹², auto-annotation tools have broader prospects than ever before. To date, 32 auto-annotation tools have been developed and published¹³.
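
To make the supervised setting described above concrete, the following is a minimal sketch of reference-based annotation: a classifier is fitted on labeled reference cells and then assigns a label to each unlabeled query cell. The nearest-centroid rule, the function names `fit_centroids` and `annotate`, and the three-gene toy expression values are all assumptions chosen for illustration; they are not the method of scBalance or of any tool cited here.

```python
from collections import defaultdict
from math import dist


def fit_centroids(reference_cells, reference_labels):
    """Average the expression vectors of each labeled cell type."""
    grouped = defaultdict(list)
    for cell, label in zip(reference_cells, reference_labels):
        grouped[label].append(cell)
    # zip(*cells) iterates gene by gene across all cells of one type.
    return {
        label: [sum(gene) / len(cells) for gene in zip(*cells)]
        for label, cells in grouped.items()
    }


def annotate(query_cells, centroids):
    """Give each query cell the label of its nearest centroid (Euclidean)."""
    return [
        min(centroids, key=lambda label: dist(cell, centroids[label]))
        for cell in query_cells
    ]


# Hypothetical reference: expression of 3 genes in 4 labeled cells.
reference_cells = [
    [9.0, 1.0, 0.5],
    [8.0, 0.5, 1.0],
    [0.5, 7.0, 6.0],
    [1.0, 8.0, 7.0],
]
reference_labels = ["T cell", "T cell", "B cell", "B cell"]

centroids = fit_centroids(reference_cells, reference_labels)
predicted = annotate([[8.5, 1.0, 1.0], [0.0, 6.5, 7.5]], centroids)
print(predicted)
```

Unlike the cluster-then-label workflow, no marker-gene knowledge is needed at query time; the labels come entirely from the annotated reference. Published tools replace the centroid rule with stronger classifiers.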
For example, SingleCellNet¹⁴ uses a random-forest classifier for cross-platform and cross-species annotation tasks, and ACTINN¹⁵ implements a simple artificial neural network to overcome the batch effect. Although numerous tools have been established in recent years, most of them often fail to identify the entire population because of the existence of rare cell types. In terms of cell composition, scRNA-seq datasets are always imbalanced: they contain both common and rare cell types. A rare population makes up only a small proportion of the cells in a single-cell dataset. For example, dendritic cells usually account for 1–5% of peripheral blood mononuclear cells (PBMCs), especially in large datasets¹⁶,¹⁷. When an auto-annotation tool is trained on such data, the classifier consistently fails to learn the features of these rare types and therefore struggles to identify them in the query dataset. However, these rare populations can be crucial, especially in disease research¹⁸. Recently, some cluster detection methods have addressed this point¹⁹,²⁰, but few classification methods have focused on cell population imbalance.

Meanwhile, we also find that the existing methods have two other main deficiencies. (1) Lack of scalability. Recent scRNA-seq experimental platforms enable investigations of millions of cells²¹,²². Notably, one of the most recent COVID PBMC atlases has reached 1.5 million cells¹⁷. Limited computation speed therefore makes auto-annotation packages scale poorly to million-cell datasets. Moreover, large-scale reference datasets make it harder to learn rare cell types during classifier training, which makes it more difficult for current software to identify minor groups. The most recently published paper has raised the training scale to 600 K cells²³; however, no published tool has reported scalability on a million-cell atlas. (2) Compatibility of the existing tools is not as good as expected.
Among the existing Python-based tools, most, such as ACTINN¹⁵, scPretrain²⁴, scCapNet²⁵, and MarkerCount²⁶, are script-based. Considering that Seurat and Scanpy are both packages that can be installed from a standard software repository (e.g., CRAN or PyPI), running an external Python script on the server places an additional burden on the user. In addition, some of the tools are no longer maintained or can no longer be used. Together, these challenges call for a new annotation tool that labels both major and minor cell types well and in a scalable manner.

Here, we introduce scBalance, a sparse neural network framework that can automatically label rare cell types in scRNA-seq datasets of all scales. scBalance combines weight sampling with a sparse neural network, so that minor (rare) cell types become more informative without harming the annotation efficiency of the common (major) cell populations. We evaluated scBalance on real datasets with varying degrees of cell population imbalance and scale on both intra- and inter-dataset (...truncated)