A scalable sparse neural network framework for rare cell type annotation of single-cell transcriptome data
ARTICLE
https://doi.org/10.1038/s42003-023-04928-6
OPEN
Yuqi Cheng1,2, Xingyu Fan3, Jianing Zhang1 & Yu Li1,4 ✉
Automatic cell type annotation methods are increasingly used in single-cell RNA sequencing
(scRNA-seq) analysis because they are fast and precise. However, current methods
often fail to account for the imbalance of scRNA-seq datasets and ignore information from
smaller populations, leading to significant biological analysis errors. Here, we introduce
scBalance, an integrated sparse neural network framework that incorporates adaptive weight
sampling and dropout techniques for auto-annotation tasks. Using 20 scRNA-seq datasets
with varying scales and degrees of imbalance, we demonstrate that scBalance outperforms
current methods in both intra- and inter-dataset annotation tasks. Additionally, scBalance
displays impressive scalability in identifying rare cell types in million-level datasets, as shown
in the bronchoalveolar cell landscape. scBalance is also significantly faster than commonly
used tools and comes in a user-friendly format, making it a superior tool for scRNA-seq
analysis on the Python-based platform.
1 Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong SAR, China. 2 School of Computational
Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA. 3 School of Information and Software Engineering, University of Electronic
Science and Technology of China, 610054 Chengdu, China. 4 The CUHK Shenzhen Research Institute, Hi-Tech Park, Nanshan, 518057 Shenzhen, China.
✉email:
COMMUNICATIONS BIOLOGY | (2023)6:545 | https://doi.org/10.1038/s42003-023-04928-6 | www.nature.com/commsbio
Since the first establishment of single-cell RNA sequencing
(scRNA-seq) by Tang et al.1 in 2009, this technology has
rapidly become popular among scientists in various biological research fields. Compared with traditional bulk RNA
sequencing which only measures the average gene expression
level of the samples, scRNA-seq provides a powerful method to
profile transcriptomes at the single-cell level. It therefore
enables the analysis of individual cells and gives more informative insight into cell heterogeneity.
scRNA-seq technology has been widely used in several biological
research areas, such as cancer research2,3, COVID analysis4,5,
and developmental biology research6. In these studies, uncovering
and identifying cellular populations is one of the most critical
tasks.
Typically, cell-type annotation involves two steps: (1) clustering cells into different subgroups and (2) labeling each group with
a specific type manually based on previously known marker genes.
A number of unsupervised machine-learning algorithms have
been developed, including classical machine-learning-based
methods such as Seurat7 and Scanpy8, and newly published
deep learning-based methods, such as scDHA9 and CLEAR10.
However, this two-step approach can be time-consuming and burdensome. For users with limited knowledge of the
marker genes, it can take far longer than
expected. Automatic cell-type annotation methods, in contrast,
avoid the manual labeling process. Unlike
the unsupervised methods, automatic cell-type identification tools
are mainly built on supervised learning frameworks.
Because they are fast and precise, they are
becoming the predominant tools for identifying cell types in single-cell
experiments. With the unprecedented boom in well-annotated scRNA-seq atlases and the rapid advancement of the
Human Cell Atlas project11,12, auto-annotation tools have
broader prospects than ever before. To date, 32 auto-annotation tools have been developed and published13. For example,
SingleCellNet14 utilizes a random-forest classifier to solve the
cross-platform and cross-species annotation tasks. ACTINN15
implements a simple artificial neural network to overcome the
batch effect.
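The manual labeling step (2) described above can be sketched in a few lines. This is only an illustration, not the procedure of Seurat, Scanpy, or any other cited tool: the function name `label_cluster`, the marker table, and the per-cluster expression values are all hypothetical, and the output of the clustering step (1) is assumed to be already available as mean expression per cluster.

```python
from statistics import mean

# Hypothetical marker table (illustrative only): cell type -> marker genes.
MARKERS = {
    "T cell": ["CD3D", "CD3E"],
    "B cell": ["MS4A1", "CD79A"],
}

def label_cluster(cluster_mean_expr, markers=MARKERS):
    """Return the cell type whose marker genes have the highest mean expression."""
    scores = {
        cell_type: mean(cluster_mean_expr.get(gene, 0.0) for gene in genes)
        for cell_type, genes in markers.items()
    }
    return max(scores, key=scores.get)

# Assumed output of step (1): mean expression per cluster (made-up values).
clusters = {
    "cluster_0": {"CD3D": 2.1, "CD3E": 1.8, "MS4A1": 0.1, "CD79A": 0.0},
    "cluster_1": {"CD3D": 0.2, "CD3E": 0.1, "MS4A1": 2.5, "CD79A": 1.9},
}

# Step (2): label each cluster from its marker-gene scores.
annotations = {name: label_cluster(expr) for name, expr in clusters.items()}
```

In practice the marker table must be curated by the analyst, which is the manual effort that supervised auto-annotation tools aim to remove.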
While numerous tools have been established in recent years,
most of them fail to identify the entire population because of
the existence of rare cell types. From the perspective of cell composition, scRNA-seq datasets are always imbalanced, containing both
common and rare cell types. A rare population makes up only a
small proportion of the cells in a single-cell dataset. For example,
dendritic cells usually account for 1–5% of peripheral blood mononuclear
cells (PBMCs), especially in large datasets16,17. When an
auto-annotation tool is trained on such data, the classifier consistently fails to learn
these cell types and therefore struggles to identify them in the query
dataset. However, these rare populations can be crucial, especially
in disease research18. Recently, some cluster detection methods
have addressed this issue19,20, but few classification methods have focused
on cell population imbalance. Meanwhile, we also find that the
existing methods have two other main deficiencies. (1) Lack of
scalability. Recent scRNA-seq experimental platforms enable
investigations of million-level cells21,22. Notably, one of the most
recent COVID PBMC atlases has reached 1.5 million cells17. Thus,
limited computation speed makes auto-annotation packages scale poorly to million-level datasets. Moreover, large-scale reference datasets make it even harder to learn rare cell
types during classifier training, so current software has more difficulty identifying minor groups. The most recently published work has
raised the training scale to 600 K cells23; however, no published
tool has successfully reported scalability on a million-level cell atlas. (2)
Compatibility of the existing tools is not as good as expected.
Among the existing Python-based tools, most of the tools such as
ACTINN15, scPretrain24, scCapNet25, and MarkerCount26 are
script-based. Given that Seurat and Scanpy are both packages
that can be installed from a standard software repository (e.g.,
CRAN or PyPI), having to run an external Python script on the server places an
additional burden on the user. In addition, some of the tools are no
longer maintained or can no longer be used. Together, these challenges
call for a new annotation tool that can
label both major and minor cell types in a balanced and scalable
manner.
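The effect of imbalance on training, and one common way to offset it, can be illustrated with a minimal sketch. This is not scBalance's implementation: the inverse-class-frequency weighting, the function names `class_sampling_weights` and `draw_balanced_batch`, and the label proportions are assumptions chosen for illustration, with dendritic cells set to 2% to fall within the 1–5% range mentioned above.

```python
import random
from collections import Counter

def class_sampling_weights(labels):
    """Give each cell a sampling weight of 1 / (number of cells in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

def draw_balanced_batch(labels, batch_size, seed=0):
    """Draw cell indices with replacement, weighted by inverse class frequency."""
    rng = random.Random(seed)
    weights = class_sampling_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)

# Hypothetical reference set: 70% T cells, 28% B cells, 2% dendritic cells.
labels = ["T cell"] * 700 + ["B cell"] * 280 + ["dendritic cell"] * 20

# Uniform sampling would show the classifier dendritic cells ~2% of the time;
# with inverse-frequency weights each class is drawn about equally often.
batch = draw_balanced_batch(labels, batch_size=3000)
batch_counts = Counter(labels[i] for i in batch)
```

Because the weights within each class sum to one, every class has the same total probability of being drawn, so the rare class appears in roughly a third of the sampled cells here instead of 2%.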
Here, we introduce scBalance, a sparse neural network framework that can automatically label rare cell types in scRNA-seq
datasets of all scales. scBalance combines
weight sampling with a sparse neural network, so that minor
(rare) cell types become more informative without harming the
annotation efficiency of the common (major) cell populations.
We evaluated scBalance on real datasets with varying degrees of
cell population imbalance and scale on both intra- and inter-dataset (...truncated)