Result: Learning discriminative and structural samples for rare cell types with deep generative model.

Title:
Learning discriminative and structural samples for rare cell types with deep generative model.
Authors:
Wang H; School of Computer Science and Technology, Xidian University, Xi'an, 710071, China., Ma X; School of Computer Science and Technology, Xidian University, Xi'an, 710071, China.
Source:
Briefings in bioinformatics [Brief Bioinform] 2022 Sep 20; Vol. 23 (5).
Publication Type:
Journal Article; Research Support, Non-U.S. Gov't
Language:
English
Journal Info:
Publisher: Oxford University Press Country of Publication: England NLM ID: 100912837 Publication Model: Print Cited Medium: Internet ISSN: 1477-4054 (Electronic) Linking ISSN: 14675463 NLM ISO Abbreviation: Brief Bioinform Subsets: MEDLINE
Imprint Name(s):
Publication: Oxford : Oxford University Press
Original Publication: London ; Birmingham, AL : H. Stewart Publications, [2000-
Contributed Indexing:
Keywords: Adversarial Learning; Cell Clustering; Deep Learning; Rare Cell Type; Single-cell RNA-seq
Substance Nomenclature:
63231-63-0 (RNA)
Entry Date(s):
Date Created: 20220801 Date Completed: 20220926 Latest Revision: 20221207
Update Code:
20250114
DOI:
10.1093/bib/bbac317
PMID:
35914950
Database:
MEDLINE

Further Information

Cell types (subpopulations) serve as bio-markers for the diagnosis and therapy of complex diseases, and single-cell RNA-sequencing (scRNA-seq) measures the expression of genes at the level of individual cells, paving the way for the identification of cell types. Although great efforts have been devoted to this issue, it remains challenging to identify rare cell types in scRNA-seq data because of the few-shot problem, the lack of interpretability and the separation of sample generation from the clustering of cells. To address these issues, a novel deep generative model that leverages the small samples of cells (scLDS2) is proposed; it precisely estimates the distribution of different cells and discriminates rare from non-rare cell types with adversarial learning. Specifically, to enhance the interpretability of samples, scLDS2 generates sparse fake samples of cells with the $\ell _1$-norm, where the relations among cells are learned, facilitating the identification of cell types. Furthermore, scLDS2 directly obtains cell types from the generated samples by learning the block structure with the nuclear-norm, such that cells belonging to the same type are similar to each other. scLDS2 joins the generation of samples, the classification of generated and true samples of cells, and feature extraction into a unified generative framework, which transforms the detection of rare cell types into a classification problem and enables the identification of cell types with joint learning. The experimental results on 20 datasets demonstrate that scLDS2 significantly outperforms 17 state-of-the-art methods in terms of various measurements, with a 25.12% improvement in adjusted Rand index on average, providing an effective strategy for scRNA-seq data with rare cell types. (The software is coded in Python and is freely available for academic use at https://github.com/xkmaxidian/scLDS2.)
(© The Author(s) 2022. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)


Learning discriminative and structural samples for rare cell types with deep generative model 


Introduction

Cells are the basic building blocks of tissues and organisms, and multicellular species execute their critical biological processes through large numbers of cells, which form various cell subpopulations (cell types). Evidence demonstrates that the generation, development and progression of complex diseases give rise to different cell types with various structures and functions [[1], [3]]. Thus, it is of great significance to accurately identify cell types that explicitly characterize the micro-environment and heterogeneity of tissues, which is the foundation of the diagnosis and therapy of complex diseases [[4]]. Conventional approaches utilize physical attributes of cells, such as shape, size and surface proteins, to identify cell types, and are criticized for their undesirable performance because physical features are missing for most cells [[5]]. Therefore, there is a critical need to design effective tools for identifying cell types.

Fortunately, single-cell RNA-sequencing (scRNA-seq) measures gene expression level at single-cell resolution, which provides a great opportunity to characterize and model the diversity and heterogeneity of tissues by exploiting expression profiles of cells, facilitating the understanding of the complex mechanisms of systems [[6]]. Based on the biological assumption that cells belonging to the same type exhibit the same or similar expression patterns, the computational algorithms perform the identification of cell types by manipulating the expression profiles of cells, rather than the physical features of cells in the traditional methods [[7], [9]]. Discovering cell types in scRNA-seq data corresponds to the classic clustering problem in machine learning, which assigns cells into groups such that cells within the same groups are highly similar, and dissimilar across groups.

However, designing algorithms for identifying cell types is highly nontrivial because of dropout events [[10], [12]], the curse of dimensionality [[13]] and heterogeneity [[14]]. With regard to dropout events, efforts such as CIDR [[16]] and DCA [[17]] have been devoted to this issue by exploiting the similarity of genes in terms of expression profiles. However, these algorithms are criticized for failing to address the extraordinary sparsity of scRNA-seq data. To address this issue, dimension reduction becomes a prerequisite for the identification of cell types, with the immediate purpose of extracting principal features that effectively represent cells [[18]]. For example, principal component analysis (PCA) [[19]] projects expression profiles into a low-dimensional space to extract the features with the highest variance, whereas UMAP [[20]] and t-SNE [[21]] obtain low-dimensional features of cells by embedding learning.

On the basis of their strategies, current algorithms for identifying cell types can roughly be categorized into four classes, i.e. similarity-, ensemble-, matrix factorization- and deep learning-based methods. The similarity-based approaches first calculate the similarity of cells in terms of expression profiles, and then perform clustering analysis of cells by exploiting the similarity matrix. For example, K-means iteratively updates the centers of clusters and assigns each cell to the nearest one with the Euclidean distance [[22]], whereas hierarchical clustering constructs a dendrogram of cells by progressively aggregating similar cells [[23]]. These algorithms are popular for their simplicity, but are criticized for their low accuracy since a single similarity measure is insufficient to characterize cell types. To overcome these issues, the ensemble-based methods first independently execute multiple clustering algorithms, and then generate a consensus result by manipulating the outputs of these algorithms. In comparison with the similarity-based methods, ensemble-based methods implicitly utilize multiple similarities of cells, thereby improving the accuracy and robustness of algorithms. Typical algorithms include SC3 [[24]], SAFE [[25]] and SAME [[26]].

Even though these algorithms achieve excellent performance on the identification of cell types, the similarity of cells fails to fully characterize cells, largely because of the nonlinearity of cells. To address this issue, matrix factorization-based algorithms aim to extract the latent features of cells by projecting expression profiles into a space where cells are well represented. Many algorithms have been developed with various strategies to select such spaces. For example, SOUP [[27]] obtains features of cells by using nonnegative matrix factorization (NMF), which dramatically improves the performance of algorithms. To further improve the performance, DRjCC [[28]] and jSRC [[29]] integrate dimension reduction and clustering of cells, where features of cells are learned under the guidance of the clustering of cell types, resulting in discriminative features.

However, evidence demonstrates that the latent features of cells obtained by matrix factorization cannot fully capture the heterogeneity of scRNA-seq data [[30]]. There is a critical need for effective algorithms to track the complicated heterogeneity of cell types. Recently, deep learning has provided an alternative to matrix factorization, where deep features are learned to represent the original entities. Deep learning has been successfully applied to the analysis of genomic data, and outperforms the traditional methods [[31]]. On the basis of the strategies for feature extraction, deep learning methods for the identification of cell types can be divided into two classes, i.e. auto-encoder- and GAN-based algorithms, where the former extract deep features of cells by minimizing the reconstruction error, and the latter estimate the distribution of features. Typical auto-encoder-based algorithms include netAE [[32]], scVAE [[33]], scGMAI [[34]], scDeepCluster [[35]] and DESC [[30]], which learn the deep features and then perform clustering of cells with traditional machine learning methods. However, auto-encoders are criticized for failing to precisely extract features of cells from datasets with small samples. To address this issue, GANs generate samples by estimating the distribution of features. scGAN [[36]] utilizes a GAN to obtain cell types by exploiting features of cells, which significantly improves the performance of algorithms.

Although deep learning achieves excellent performance on the clustering of cells, there are still many unsolved problems. First, current algorithms fail to identify rare cell types because of the few-shot problem. Yet rare cell types correspond to the transition status of cells, which is the foundation for revealing the development and progression of complex diseases. Even though generative adversarial networks (GANs) [[37]] can address the few-shot problem, these methods only focus on the imputation of scRNA-seq data [[39], [41]]. Thus, there is a critical need for effective algorithms for the identification of rare cell types. Second, the available deep learning algorithms are criticized for the 'black-box' phenomenon, where users only obtain the output cell types without the features of cells. Thus, the interpretability of patterns is not desirable, deviating from the expectation of algorithms. Third, the existing algorithms separate the generation of samples from the clustering of cells, so the quality of the generated samples deviates from the expectation, resulting in undesirable performance on the identification of rare cell types.

To address these problems, a novel deep generative model (called scLDS2) for identifying rare cell populations from scRNA-seq data is proposed, which precisely estimates the distribution of different cells and discriminates rare and non-rare cell types with adversarial learning. As shown in Figure 1, scLDS2 consists of three major components, i.e. sample generation, feature extraction and cell type detection, where scLDS2 estimates the distributions of rare cells in scRNA-seq data to generate fake samples. To improve the interpretability of sample features, scLDS2 generates sparse and structural fake samples of cells by imposing sparse and structural constraints, improving the detection of cell types. scLDS2 utilizes a classifier to distinguish the generated and real samples of rare cells, where the generation of fake samples and the identification of rare cell types are jointly learned. In this case, scLDS2 jointly learns rare cell samples by incorporating these two components into an overall objective function, thereby improving algorithm performance. By applying scLDS2 to 20 scRNA-seq datasets across various platforms and tissues, the experimental results demonstrate that the proposed algorithm outperforms the baselines on the identification of cell types in terms of various measurements.

Figure 1. Overview of the scLDS2 algorithm. It consists of three components, i.e. sample generation, feature extraction and cell type detection, where the first component addresses the few-shot problem by estimating the distributions of cell types with adversarial learning, the feature extraction procedure learns the sparse and structural features of cells and finally it identifies cell types.

Algorithm

In this section, the mathematical model, optimization rules and informative gene selection of the proposed algorithm are addressed.

Model

Prior to formulating the model of scLDS2, let us introduce some terminology that is commonly used. Let upper case, lower case and bold lower case letters denote matrices, scalars and vectors, respectively. Let |$Z=[\textbf{z}_{1},\ldots ,\textbf{z}_{m}] \in R^{d\times m}$| and |$Y=[\textbf{y}_{1},\ldots ,\textbf{y}_{m}]\in R^{d\times m}$| be the input random variables and the latent features of cells, respectively, where |$d$| is the number of features. The expression profile of |$n$| genes with |$m$| cells is denoted by |$X=[\textbf{x}_{1},\ldots ,\textbf{x}_{m}] \in{R^{{n\times m}}}$|. Let bold upper case letters denote functions/models. Let |$\|X\|$| and |$\|X\|_{1}$| be the |$\ell _{2}$|-norm and |$\ell _{1}$|-norm of |$X$|, i.e. |$\|X\|=\sqrt{\sum _{ij}x_{ij}^{2}}$| and |$\|X\|_{1}=\sum _{ij}|x_{ij}|$|, respectively. The nuclear-norm of matrix |$X$| is defined as [[42]]

$$\begin{align}& \|X\|_{*} = \sum_{i} \sigma_{i}(X), \tag{1}\end{align}$$

where |$\sigma _{i}(X)$| is the |$i$|-th singular value of |$X$|.
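As a small numerical illustration of the three norms used in the model, the following sketch evaluates them with NumPy on a toy matrix. The function names are hypothetical and chosen only for this example; the |$\ell _{2}$|-norm is computed here as the standard entry-wise (Frobenius) norm, i.e. the square root of the sum of squared entries.

```python
import numpy as np

def l1_norm(X):
    """Entry-wise l1-norm: sum of absolute values of all entries."""
    return np.abs(X).sum()

def l2_norm(X):
    """Entry-wise l2 (Frobenius) norm: sqrt of the sum of squared entries."""
    return np.sqrt((X ** 2).sum())

def nuclear_norm(X):
    """Nuclear norm: sum of the singular values of X, as in Eq.(1)."""
    return np.linalg.svd(X, compute_uv=False).sum()

# Toy matrix with singular values 4 and 3
X = np.array([[3.0, 0.0],
              [0.0, -4.0]])
print(l1_norm(X), l2_norm(X), nuclear_norm(X))
```

For this diagonal example the |$\ell _{1}$|-norm and the nuclear-norm both equal 7 (up to floating-point rounding) and the |$\ell _{2}$|-norm equals 5. In general the three quantities differ, since the nuclear-norm depends on the singular values rather than on the individual entries.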

The overview of scLDS2 is shown in Figure 1. It consists of three components, i.e. sample generation, feature extraction and cell type detection, where the first procedure estimates the distribution of rare cells with deep adversarial learning, feature extraction learns the sparse and structural features of cells and the last one identifies clusters of cells. Sample generation and feature extraction alternate until the algorithm converges.

On the sample generation issue, scLDS2 uses deep adversarial learning to estimate the distributions of cells, which involves a generative model |$\textbf{G}$| to generate the fake samples and a classifier |$\textbf{D}$| to discriminate the fake and true samples of cells. Following the principle of adversarial learning [[38]], |$\textbf{G}$| generates fake samples |$\textbf{G}(Z)$| close to the true samples to deceive the classifier, whereas |$\textbf{D}$| discriminates the true and generated samples. The alternation of generation and discrimination of samples estimates the distribution of cells, which is promising for few-shot learning [[43]]. Specifically, given the scRNA-seq profile |$X$|, the generative model |$\textbf{G}$| generates fake samples |$\hat{X}$| as

$$\begin{align}& \hat{X} = \textbf{G}(Z). \tag{2}\end{align}$$

We expect the generated samples |$\hat{X}$| to be close to |$X$| such that the classifier |$\textbf{D}$| fails to discriminate them, which can be fulfilled by minimizing the loss. Therefore, the cost for generating samples is formulated as

$$\begin{align}& \mathcal{O}(X,Z) = \min_{\theta_{G}} \mathbb{E}_{Z\sim P_{z}}\phi(1-\textbf{D}(\textbf{G}(Z))), \tag{3}\end{align}$$

where |$\phi $| is a predefined function, |$\mathbb{E}$| corresponds to the expectation, |$\theta _{G}$| is the parameter of the generative model and |$P_{z}$| is the distribution of |$Z$|. Then, the classifier discriminates the fake and true samples by maximizing the objective over its own parameters, and Eq.(3) is reformulated as

$$\begin{align}& \mathcal{O}(X,Z) = \min_{\theta_{G}} \max_{\theta_{D}} \mathbb{E}_{X\sim P_{x}}\phi(\textbf{D}(X))+\mathbb{E}_{Z\sim P_{z}}\phi(1-\textbf{D}(\textbf{G}(Z))), \tag{4}\end{align}$$

where |$\theta _{D}$| is the parameter of classifier |$\textbf{D}$| and |$P_{x}$| is the distribution of |$X$|.

On the feature extraction issue, the generated samples |$\textbf{G}(Z)$| are high-dimensional, so dimension reduction is adopted to project them into a low-dimensional space. scLDS2 selects a fully connected network (FCN) |$\textbf{E}$| to learn the latent features |$Y$|, which is formulated as

$$\begin{align}& Y= \textbf{E}(\textbf{G}(Z)). \tag{5}\end{align}$$

scLDS2 solves Eq.(5) by minimizing the approximation error, i.e.

$$\begin{align}& \mathcal{O}(Y) = \|Y-\textbf{E}(\textbf{G}(Z))\|^{2}. \tag{6}\end{align}$$

Evidence demonstrates that a sparse representation of features [[29]] not only improves the interpretability of patterns, but also reduces the complexity of algorithms because it avoids unnecessary computation. scLDS2 hypothesizes that cells belonging to the same type are similar to each other in terms of features, implying that the features of different types differ greatly. In this case, the relation between features of different cell types is limited, and a sparsity constraint satisfies this requirement. scLDS2 employs the |$\ell _{1}$|-norm to fulfil the goal, and Eq.(6) is reformulated as

$$\begin{align}& \mathcal{O}(Y) = \|Y-\textbf{E}(\textbf{G}(Z))\|^{2} + \|Y\|_{1}. \tag{7}\end{align}$$

Eq.(7) ensures the sparsity of features, which facilitates interpretability and computation. Furthermore, we also expect the features of cells |$Y$| to reflect the structure of types in order to identify clusters of cells. Evidence demonstrates that the nuclear-norm projects data from the original space into a low-rank subspace, where different clusters are discriminated, forming a block structure [[42]]. Thus, Eq.(7) is rewritten as

$$\begin{align}& \mathcal{O}(Y) = \|Y-\textbf{E}(\textbf{G}(Z))\|^{2} + \alpha \|Y\|_{1} + \beta\|Y\|_{*}, \tag{8}\end{align}$$

where parameters |$\alpha $| and |$\beta $| control the relative importance of the sparsity and structural constraints, respectively.
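To make the three terms of Eq.(8) concrete, the sketch below evaluates the objective for given feature matrices with NumPy. It only computes the value of the objective; it is not the optimization procedure of scLDS2 (which is derived in the supplementary material). The function name `feature_objective` and the toy matrix are hypothetical, and `E_GZ` stands in for the network output |$\textbf{E}(\textbf{G}(Z))$|.

```python
import numpy as np

def feature_objective(Y, E_GZ, alpha=1.0, beta=1.0):
    """Value of Eq.(8): fit term + alpha * l1-norm + beta * nuclear-norm.

    Y    : d x m matrix of latent features (one column per cell)
    E_GZ : d x m matrix standing in for E(G(Z)) (assumed to be given)
    """
    fit = np.sum((Y - E_GZ) ** 2)                        # ||Y - E(G(Z))||^2
    sparsity = np.abs(Y).sum()                           # ||Y||_1
    low_rank = np.linalg.svd(Y, compute_uv=False).sum()  # ||Y||_*
    return fit + alpha * sparsity + beta * low_rank

# Toy block-structured features: two cell types with two cells each
Y_block = np.array([[1.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 1.0]])
value = feature_objective(Y_block, Y_block)  # fit term is zero here
print(value)
```

With a perfect fit, the value reduces to the two penalties: the |$\ell _{1}$| term contributes 4 and the nuclear-norm term contributes |$2\sqrt{2}$|, because the two singular values of the toy matrix are both |$\sqrt{2}$|.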

Sample generation and feature extraction alternate until the algorithm converges. The optimization rules and the complexity analysis are given in Supplementary Section S1. The strategy for determining the number of clusters from our previous study [[29]] is adopted (Supplementary Section S1). After obtaining the features of cells, scLDS2 performs clustering of cells by using K-means, and the procedure of scLDS2 is illustrated in Algorithm 1.
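The adversarial objective of Eq.(4), which drives the sample generation step of this alternation, can be evaluated numerically from the outputs of the classifier. The sketch below assumes |$\phi =\log $| (the standard GAN choice; the text only states that |$\phi $| is predefined) and that |$\textbf{D}$| outputs probabilities in (0, 1). The function names are hypothetical, and the expectations are replaced by sample means.

```python
import numpy as np

def adversarial_value(d_real, d_fake, phi=np.log, eps=1e-12):
    """Sample estimate of Eq.(4): E[phi(D(X))] + E[phi(1 - D(G(Z)))].

    d_real : classifier outputs D(X) on true samples
    d_fake : classifier outputs D(G(Z)) on generated samples
    The classifier maximizes this value; the generator minimizes it.
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return phi(d_real).mean() + phi(1.0 - d_fake).mean()

def generator_loss(d_fake, phi=np.log, eps=1e-12):
    """Sample estimate of Eq.(3): E[phi(1 - D(G(Z)))]."""
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return phi(1.0 - d_fake).mean()

# A classifier that cannot tell real from fake outputs 0.5 everywhere
undecided = adversarial_value(np.full(8, 0.5), np.full(8, 0.5))
# A confident classifier scores real samples high and fake samples low
confident = adversarial_value(np.full(8, 0.9), np.full(8, 0.1))
print(undecided, confident)
```

Under the log assumption, the undecided classifier gives the value |$-\log 4$|, which is the well-known equilibrium value of the standard GAN game, whereas a confident classifier yields a larger value.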

Algorithm 1. Procedure of scLDS2.

Informative gene selection

Informative genes are bio-markers to discriminate cell types, which can be selected by using the output of scLDS2. Given the scRNA-seq profile |$X$|, scLDS2 directly obtains the features of cells |$Y$| and the cell types. NMF is employed to extract features of genes by minimizing the approximation error, i.e.

$$\begin{align}& \|X-BY\|^{2}, \quad B\geq 0. \tag{9}\end{align}$$

scLDS2 utilizes the Lasso to perform gene selection on the reconstructed data, i.e. |$BY$|, where the genes with nonzero coefficients are selected. Specifically, the objective for the Lasso is formulated as

$$\begin{align}& \mathcal{L} = \|\textbf{c}-BY\textbf{w}\|_{2}^{2}+ \lambda\|\textbf{w}\|_{1}, \tag{10}\end{align}$$

where |$\textbf{c}$| and |$\textbf{w}$| are the labels of cells and the coefficients of genes, respectively, and parameter |$\lambda $| controls the sparsity constraint.
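The Lasso step of Eq.(10) can be sketched with a plain proximal-gradient (ISTA) solver in NumPy. This is a generic solver chosen for illustration, not necessarily the one used by the authors. The function names are hypothetical, the labels |$\textbf{c}$| are assumed to be numerically coded, and the reconstructed matrix |$BY$| (genes by cells) is transposed so that each row is a cell and each column a gene, which makes the product with the gene coefficients |$\textbf{w}$| well defined.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, c, lam, n_iter=500):
    """Minimize ||c - A w||_2^2 + lam * ||w||_1 by proximal gradient."""
    # Step size 1/L, where L = 2 * sigma_max(A)^2 bounds the gradient's
    # Lipschitz constant for the squared-error term.
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ w - c)
        w = soft_threshold(w - step * grad, step * lam)
    return w

def select_genes(B, Y, labels, lam):
    """Indices of genes with nonzero Lasso coefficients.

    B : n x d nonnegative basis from NMF, Y : d x m cell features,
    labels : length-m numeric cell labels (assumed coding).
    """
    A = (B @ Y).T  # cells x genes design matrix (assumed orientation)
    w = lasso_ista(A, np.asarray(labels, dtype=float), lam)
    return np.flatnonzero(w)

# Toy case with B = Y = I, so the solution is soft_threshold(c, lam / 2)
toy_labels = np.array([3.0, 0.1, -2.0, 0.0, 0.2])
selected = select_genes(np.eye(5), np.eye(5), toy_labels, lam=1.0)
print(selected)
```

In the toy case only the first and third genes survive the shrinkage, since the remaining entries have magnitude below |$\lambda /2$|. Larger values of |$\lambda $| select fewer genes.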

Materials

Baselines

To validate the performance of the proposed algorithm, 17 state-of-the-art baselines are selected for comparison, including PCA [[19]], t-SNE [[21]], UMAP [[20]], NMF [[56]], DCA [[17]], scDeepCluster [[35]], scAIDE [[57]], GiniClust3 [[58]], CIDR [[16]], K-means [[22]], SIMLR [[59]], spectral clustering (SEC) [[60]], SOUP [[27]], SAME [[26]], SC3 [[24]], Seurat [[61]] and GAN [[38]]. The first four algorithms are selected because scLDS2 also performs dimension reduction, and DCA and CIDR are included since the proposed method also executes imputation of scRNA-seq data. The remaining baselines are chosen largely because of their popularity and excellent performance.

Datasets

A total of 20 scRNA-seq datasets are selected for experiments, including two simulated, 16 moderate-scale and two large-scale ones. Specifically, the simulated datasets with rare cell types are Splat1 and Splat2 [[44]]. The moderate-scale datasets include six mouse scRNA-seq datasets, Biase [[46]], Ting [[47]], Zeisel [[51]], Klein [[50]], Mouse1 [[45]] and Mouse2 [[45]], and 10 human datasets, including Camp [[48]] for fetal brain, CellBench1, CellBench2, CellBench3 [[49]], Tirosh [[52]], Christopher [[53]], and Human1, Human2, Human3 and Human4 [[45]] for the pancreas. The large-scale datasets include Tabula [[55]] for mouse and COVID-19 [[54]] of bronchoalveolar lavage fluid (BALF) for human. The statistics of these benchmark datasets are summarized in Table 1. Because cells have different scales due to sequencing depths and cell sizes, the transcript per million normalization is adopted, as in SOUP [[27]].
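The per-cell scaling behind this normalization can be sketched as follows. The sketch assumes a genes-by-cells matrix whose entries are already comparable across genes (for example UMI counts or length-normalized rates), so only the scaling of every cell to a total of one million is shown; a full transcript per million computation from raw read counts would additionally divide by gene length. The function name `per_million` is hypothetical.

```python
import numpy as np

def per_million(X):
    """Scale each cell (column) of a genes x cells matrix to sum to 1e6."""
    X = np.asarray(X, dtype=float)
    totals = X.sum(axis=0, keepdims=True)
    totals[totals == 0] = 1.0  # cells with no counts stay all-zero
    return X / totals * 1e6

# Two cells with the same composition but a 10-fold depth difference
counts = np.array([[1.0, 10.0],
                   [3.0, 30.0]])
normalized = per_million(counts)
print(normalized)
```

After scaling, the two cells become identical, which is the intended effect of removing differences in sequencing depth.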

Table 1. Statistics of the scRNA-seq datasets used in the experiments, where #Cells and #Types correspond to the number of cells and cell types, and Organ denotes the tissue of the cells.

<table>
<thead>
<tr><th>Group</th><th>Dataset</th><th>Species</th><th>#Cells</th><th>#Types</th><th>Organ</th><th>Platform</th><th>Ref.</th></tr>
</thead>
<tbody>
<tr><td rowspan="2">Simulated</td><td>Splat1</td><td>Simulated</td><td>4248</td><td>5</td><td>&#8211;</td><td>&#8211;</td><td>[<xref ref-type="bibr" rid="bibr44">44</xref>]</td></tr>
<tr><td>Splat2</td><td>Simulated</td><td>4286</td><td>4</td><td>&#8211;</td><td>&#8211;</td><td>[<xref ref-type="bibr" rid="bibr44">44</xref>]</td></tr>
<tr><td rowspan="6">With rare cell types</td><td>Mouse1</td><td>Mouse</td><td>822</td><td>13</td><td>Pancreas</td><td>inDrop</td><td>[<xref ref-type="bibr" rid="bibr45">45</xref>]</td></tr>
<tr><td>Mouse2</td><td>Mouse</td><td>1064</td><td>13</td><td>Pancreas</td><td>inDrop</td><td>[<xref ref-type="bibr" rid="bibr45">45</xref>]</td></tr>
<tr><td>Human1</td><td>Human</td><td>1937</td><td>14</td><td>Pancreas</td><td>inDrop</td><td>[<xref ref-type="bibr" rid="bibr45">45</xref>]</td></tr>
<tr><td>Human2</td><td>Human</td><td>1725</td><td>14</td><td>Pancreas</td><td>inDrop</td><td>[<xref ref-type="bibr" rid="bibr45">45</xref>]</td></tr>
<tr><td>Human3</td><td>Human</td><td>3605</td><td>14</td><td>Pancreas</td><td>inDrop</td><td>[<xref ref-type="bibr" rid="bibr45">45</xref>]</td></tr>
<tr><td>Human4</td><td>Human</td><td>1303</td><td>14</td><td>Pancreas</td><td>inDrop</td><td>[<xref ref-type="bibr" rid="bibr45">45</xref>]</td></tr>
<tr><td rowspan="12">Without rare cell types</td><td>Biase</td><td>Mouse</td><td>56</td><td>4</td><td>Fetal brain</td><td>SMART-Seq SMARTer</td><td>[<xref ref-type="bibr" rid="bibr46">46</xref>]</td></tr>
<tr><td>Ting</td><td>Mouse</td><td>187</td><td>7</td><td>Tumors</td><td>scRNA-Seq, modified Tang2010 protocol</td><td>[<xref ref-type="bibr" rid="bibr47">47</xref>]</td></tr>
<tr><td>Camp</td><td>Human</td><td>220</td><td>7</td><td>Fetal brain</td><td>SMART-Seq SMARTer</td><td>[<xref ref-type="bibr" rid="bibr48">48</xref>]</td></tr>
<tr><td>CellBench1</td><td>Human</td><td>225</td><td>3</td><td>Cancer cell</td><td>Drop-seq Dolomite</td><td>[<xref ref-type="bibr" rid="bibr49">49</xref>]</td></tr>
<tr><td>CellBench2</td><td>Human</td><td>297</td><td>3</td><td>Cancer cell</td><td>CEL-seq2</td><td>[<xref ref-type="bibr" rid="bibr49">49</xref>]</td></tr>
<tr><td>Klein</td><td>Mouse</td><td>2717</td><td>4</td><td>Embryonic</td><td>inDrop</td><td>[<xref ref-type="bibr" rid="bibr50">50</xref>]</td></tr>
<tr><td>Zeisel</td><td>Mouse</td><td>3005</td><td>7</td><td>Mouse brain</td><td>Quantitative scRNA-seq with UMI</td><td>[<xref ref-type="bibr" rid="bibr51">51</xref>]</td></tr>
<tr><td>CellBench3</td><td>Human</td><td>3918</td><td>3</td><td>Cancer cell</td><td>10X Chromium</td><td>[<xref ref-type="bibr" rid="bibr49">49</xref>]</td></tr>
<tr><td>Tirosh</td><td>Human</td><td>4645</td><td>7</td><td>Tumor</td><td>inDrop</td><td>[<xref ref-type="bibr" rid="bibr52">52</xref>]</td></tr>
<tr><td>Christopher</td><td>Human</td><td>5902</td><td>9</td><td>Oral cavity</td><td>inDrop</td><td>[<xref ref-type="bibr" rid="bibr53">53</xref>]</td></tr>
<tr><td>COVID-19</td><td>Human</td><td>63 103</td><td>10</td><td>BALF</td><td>10x Genomics</td><td>[<xref ref-type="bibr" rid="bibr54">54</xref>]</td></tr>
<tr><td>Tabula</td><td>Mouse</td><td>110 824</td><td>120</td><td>20 organs</td><td>10x Genomics</td><td>[<xref ref-type="bibr" rid="bibr55">55</xref>]</td></tr>
</tbody>
</table>

Evaluation Measurement

To fully evaluate the performance of various algorithms, three commonly used measurements for cell types are selected, i.e. adjusted Rand index (ARI), normalized mutual information (NMI) and accuracy (ACC). Moreover, two clustering-based indexes, compactness and separation, are adopted to measure the performance of various algorithms.

ARI [[62]] is defined as

$$\begin{align} ARI(c^{*},c)=\dfrac{\sum\limits_{e,t}\binom{m_{et}}{2}-\left[\sum\limits_{e}\binom{m_{e}}{2}\sum\limits_{t}\binom{m_{t}}{2}\right]/\binom{m}{2}}{\frac{1}{2}\left[\sum\limits_{e}\binom{m_{e}}{2}+\sum\limits_{t}\binom{m_{t}}{2}\right]-\left[\sum\limits_{e}\binom{m_{e}}{2}\sum\limits_{t}\binom{m_{t}}{2}\right]/\binom{m}{2}}, \tag{1}\end{align}$$

where |$m$| is the total number of single cells, |${m_e}$| and |${m_t}$| are the numbers of single cells in the predicted cluster |$e$| and in the true cluster |$t$|, and |${m_{et}}$| is the number of single cells shared by |$e$| and |$t$|.

Normalized mutual information (NMI) is the most widely used measurement for cell types. NMI is defined as

$$\begin{align}& NMI(C,C^{*}) = \frac{{ - 2\sum\limits_{i = 1}^{\left| {{C^*}} \right|} {\sum\limits_{j = 1}^{\left| C \right|} {{h_{ij}}\log \frac{{{h_{ij}}H}}{{{h_{i.}}{h_{.j}}}}}}} }{{\sum\limits_{i = 1}^{\left| {{C^*}} \right|} {{h_{i.}}\log \frac{{{h_{i.}}}}{H} + } \sum\limits_{j = 1}^{\left| C \right|} {{h_{.j}}\log \frac{{{h_{.j}}}}{H}}}}, \tag{2}\end{align}$$

where |$H$| is a confusion matrix whose rows and columns correspond to the true clusters |${C^*}$| and the predicted clusters |$C$|, with element |${h_{ij}}$| as the number of cells shared by the |$i$|-th true and |$j$|-th predicted cluster, |$|C^{*}|$| is the number of clusters in |${C^*}$|, |${h_{i.}}$| is the sum of the |$i$|-th row of |$H$| and |${h_{.j}}$| is the sum of the |$j$|-th column of |$H$|. Inside the logarithms, |$H$| stands for the sum of all elements of the confusion matrix, i.e. the total number of cells.

ACC is also used to measure the performance of algorithms, which is defined as

$$\begin{align}& Acc = \frac{1}{k}\sum_{i = 1}^{k} \delta (C_{i},C_{i}^{*}), \tag{3}\end{align}$$

where |$\delta (x,y)$| is 1 if |$x=y$| and 0 otherwise.

Compactness (CP) is the average distance of cells to the center of the corresponding cell types, which is defined as

$$\begin{align}& CP = \frac{1}{k}\sum\limits_{i = 1}^{k}{\frac{1}{|C_{i}|} \sum\limits_{\textbf{x}_{j}\in C_{i}}\|\textbf{x}_{j}-\textbf{c}_{i}\|}, \tag{4}\end{align}$$

where |$k$| and |$\textbf{c}_{i}$| are the number of clusters and the center of cluster |$C_{i}$|, respectively, and |$|C_{i}|$| is the number of cells in |$C_{i}$|.

Separation (SP) is the average distance among the centers of various cell types, which is defined as

$$\begin{align}& SP = \frac{2}{k^2-k}\sum\limits_{i = 1}^{k-1}\sum\limits_{j=i+ 1}^{k}{\|\textbf{c}_{i}-\textbf{c}_{j}\|}^{2}. \tag{5}\end{align}$$
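The ARI, CP and SP formulas above translate directly into a few lines of Python. The sketch below is a minimal reference implementation under stated assumptions: the function names are hypothetical, the feature matrix is arranged as cells by features (the transpose of the convention used for |$X$| in the model), SP follows the formula as written with squared distances between centers, and degenerate inputs (for example a single cluster) are not handled.

```python
from collections import Counter
from math import comb

import numpy as np

def adjusted_rand_index(truth, pred):
    """ARI from pair counts of the contingency table."""
    m = len(truth)
    sum_et = sum(comb(v, 2) for v in Counter(zip(pred, truth)).values())
    sum_e = sum(comb(v, 2) for v in Counter(pred).values())
    sum_t = sum(comb(v, 2) for v in Counter(truth).values())
    expected = sum_e * sum_t / comb(m, 2)
    return (sum_et - expected) / (0.5 * (sum_e + sum_t) - expected)

def cluster_centers(X, labels):
    """Mean feature vector of every cluster (X is cells x features)."""
    labels = np.asarray(labels)
    return np.array([X[labels == c].mean(axis=0) for c in np.unique(labels)])

def compactness(X, labels):
    """CP: mean over clusters of the mean distance to the cluster center."""
    labels = np.asarray(labels)
    centers = cluster_centers(X, labels)
    per_cluster = [
        np.linalg.norm(X[labels == c] - centers[i], axis=1).mean()
        for i, c in enumerate(np.unique(labels))
    ]
    return float(np.mean(per_cluster))

def separation(X, labels):
    """SP: mean squared distance between all pairs of cluster centers."""
    centers = cluster_centers(X, labels)
    k = len(centers)
    total = sum(
        np.sum((centers[i] - centers[j]) ** 2)
        for i in range(k - 1) for j in range(i + 1, k)
    )
    return 2.0 * total / (k * k - k)

# ARI ignores label names: a relabeled perfect clustering scores 1
ari_perfect = adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])

# Two tight clusters on a line, with centers at (1, 0) and (11, 0)
cells = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
cp = compactness(cells, [0, 0, 1, 1])
sp = separation(cells, [0, 0, 1, 1])
print(ari_perfect, cp, sp)
```

A small CP together with a large SP indicates tight, well separated clusters, whereas ARI compares the predicted partition against the true cell types.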

Results

To fully validate the performance of scLDS2, it is compared with 17 baselines on 20 datasets.

scLDS2 is scalable for large-scale scRNA-seq data

Before describing the performance of various algorithms, we investigate the effects of the parameters. Recall that three parameters are involved in scLDS2, where |$\alpha $| and |$\beta $| control the importance of the sparsity and structural constraints, and |$d$| is the number of dimensions. For each parameter, we check how the ARI of scLDS2 changes as the parameter varies while the others are fixed, where the values of the parameters are empirically selected. How parameter |$d$| affects the performance of scLDS2 is shown in Figure 2A, where the performance of scLDS2 increases as |$d$| grows from 10 to 50, and decreases when |$d$| is greater than 50. When the dimension is small, the features of cells cannot fully characterize the structure of cell types. When the dimension is large, the features of cells are redundant, thereby resulting in undesirable performance. When |$d=50$|, scLDS2 achieves the best performance.

Figure 2. Parameter effect analysis. How the performance of scLDS2 changes with various values of parameters on different datasets: (A) ARI versus |$d$|, (B) ARI versus |$\alpha $|, (C) ARI versus |$\beta $|, (D) Convergence analysis of scLDS2 on the Tabula dataset, (E) Running time of various algorithms on the COVID-19 dataset and (F) Scalability analysis of scLDS2.

How parameter |$\alpha $| affects the performance of scLDS2 is shown in Figure 2B, where the value of |$\alpha $| increases from 0.0001 to 100. It is easy to see that scLDS2 is quite stable when |$\alpha <10$|, and the performance dramatically declines when |$\alpha >10$|. There is a good reason for this tendency. When |$\alpha $| is large, scLDS2 prefers to learn very sparse features of cells, which fail to characterize the structure of cells. scLDS2 achieves a good balance between the sparsity and structural constraints when |$\alpha \in [0.001,10]$|. As shown in Figure 2C, parameter |$\beta $| has a similar tendency to |$\alpha $|, implying that the sparsity and structural constraints are of equal importance in scLDS2. Thus, we set the number of dimensions to 50, and |$\alpha =\beta =1$| in all experiments.

Figure 2D illustrates the convergence of scLDS2 on the Tabula dataset, where the Y-axis denotes the relative error of the objective function, and the X-axis is the number of iterations. It is easy to conclude that scLDS2 requires 2000 iterations to converge, implying that the proposed algorithm is efficient compared with deep learning methods. By taking the large-scale COVID-19 dataset as a benchmark, the running time of various algorithms is presented in Figure 2E, where scLDS2 is superior to DCA, SOUP and SAME. Specifically, the running time on the COVID-19 dataset is 3.01, 0.42, 0.19 and 0.03 h for SAME, SOUP, UMAP and DCA, respectively, whereas that of scLDS2 is 0.08 h. Figure 2F presents the scalability analysis of scLDS2, implying that scLDS2 can handle scRNA-seq data with more than 100 000 cells.

scLDS2 precisely identifies rare cell types in artificial data

To validate the performance of scLDS2 on the identification of rare cell types, 10 state-of-the-art methods, namely PCA [[19]], t-SNE [[21]], UMAP [[20]], NMF [[56]], DCA [[17]], SAME [[26]], GAN [[38]], SIMLR [[59]], K-means and SEC, are deliberately selected. According to [[26]], a cell type is rare if and only if it accounts for less than 1% of the cells. Two artificial scRNA-seq datasets, Splat1 and Splat2, are selected to benchmark these algorithms [[44]]. Specifically, Splat1 contains 4248 cells and five types, including two rare cell types (Type1 with 42 cells, Type2 with 40 cells) and three non-rare types (Type3 with 1415 cells, Type4 with 1298 cells and Type5 with 1453 cells). Splat2 consists of 4286 cells and four types, including three non-rare types (Type1 with 1394 cells, Type2 with 1444 cells and Type3 with 1411 cells) and one rare type (Type4 with 37 cells).
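As a quick check of the 1% criterion on the cell counts reported above, the following minimal sketch flags the rare types in Splat1 and Splat2. The function name `rare_cell_types` and the dictionary layout are illustrative choices, not part of the scLDS2 software.

```python
def rare_cell_types(type_counts, threshold=0.01):
    """Return {type: fraction} for types whose share of cells is below threshold."""
    total = sum(type_counts.values())
    return {
        name: count / total
        for name, count in type_counts.items()
        if count / total < threshold
    }


# Cell counts as reported in the text for the two artificial datasets.
splat1 = {"Type1": 42, "Type2": 40, "Type3": 1415, "Type4": 1298, "Type5": 1453}
splat2 = {"Type1": 1394, "Type2": 1444, "Type3": 1411, "Type4": 37}

for dataset_name, counts in (("Splat1", splat1), ("Splat2", splat2)):
    rare = rare_cell_types(counts)
    summary = ", ".join(f"{name} ({fraction:.2%})" for name, fraction in rare.items())
    print(f"{dataset_name}: {sum(counts.values())} cells, rare types: {summary}")
```

With these counts, Type1 and Type2 of Splat1 and Type4 of Splat2 each fall just below the 1% threshold.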

Figure 3A presents the visualization of cell types with different colors on Splat2 based on various methods, where panel A1 is based on the original expression profiles, panel A2 on the features of cells obtained by DCA, panel A3 on the features obtained by scDeepCluster and panel A4 on the features of cells obtained by scLDS2. Figure 3A1 shows that the rare cell type (yellow) is seriously mixed with non-rare cell types, which makes it difficult to discriminate rare and non-rare cell types. Figure 3A2 demonstrates that DCA precisely discriminates non-rare cell types, whereas the rare and non-rare ones are mixed (surrounded by the solid circle), implying that it cannot characterize and model the structure of rare cell types. Figure 3A3 demonstrates that scDeepCluster discriminates the rare cell type from the non-rare ones, whereas the non-rare ones are mixed. Figure 3A4 is the visualization of features obtained by scLDS2, where the rare cell type (yellow) is clearly separated from non-rare cell types. These panels demonstrate that scLDS2 is promising for the identification of rare cell types.

Graph: Figure 3 Performance of various algorithms for the identification of rare cell types in artificial scRNA-seq datasets: (A) Visualization of cells with the original expression profiles of Splat2 (A1), features of cells obtained by DCA (A2), features of cells obtained by scDeepCluster (A3), features of cells obtained by scLDS2 (A4), (B) Accuracy of various algorithms and (C) ARI of various algorithms.

The performance of various algorithms on the identification of rare cell types is depicted in Figure 3, where panel B is ACC, and panel C is ARI on the artificial scRNA-seq datasets. Notice that algorithms whose performance is less than the minimum value of the Y-axis are omitted. These panels show that the proposed algorithm identifies rare cell types much more precisely. In detail, ACC of scLDS2 is 62.45% on the Splat1 dataset, whereas it is 46.8%, 44.2%, 49.5%, 56.2%, 34.8%, 54.6%, 59.8%, 53.9%, 55.1%, 37.5%, 53.7%, 50.3% and 61.1% for PCA, NMF, t-SNE, UMAP, DCA, GAN, SAME, K-means, SIMLR, SEC, scDeepCluster, scAIDE and GiniClust3, respectively. When Splat1 is replaced with Splat2, scLDS2 is still the best algorithm for rare cell types. Furthermore, scLDS2 outperforms baselines for the identification of rare cell types in terms of ARI (NMI, Supplementary Figure SFig. 1), indicating that scLDS2 is insensitive to measurements. There are three reasons why scLDS2 significantly outperforms baselines for rare cell types. First, scLDS2 utilizes deep adversarial learning to estimate the distribution of rare cell types, which characterizes the structure of rare cell types more precisely. Second, dimension reduction of generated samples learns the latent features of cells, which capture the structure of rare cells more accurately. Third, the sparsity and structural constraints further enhance the quality of features, thereby resulting in an excellent performance.
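For readers unfamiliar with ARI, the measurement can be computed from the contingency table of true and predicted labels. The following is a minimal sketch of the standard adjusted Rand index, not the authors' evaluation code; the function name and the toy labels are illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells (needs >= 2 cells)."""
    n_cells = len(labels_true)
    # Pairs of cells that share both the true type and the predicted cluster.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs of cells that share the true type, and pairs that share the predicted cluster.
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / comb(n_cells, 2)
    maximum = (same_true + same_pred) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (same_both - expected) / (maximum - expected)


truth = [0, 0, 0, 1, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 1, 0, 0, 0]))  # same partition, renamed labels
print(adjusted_rand_index(truth, [0, 0, 1, 1, 2, 2]))  # partially matching partition
```

Because ARI depends only on which cells are grouped together, renaming the cluster labels leaves the score unchanged.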

scLDS2 accurately extracts rare cell types in biological scRNA-seq data

Rare cell types usually correspond to transition status of systems, which are critical for tracking the development and trajectory of cells [[15]]. We investigate the capability of scLDS2 for the identification of rare cell types in biological scRNA-seq data. Six scRNA-seq datasets with rare cell types from various species are selected as shown in Table 1.

Taking Human2 as an example, there are five rare cell types, including epsilon (0.1%), macrophage (0.9%), mast (0.2%), schwann (0.3%) and T cells (0.1%). Figure 4A visualizes cell types in Human2 with various features of cells, where panel A1 is based on the original expression profiles, A2 on the features of cells obtained by DCA, A3 on the features of cells obtained by scDeepCluster and panel A4 on the features of cells obtained by scLDS2. As shown in Figure 4A2 and A3, DCA and scDeepCluster fail to identify rare cell types since these rare cell types are mixed with non-rare ones (surrounded by the dashed circle). Interestingly, Figure 4A4 shows that four rare cell types, including macrophage, mast, schwann and T cells, are well separated from non-rare cell types, implying that epsilon cells may shed light on the mechanisms of biological systems [[15]]. A similar tendency is observed for the other biological scRNA-seq datasets (Supplementary Figure SFig. 2). These results demonstrate that scLDS2 also accurately captures the structure of rare cell types in biological scRNA-seq data.

Graph: Figure 4 Performance of various algorithms for the identification of rare cell types in biological scRNA-seq data: (A) Visualization of cell types in Human2 with the original expression profiles (A1), features of cells obtained by DCA (A2), features of cells obtained by scDeepCluster (A3) and features of cells obtained by scLDS2 (A4), and (B/C) Accuracy (NMI) of various algorithms on different datasets.

Then, we compare scLDS2 with baselines by applying them to these six scRNA-seq datasets to comprehensively evaluate performance on the identification of rare cell types, and the results are shown in Figure 4. Specifically, Figure 4B reports ACC of various algorithms for rare cell types, where scLDS2 significantly outperforms baselines. In detail, ACC of scLDS2 is 75.8% (Human3), 63.7% (Human2), 58.0% (Human1), 67.9% (Human4), 54.2% (Mouse1) and 65.6% (Mouse2), whereas that of the best baseline is 45.1% (Human3), 41.03% (Human2), 45.2% (Human1), 46.5% (Human4), 47.1% (Mouse1) and 29.5% (Mouse2), respectively. In other words, scLDS2 achieves the best performance on all these datasets, implying that scLDS2 is also promising for modeling and identifying rare cell types in biological datasets. When ACC is replaced with NMI, scLDS2 also achieves the best performance for the identification of rare cell types on five datasets (Figure 4C), whereas some algorithms, such as DCA, are very sensitive to measurements. A similar tendency is observed when ARI is used (Supplementary Figure SFig. 3). These panels demonstrate that scLDS2 improves not only the accuracy but also the robustness of the identification of rare cell types.

There are several reasons to explain why scLDS2 is superior to baselines on the identification of rare cell types. First, scLDS2 joins the deep features and learns the structural and sparse features of cells, which provides a more comprehensive strategy to characterize the structure and features of cells, overcoming the limitation of the traditional machine learning methods. Second, scLDS2 accurately estimates the distributions of rare cell types with deep adversarial learning, and exploits the features of cells on the generated samples, rather than solely on the observed cells, which properly balances the non-rare and rare cell types.

Rare cells facilitate the identification of non-rare cell types

The previous experiments demonstrate that scLDS2 precisely identifies rare cell types. Then, we ask whether the proposed algorithm is also effective for non-rare cell types. Nine state-of-the-art methods, such as K-means [[22]], SIMLR [[59]], Seurat [[61]], spectral clustering (SEC) [[60]], SOUP [[27]], SAME [[26]], SC3 [[24]] and GAN [[38]], are deliberately selected, covering the typical categories of methods for cell types. Specifically, the first three algorithms are typical clustering-based methods, while SEC and SOUP are matrix factorization-based methods. SAME and SC3 are based on ensemble clustering, whereas GAN is a deep learning method. All these algorithms are executed with their suggested parameter values.

Twelve scRNA-seq datasets without rare cell types (Table 1) are adopted to validate the performance of various methods, which is shown in Figure 5. Surprisingly, scLDS2 achieves the best performance on 9 of 12 datasets, demonstrating that the proposed algorithm also precisely identifies non-rare cell types in scRNA-seq data. Specifically, ACC of scLDS2 is 98.0% (Biase), 88.4% (Ting), 73.0% (Camp), 82.7% (CellBench1), 98.0% (CellBench2), 98.6% (CellBench3), 79.5% (Klein), 77.6% (Zeisel), 71.7% (Tirosh), 87.8% (Christopher), 81.8% (COVID-19-M), 78.4% (COVID-19-S) and 66.2% (Tabula), whereas that of the best baseline is 97.8% (Biase), 86.6% (Ting), 75.9% (Camp), 75.0% (CellBench1), 95.0% (CellBench2), 95.0% (CellBench3), 81.9% (Klein), 82.4% (Zeisel), 56.2% (Tirosh), 74.2% (Christopher), 80.2% (COVID-19-M), 85.2% (COVID-19-S) and 65.2% (Tabula), respectively. scLDS2 also obtains the best performance in terms of NMI and ARI (Supplementary materials SFig. 4), implying that scLDS2 precisely identifies cell types. The reason why scLDS2 is superior to baselines is that the generative adversarial strategy not only precisely estimates distributions of rare cells, but also accurately predicts distributions of non-rare cells.

Graph: Figure 5 Performance of various algorithms on the 20 scRNA-seq datasets mixing the non-rare and rare cell types in terms of accuracy.

Then, we check whether scLDS2 can effectively discriminate non-rare and rare cell types in scRNA-seq datasets by applying the algorithms to eight datasets, where the performance of various algorithms in terms of ACC is shown in Figure 5. The typical clustering-based methods are the worst, and ensemble-based algorithms are inferior to matrix factorization-based ones. For example, ACC of K-means, Seurat and SIMLR is 53.4%, 52.1% and 45.3% on the Human3 dataset, respectively (Supplementary materials Table 1), which is much lower than that of matrix factorization-based algorithms. The reason is that these algorithms directly perform clustering on the expression profiles of scRNA-seq data without fully exploiting features of cells. The ensemble-based methods are better than K-means and SIMLR in terms of accuracy since they integrate outputs of multiple approaches. However, they are inferior to SEC and SOUP because the latent features of cells characterize cell types more accurately. These results demonstrate that exploration of discriminative features is promising for the identification of cell types because of noise and imputation in scRNA-seq data. Interestingly, scLDS2 achieves the best performance on all these datasets with non-rare and rare cell types, indicating that the proposed algorithm is more accurate than current baselines. These results further demonstrate that rare cell types facilitate the identification of non-rare cell types.

scLDS2 significantly enhances the quality of features of cells

scLDS2 estimates the distributions of cell types by generating faked samples with deep adversarial learning, and then performs dimension reduction on the generated samples to obtain features of cells. It is therefore natural to ask whether the generated samples can be directly applied to the identification of cell types. Figure 6A visualizes the cell types in the Human3 dataset, where panel A1 is based on expression profiles of the generated samples, and A2 is based on the features of cells with FCN. As shown in Figure 6A1, cells are highly mixed. In detail, beta (green) and alpha (yellow) cells are difficult to discriminate since the boundary is not well separated, implying that the generated samples cannot be directly applied to the identification of cell types. However, as shown in Figure 6A2, cells are well separated in terms of features of cells after dimension reduction with FCN, implying that features of cells characterize cell types more precisely. A similar tendency is observed for the other datasets (Supplementary materials SFig. 5). These results demonstrate that scLDS2 precisely estimates the distributions of cells, whereas the generated samples cannot be directly applied to the identification of cell types because of their high dimensionality.

Graph: Figure 6 Comparison of features of cells learned by various algorithms on the Human3 dataset: (A) Visualization of cell types with features of cells obtained by generator (A1) and scLDS2 (A2), (B) Distribution of compactness of cell types obtained by various algorithms, (C) Distribution of separation of cell types obtained by various algorithms and (D) Comparison of various algorithms on the identification of cell types.

Then, to check whether the dimension reduction strategy adopted by scLDS2 is effective, four typical algorithms, including PCA [[19]], t-SNE [[21]], UMAP [[20]] and NMF [[56]], are selected for comparison by checking the quality of features. Since there is no clear-cut definition of the quality of features, we address this issue by checking how these features facilitate clustering of cells. Therefore, two criteria, compactness and separation, are selected, where compactness quantifies the distance of cells to the center of the corresponding cell types, and separation measures the distance among the centers of various cell types. The distribution of compactness of cell types for various algorithms is shown in Figure 6B, where clusters of cells obtained by scLDS2 are more compact than those of baselines. Notice that the distribution of compactness for UMAP is absent since it is beyond the range of the distribution (2.36). Specifically, the mean of compactness is 0.03, 0.24, 0.025, 2.37 and 0.177 for scLDS2, PCA, t-SNE, UMAP and NMF, respectively, where the P-value is 3.0E-4 (scLDS2 versus PCA, Student t-test), 0.72 (scLDS2 versus t-SNE, Student t-test), 1.9E-7 (scLDS2 versus UMAP, Student t-test) and 2.9E-2 (scLDS2 versus NMF, Student t-test). Furthermore, Figure 6C demonstrates that separation of cell types for scLDS2 is significantly higher than that of baselines, with P-values of 7.2E-18 (scLDS2 versus PCA), 7.7E-24 (scLDS2 versus t-SNE), 4.0E-6 (scLDS2 versus UMAP) and 2.7E-21 (scLDS2 versus NMF), respectively. In detail, the mean of separation is 0.90 (scLDS2), 0.49 (PCA), 0.28 (t-SNE), 0.64 (UMAP) and 0.32 (NMF), respectively. This tendency also occurs in the remaining datasets (Supplementary materials SFig. 6 and Supplementary materials SFig. 7).
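The two criteria can be illustrated with a minimal sketch. It assumes that compactness is the mean Euclidean distance of cells to the centroid of their own type and that separation is the mean pairwise Euclidean distance between type centroids; the exact normalization used for Figure 6 is not specified in this section, so these formulas and all names below are assumptions.

```python
from itertools import combinations
from math import dist
from statistics import mean


def centroids(features, labels):
    """Centroid (coordinate-wise mean) of the cells in each type."""
    groups = {}
    for point, label in zip(features, labels):
        groups.setdefault(label, []).append(point)
    return {
        label: tuple(mean(axis) for axis in zip(*points))
        for label, points in groups.items()
    }


def compactness(features, labels):
    """Mean distance of each cell to the centroid of its own type (lower is tighter)."""
    centers = centroids(features, labels)
    return mean(dist(point, centers[label]) for point, label in zip(features, labels))


def separation(features, labels):
    """Mean pairwise distance between type centroids (needs at least two types)."""
    centers = list(centroids(features, labels).values())
    return mean(dist(a, b) for a, b in combinations(centers, 2))


# Toy 2-D features for two cell types (made-up values for illustration only).
features = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
labels = ["alpha", "alpha", "beta", "beta"]
print(compactness(features, labels), separation(features, labels))
```

Under these definitions, good features give a small compactness and a large separation, which is the pattern reported for scLDS2 above.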

Moreover, we further investigate whether the quality of features obtained by scLDS2 really enhances the accuracy of the identification of cell types. The performance of various algorithms in terms of ACC is shown in Figure 6D, where scLDS2 achieves the best performance on 16 of 18 real biological scRNA-seq datasets, and has a performance similar to that of the best baseline on Klein (scLDS2 79.5% versus UMAP 81.8%) and Tirosh (scLDS2 71.7% versus NMF 72.9%). Specifically, ACC of scLDS2 is 98.0% (Biase), 88.4% (Ting), 73.0% (Camp), 82.7% (CellBench1), 98.0% (CellBench2), 98.6% (CellBench3), 77.6% (Mouse1), 74.3% (Mouse2), 81.5% (Human4), 73.3% (Human2), 88.2% (Human1), 79.5% (Klein), 77.6% (Zeisel), 88.4% (Human3), 71.7% (Tirosh), 87.8% (Christopher), 81.8% (COVID-19-M), 78.4% (COVID-19-S) and 66.2% (Tabula), respectively. The improvement of ACC ranges from -2.8% (2 of 18 datasets) to 102.7%. These results demonstrate that scLDS2 significantly outperforms baselines, indicating that scLDS2 efficiently extracts high-quality features of cells with adversarial learning. To check whether the superiority of scLDS2 is sensitive to measurements, we also compare the algorithms in terms of NMI and ARI, which further shows that scLDS2 significantly outperforms baselines (Supplementary materials Table 2, Supplementary materials SFig. 8).
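The lower end of the reported improvement range can be reproduced with a short calculation, assuming that "improvement" means the change in ACC relative to the best baseline. This reading is an assumption, but it yields the reported -2.8% for Klein; the function name is illustrative.

```python
def relative_improvement(method_acc, baseline_acc):
    """Percentage change of method_acc relative to baseline_acc."""
    return 100.0 * (method_acc - baseline_acc) / baseline_acc


# ACC values (in %) reported in the text for the two datasets where scLDS2 is not the best.
klein = relative_improvement(79.5, 81.8)   # scLDS2 versus UMAP
tirosh = relative_improvement(71.7, 72.9)  # scLDS2 versus NMF

print(f"Klein: {klein:.1f}%, Tirosh: {tirosh:.1f}%")
```

Under this assumption, the two datasets with a negative improvement are Klein (about -2.8%) and Tirosh (about -1.6%).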

There are two reasons to explain why scLDS2 obtains high-quality features of cells. First, scLDS2 characterizes the distribution of scRNA-seq profiles, which accurately captures the features of cells. Second, the sparsity and structural constraint improve the quality of features, facilitating the identification of cell types. These results demonstrate that the features of cells obtained by scLDS2 are promising for the identification of cell types since they are more precise to characterize structure of cell types.

scLDS2 infers informative genes for cell types

Informative genes are bio-markers that discriminate cell types, which offer an insight into the development of cells. scLDS2 first identifies cell types, and then extracts genes that are discriminative for cell types by using Lasso (Materials and Methods Section).

Taking the mouse brain scRNA-seq data as an example, scLDS2 utilizes the coefficients of variables in Lasso to select 43 informative genes, as shown in Figure 7A. In this case, scLDS2 not only selects informative genes, but also ranks them, providing useful clues for biologists. To compare the informative and non-informative genes, we obtain the mutation rate of genes from COSMIC [[63]]. As shown in Figure 7B, informative genes are more likely to be mutated than non-informative genes (P = 3.7E-2, Student t-test), implying that scRNA-seq is also related to the mutation of genes. Furthermore, six of these informative genes are brain-causing genes, including Maml3, Plcb4, Tuba8, Mdga1, Cpm and Cdon [[64]]. Then, we check the overlap between the informative genes and the differentially expressed genes. Figure 7C demonstrates that 100% of informative genes obtained by scLDS2 are differentially expressed, whereas that for PCA and NMF is 88.2% and 80.2%, respectively. These results demonstrate that scLDS2 further filters differentially expressed genes for cell types.
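The idea of selecting and ranking genes by Lasso coefficients can be sketched as follows. This is not the scLDS2 implementation: the response variable, the regularization weight `alpha`, the synthetic data and the small coordinate-descent solver are all assumptions made for illustration (in practice a library solver would normally be used).

```python
import numpy as np


def soft_threshold(value, threshold):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/2n)||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0) / n_samples
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with the current contribution of gene j removed.
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n_samples
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w


# Synthetic example: 200 cells, 10 genes, only genes 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardise each gene
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)
y = y - y.mean()                                    # centre the response

coef = lasso_coordinate_descent(X, y, alpha=0.1)
# Rank genes by absolute coefficient and keep only those with non-zero weight.
ranked = [int(j) for j in np.argsort(-np.abs(coef)) if coef[j] != 0.0]
print(ranked)
```

Genes with a zero coefficient are discarded, and the remaining genes are ordered by the magnitude of their coefficients, which mirrors the select-then-rank behaviour described above.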

Graph: Figure 7 Analysis of informative genes on Zeisel datasets: (A) Coefficients of informative genes, (B) Distribution of mutation rate of the informative and non-informative genes, (C) Percentage of informative genes obtained by various algorithms that overlap with differentially expressed genes, (D) Percentage of informative genes associated with survival time of patients and (E–F) Kaplan–Meier survival analysis of informative genes Maml3 and Ttc39a, respectively.

Evidence indicates that bio-marker genes are associated with the survival time of patients [[66]]. Therefore, we hypothesize that informative genes also serve as bio-markers to predict patient survival time. By using the gene expression profiles and clinical information of gliomas from TCGA, Kaplan–Meier survival analysis is performed to identify informative genes that are significantly associated with the survival time of patients. For example, Maml3 separates the patients into high and low/intermediate expression groups according to gene expression level, where the survival time of these two groups differs significantly with P = 1.0E-4 (Log-rank test, Figure 7E), and Ttc39a is also significantly associated with the survival time of patients with P = 4.6E-4 (Log-rank test, Figure 7F).
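The survival comparison can be illustrated with a minimal Kaplan–Meier sketch. The cohort below is made up, the median split into high and low expression groups is an assumed cutoff (the text does not state the one used), and the log-rank test is not implemented here.

```python
from statistics import median


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with an observed death.

    events[i] is 1 if patient i died at times[i] and 0 if the patient was censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve


def split_by_expression(expression):
    """Indices of patients above the median expression and of the rest."""
    cutoff = median(expression)
    high = [i for i, x in enumerate(expression) if x > cutoff]
    low = [i for i, x in enumerate(expression) if x <= cutoff]
    return high, low


# Hypothetical cohort: survival time (months), death indicator, gene expression.
times = [2, 3, 3, 5, 8, 9, 12, 14]
events = [1, 1, 0, 1, 0, 1, 0, 0]
expression = [9.1, 8.7, 8.9, 9.5, 3.2, 2.8, 3.9, 1.5]

high, low = split_by_expression(expression)
high_curve = kaplan_meier([times[i] for i in high], [events[i] for i in high])
low_curve = kaplan_meier([times[i] for i in low], [events[i] for i in low])
print(high_curve)
print(low_curve)
```

In this toy cohort the high-expression group has a much steeper survival curve than the low-expression group; a log-rank test on the two groups would then quantify whether the difference is significant.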

To check the performance of informative genes obtained by various algorithms, the percentage of informative genes that are significantly associated with the survival time of patients is shown in Figure 7D. Surprisingly, 30 of the 43 (69.8%) informative genes obtained by scLDS2 predict the survival time of patients, whereas it is 54.8% and 45.5% for PCA and NMF, respectively. These results indicate that scLDS2 is more likely than baselines to select genes associated with outcomes of patients, indicating the superiority of the proposed algorithm. To further investigate the biological functions of informative genes, enrichment analysis is performed using the software Metascape [[67]], and the typical functions are highly related to brain development (P = 5.4E-3, Hypergeometric test) and focal adhesion (P = 2.0E-4, Hypergeometric test). The results on the Pancreatic dataset further demonstrate that scLDS2 also precisely selects informative genes (Supplementary materials SFig. 9).

Discussion

The accumulated scRNA-seq data provide an opportunity to exploit the heterogeneity of tissues, which paves the way to explore the underlying mechanisms of biological systems. Even though great efforts have been devoted to this issue, effective and efficient computational algorithms for the identification of rare cell types are still lacking, because the heterogeneity, noise and small samples for some cell types pose a great challenge to the design of such algorithms.

In this study, we propose a deep generative model (scLDS2) for the identification of rare cell types from scRNA-seq data, where the generator in scLDS2 is reconstructed to estimate the distribution of rare cells (Figure 1). First, we demonstrate that the proposed algorithm precisely identifies rare cell types in artificial data (Figure 3 and Supplementary materials SFig. 1). Then, we demonstrate that the proposed algorithm is more accurate in characterizing and identifying rare cell types in real biological data, which cannot be extracted by current algorithms (Figure 4, Supplementary materials SFig. 2 and SFig. 3). Moreover, the experimental results on 20 scRNA-seq datasets demonstrate that scLDS2 significantly outperforms baselines on the identification of cell types in terms of various measurements, indicating the superiority of the proposed algorithm (Figure 5 and Supplementary materials SFig. 4). Furthermore, we demonstrate that scLDS2 improves the quality of features of cells (Figure 6, Supplementary materials SFig. 5, SFig. 6, SFig. 7 and SFig. 8), which facilitates clustering of cells in scRNA-seq data. Finally, scLDS2 provides an efficient strategy to select informative genes that can predict the survival time of patients (Figure 7 and Supplementary materials SFig. 9).

We see ample opportunities to improve on the basic concept of scLDS2 in future work. First, integrating multi-omic data is promising for enhancing the accuracy and interpretability of algorithms. Second, scLDS2 fails to make use of indirect relations among cells and genes, and how to exploit biological networks is also interesting for the exploration of implicit patterns. Third, how to accelerate scLDS2 is also important, particularly for scRNA-seq datasets with large-scale samples. Fourth, rare cell types correspond to the transition status of cell proliferation, which is critical for their identification. How to generate the samples of rare cells by simultaneously taking into account the structure of cell types and relations among various cell types is also promising for the analysis of scRNA-seq data.

Key Points

A novel deep generative model scLDS2 for leveraging the small samples of cells is developed by precisely estimating the distribution of cell subpopulations, which discriminates rare and non-rare cell types with deep adversarial learning.

To improve the interpretability of samples, scLDS2 generates the sparse and structural faked samples of cells by imposing sparse and structural constraints, improving the performance and robustness of algorithms for the identification of cell types.

To fully model cell types, scLDS2 simultaneously joins the generation of samples and feature extraction, which provides a better and more comprehensive strategy to characterize the structure of cell types.

The experimental results on 20 scRNA-seq datasets demonstrate that scLDS2 significantly outperforms 17 state-of-the-art methods for the identification of cell types in terms of various measurements with 25.12% improvement in ARI on average.

Acknowledgements

The authors thank the members of the Ma lab for helpful discussion, and appreciate the researchers who provide us with source code for a comparison. The authors thank reviewers for their time and suggestions.

Funding

This work was supported by the Shaanxi Natural Science Funds for Distinguished Young Scholars (No. 2022JC-38), Key Research and Development Program of Gansu (Program No. 21YF5GA063), the Fundamental Research Funds for the Central Universities and the Innovation Fund of Xidian University (No. YJS2205).

Author Biographies

Haiyue Wang received the M.S. degree in computer technology from Shandong Normal University, China, in 2019. She is currently working toward the Ph.D. degree in Computer Science and Technology at Xidian University (P.R. China). Her current research interests include deep learning, machine learning, data mining and bioinformatics.

Xiaoke Ma received his Ph.D. degree in computer science from Xidian University in 2012. He was a postdoctoral researcher at the University of Iowa (USA) during 2012–2015. He is a full professor at the School of Computer Science and Technology, Xidian University (P.R. China). His research interests include machine learning, data mining and bioinformatics. He is an ad hoc reviewer for many international journals and has published about 100 papers in peer-reviewed international journals, such as IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Cybernetics, Pattern Recognition, Information Sciences, Bioinformatics, PLoS Computational Biology, Nucleic Acids Research, IEEE Transactions on Computational Biology and Bioinformatics and IEEE Transactions on NanoBioscience.

REFERENCES

1 Cusanovich DA, Reddington JP, Garfield DA, et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature 2018 ; 555 (7697): 538 – 42. Google Scholar Crossref Search ADS PubMed WorldCat

2 Chiou J, Geusz RJ, Okino M-L, et al. Interpreting type 1 diabetes risk with genetics and single-cell epigenomics. Nature 2021 ; 594 (7863): 398 – 402. Google Scholar Crossref Search ADS PubMed WorldCat

3 Dong J, Zhao M, Liu Y, et al. Deep learning in retrosynthesis planning: datasets, models and tools. Brief Bioinform 2022 ; 23 (1):bbab391. Google Scholar OpenURL Placeholder Text WorldCat

4 Wang RP, Dang MH, Harada K, et al. Single-cell dissection of intratumoral heterogeneity and lineage diversity in metastatic gastric adenocarcinoma. Nat Med 2021 ; 27 (1): 141 – 51. Google Scholar Crossref Search ADS PubMed WorldCat

5 Kowalczyk T, Pontious A, Englund C, et al. Intermediate neuronal progenitors (basal progenitors) produce pyramidal–projection neurons for all layers of cerebral cortex. Cereb Cortex 2009 ; 19 (10): 2439 – 50. Google Scholar Crossref Search ADS PubMed WorldCat

6 Han XP, Wang RY, Zhou YC, et al. Mapping the mouse cell atlas by microwell-seq. Cell 2018 ; 172 (5): 1091 – 107. Google Scholar Crossref Search ADS PubMed WorldCat

7 Tang FC, Barbacioru C, Wang YZ, et al. mrna-seq whole-transcriptome analysis of a single cell. Nat Methods 2009 ; 6 (5): 377 – 82. Google Scholar Crossref Search ADS PubMed WorldCat

8 Ramskold D, Luo SJ, Wang YC, et al. Full-length mrna-seq from single-cell levels of rna and individual circulating tumor cells. Nat Biotechnol 2012 ; 30 (8): 777 – 82. Google Scholar Crossref Search ADS PubMed WorldCat

9 Kumar RM, Cahan P, Shalek AK, et al. Deconstructing transcriptional heterogeneity in pluripotent stem cells. Nature 2014 ; 516 (7529): 56 – 61. Google Scholar Crossref Search ADS PubMed WorldCat

Petegrosso R, Li ZL, Kuang R. Machine learning and statistical methods for clustering single-cell rna-sequencing data. Brief Bioinform 2020 ; 21 (4): 1209 – 23. Google Scholar Crossref Search ADS PubMed WorldCat

Qiu P. Embracing the dropouts in single-cell rna-seq analysis. Nat Commun 2020 ; 11 (1): 1 – 9. Google Scholar PubMed OpenURL Placeholder Text WorldCat

Dai C, Jiang Y, Yin C, et al. scimc: a platform for benchmarking comparison and visualization analysis of scrna-seq data imputation methods. Nucleic Acids Res 2022 ; 50 (9): 4877 – 99. Google Scholar Crossref Search ADS PubMed WorldCat

Qi R, Ma AJ, Ma Q, et al. Clustering and classification methods for single-cell rna-sequencing data. Brief Bioinform 2020 ; 21 (4): 1196 – 208. Google Scholar Crossref Search ADS PubMed WorldCat

Zhu X, Ching T, Pan XH, et al. Detecting heterogeneity in single-cell rna-seq data by non-negative matrix factorization. PeerJ 2017 ; 5 :e2888. Google Scholar OpenURL Placeholder Text WorldCat

Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell rna-seq data. Nat Rev Genet 2019 ; 20 (5): 273 – 82. Google Scholar Crossref Search ADS PubMed WorldCat

Lin PJ, Troup M, Ho J. Cidr: Ultrafast and accurate clustering through imputation for single-cell rna-seq data. Genome Biol 2017 ; 18 (1): 1 – 11. Google Scholar PubMed OpenURL Placeholder Text WorldCat

Eraslan G, Simon LM, Mircea M, et al. Single-cell rna-seq denoising using a deep count autoencoder. Nat Commun 2019 ; 10 (1): 1 – 14. Google Scholar Crossref Search ADS PubMed WorldCat

Brennecke P, Anders S, Kim JK, et al. Accounting for technical noise in single-cell rna-seq experiments. Nat Methods 2013 ; 10 (11): 1093. Google Scholar Crossref Search ADS PubMed WorldCat

Wold S, Esbensen K, Geladi P. Principal component analysis. Chemom Intel Lab Syst 1987 ; 2 (1–3): 37 – 52. Google Scholar OpenURL Placeholder Text WorldCat

Becht E, McInnes L, Healy J, et al. Dimensionality reduction for visualizing single-cell data using umap. Nat Biotechnol 2019 ; 37 (1): 38 – 44. Google Scholar Crossref Search ADS WorldCat

Zhou B, Jin WF. Visualization of single cell rna-seq data using t-sne in r. In: Stem Cell Transcriptional Networks. Springer, 2020, 159 – 67. Google Scholar Crossref Search ADS Google Preview WorldCat COPAC

Grun D, Lyubimova A, Kester L, et al. Single-cell messenger rna sequencing reveals rare intestinal cell types. Nature 2015 ; 525 (7568): 251 – 5. Google Scholar Crossref Search ADS PubMed WorldCat

Rani Y, Rohil H. A study of hierarchical clustering algorithm. ter S & on Te SIT 2013 ; 2 : 113. Google Scholar OpenURL Placeholder Text WorldCat

Kiselev VY, Kirschner K, Schaub MT, et al. Sc3: consensus clustering of single-cell rna-seq data. Nat Methods 2017 ; 14 (5): 483 – 6. Google Scholar Crossref Search ADS PubMed WorldCat

Yang YC, Huh R, Culpepper HW, et al. Safe-clustering: single-cell aggregated (from ensemble) clustering for single-cell rna-seq data. Bioinformatics 2019 ; 35 (8): 1269 – 77. Google Scholar Crossref Search ADS PubMed WorldCat

Huh R, Yang Y, Jiang Y, et al. Same-clustering: Single-cell aggregated clustering via mixture model ensemble, 2019.

Zhu LX, Lei J, Klei L, et al. Semisoft clustering of single-cell data. Proc Natl Acad Sci 2019 ; 116 (2): 466 – 71. Google Scholar Crossref Search ADS PubMed WorldCat

Wu WM, Ma XK. Joint learning dimension reduction and clustering of single-cell rna-sequencing data. Bioinformatics 2020 ; 36 (12): 3825 – 32. Google Scholar Crossref Search ADS PubMed WorldCat

Wu WM, Liu ZY, Ma XK. jsrc: a flexible and accurate joint learning algorithm for clustering of single-cell rna-sequencing data. Brief Bioinform 2021. Google Scholar OpenURL Placeholder Text WorldCat

Li XJ, Wang K, Lyu YF, et al. Deep learning enables accurate clustering with batch effect removal in single-cell rna-seq analysis. Nat Commun 2020 ; 11 (1): 1 – 14. Google Scholar PubMed OpenURL Placeholder Text WorldCat

Elmarakeby H, Hwang J, Arafeh R, et al. Biologically informed deep neural network for prostate cancer discovery. Nature 598 : 348 – 52. Crossref Search ADS PubMed WorldCat

Dong ZY, Alterovitz G. netae: semi-supervised dimensionality reduction of single-cell rna sequencing to facilitate cell labeling. Bioinformatics 2021 ; 37 (1): 43 – 9. Google Scholar Crossref Search ADS PubMed WorldCat

Gronbech CH, Vording MF, Timshel PN, et al. scVAE: variational auto-encoders for single-cell gene expression data. Bioinformatics 2020;36(16):4415–22.

Yu B, Chen C, Qi R, et al. scGMAI: a Gaussian mixture model for clustering single-cell RNA-seq data based on deep autoencoder. Brief Bioinform 2020;7453:1–10.

Tian T, Wan J, Song Q, et al. Clustering single-cell RNA-seq data with a model-based deep learning approach. Nature Machine Intelligence 2019;1(4):191–8.

Nagy C, Bahrami M, Maitra M. Deep feature extraction of single-cell transcriptomes by generative adversarial network. Bioinformatics 2021;37(10):1345–51.

Mukherjee S, Asnani H, Lin E, et al. ClusterGAN: latent space clustering in generative adversarial networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 2019, 4610–7.

Goodfellow IJ, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.

Xu YG, Zhang ZG, You L, et al. scIGANs: single-cell RNA-seq imputation using generative adversarial networks. Nucleic Acids Res 2020;48(15):e85.

Ghahramani A, Watt FM, Luscombe NM. Generative adversarial networks simulate gene expression and predict perturbations in single cells. bioRxiv 2018;262501.

Marouf M, Machart P, Bansal V, et al. Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nat Commun 2020;11(1):1–12.

Jaggi M, Sulovsky M. A simple algorithm for nuclear norm regularized problems. In: ICML, 2010.

Das D, George CSL. A two-stage approach to few-shot learning for image recognition. IEEE Trans Image Process 2019;29:3336–50.

Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol 2017;18(1):1–15.

Baron M, Veres A, Wolock SL, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Systems 2016;3(4):346–60.

Biase FH, Cao XY, Zhong S. Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing. Genome Res 2014;24(11):1787–96.

Ting DT, Wittner BS, Ligorio M, et al. Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells. Cell Rep 2014;8(6):1905–18.

Camp JG, Badsha F, Florio M, et al. Human cerebral organoids recapitulate gene expression programs of fetal neocortex development. Proc Natl Acad Sci 2015;112(51):15672–7.

Tian L, Dong X, Freytag S, et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat Methods 2019;16(6):479–87.

Klein AM, Mazutis L, Akartuna I, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 2015;161(5):1187–201.

Zeisel A, Munoz-Manchado AB, Codeluppi S, et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 2015;347(6226):1138–42.

Tirosh I, Izar B, Prakadan SM, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 2016;352(6282):189–96.

Giustacchini A, Thongjuea S, Barkas N, et al. Single-cell transcriptomics uncovers distinct molecular signatures of stem cells in chronic myeloid leukemia. Nat Med 2017;23(6):692.

Liao MF, Liu Y, Yuan J, et al. Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19. Nat Med 2020;26(6):842–4.

Schaum N, Karkanias J, Neff NF, et al. Single-cell transcriptomic characterization of 20 organs and tissues from individual mice creates a Tabula Muris. bioRxiv 2018;237446.

Pascual-Montano A, Carazo J, Kochi K, et al. Nonsmooth nonnegative matrix factorization (nsNMF). IEEE Trans Pattern Anal Mach Intell 2006;28(3):403–15.

Xie K, Huang Y, Zeng F, Liu Z, et al. scAIDE: clustering of large-scale single-cell RNA-seq data reveals putative and rare cell types. NAR Genomics and Bioinformatics 2020;2(4):lqaa082.

Dong R, Yuan G-C. GiniClust3: a fast and memory-efficient tool for rare cell type identification. BMC Bioinformatics 2020;21(1):1–7.

Wang B, Zhu JJ, Pierson E, et al. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods 2017;14(4):414–6.

von Luxburg U. A tutorial on spectral clustering. Statistics and Computing 2007;17(4):395–416.

Satija R, Farrell J, Gennert D, et al. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 2015;33(5):495–502.

Hubert L, Arabie P. Comparing partitions. Journal of Classification 1985;2(1):193–218.

Forbes SA, Beare D, Gunasekaran P, et al. COSMIC: exploring the world's knowledge of somatic mutations in human cancer. Nucleic Acids Res 2015;43(D1):D805–11.

Gibert B, Delloye-Bourgeois C, Gattolliat C-H, et al. Regulation by miR181 family of the dependence receptor CDON tumor suppressive activity in neuroblastoma. JNCI: Journal of the National Cancer Institute 2014;106(11).

Abdollahi MR, Morrison E, Sirey T, et al. Mutation of the variant α-tubulin TUBA8 results in polymicrogyria with optic nerve hypoplasia. The American Journal of Human Genetics 2009;85(5):737–44.

Zeng QQ, Michael IP, Zhang P, et al. Synaptic proximity enables NMDAR signalling to promote brain metastasis. Nature 2019;573(7775):526–31.

Zhou YY, Zhou B, Pache L, et al. Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nat Commun 2019;10(1):1–10.

By Haiyue Wang and Xiaoke Ma
