Determination of essential phenotypic elements of clusters in high-dimensional entities—DEPECHE
Axel Theorell, Yenan Troi Bryceson, Jakob Theorell

Affiliations: IBG-1: Biotechnology, Institute of Bio- and Geosciences, Forschungszentrum Jülich GmbH, Jülich, North Rhine-Westphalia, Germany; Center for Hematology and Regenerative Medicine, Department of Medicine Huddinge, Karolinska Institutet, Stockholm, Sweden; Broegelmann Research Laboratory, Department of Clinical Medicine, University of Bergen, Bergen, Norway; Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, United Kingdom

Editor: Zhi Wei, New Jersey Institute of Technology, United States
Technological advances have facilitated an exponential increase in the amount of information that can be derived from single cells, necessitating new computational tools that can make such highly complex data interpretable. Here, we introduce DEPECHE, a rapid, parameter-free, sparse k-means-based algorithm for clustering of multi- and megavariate single-cell data. In a number of computational benchmarks aimed at evaluating the capacity to form biologically relevant clusters, including flow/mass-cytometry and single-cell RNA sequencing datasets with manually curated gold standard solutions, DEPECHE clusters as well as or better than the currently available best-performing clustering algorithms. The main advantage of DEPECHE over the state of the art, however, is its unique ability to enhance interpretability of the formed clusters, in that it only retains variables relevant for cluster separation, thereby facilitating computationally efficient analyses as well as understanding of complex datasets. DEPECHE is implemented in the open source R package DepecheR, currently available at github.com/Theorell/DepecheR.
Data Availability Statement: All the data presented in the paper is freely available. All data sources are described in the S2 File.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.
Since the introduction of the first single colour flow cytometers in the 1960s, there has been a remarkable increase in the complexity of data that can be generated with single-cell resolution. Currently, flow and mass cytometers able to simultaneously assess up to 40 cellular traits are becoming widely available [1]. In parallel, the development of high-throughput sequencing technology has facilitated single-cell transcriptomic analyses gauging expression of thousands of distinct transcripts [2]. Furthermore, development of high-resolution single-cell proteomic analyses is underway, similarly providing highly complex datasets [3].
These technological advances necessitate new computational approaches to analyses of multi- and megavariate single-cell data [4]. Previous algorithms have contributed to automating analyses, thereby enhancing reproducibility and avoiding a need for a priori biological knowledge. Thus, these algorithms have major advantages over the classical analysis strategies for cytometry data, which are based on manually defined uni- or bivariate filters, commonly referred to as gates. Automated analysis algorithms, not restricted to uni- or bivariate displays of the data, have also made it possible to extract much more of the information embedded in multivariate data. Furthermore, as they generally scale well, automated analysis algorithms can also be applied to datasets where manual gating is impossible, such as single-cell transcriptomic data with thousands of transcripts assessed for each cell. To date, however, analysis strategies based on manual gating still dominate where they can be used and are still considered the gold standard for cytometry data analysis. One likely reason for this is that cell types defined by manual gates are easy to comprehend, as they are defined by few markers.
In an attempt to combine the objectiveness, reproducibility and scalability of automated
analysis pipelines with the high interpretability of manual gating strategies, we have developed an
algorithm termed Determination of Essential Phenotypic Elements of Clusters in
High-dimensional Entities (DEPECHE). DEPECHE simultaneously clusters and simplifies the data by
identifying the variables that contribute to separate individual clusters from the rest of the
data. We have implemented DEPECHE in R [9] (in the open source package DepecheR), providing a complete software suite for statistical analysis and visualization of single-cell omics data.
Results and discussion
At its core, DEPECHE uses a penalized k-means clustering algorithm, related to the standard k-means algorithm [10]. Both k-means and penalized k-means cluster data by fitting k cluster centers to the data. In k-means, the fitting objective is to minimize the sum of squared distances from each data point to its closest cluster center. In penalized k-means, the k-means objective is complemented by a penalization objective that minimizes the sum of the L1-norms of the cluster centers, thereby moving them towards the origin.
Generally, when measuring a large number of biological variables, observations tend to
agglomerate around certain positions in some variables, whereas in other variables, the
observations are spread more or less evenly. Due to their even spread, the latter variables can be
considered uninformative. In this situation, the dual clustering objective of penalized k-means will
fit the cluster centers to the observation agglomerates in the informative dimensions, whereas
the penalty will draw the cluster centers to the origin in the uninformative dimensions and
thus discard them. The resulting cluster definitions are referred to as sparse, as the clusters are
only defined in the informative dimensions. The relative importance of the k-means objective and the penalization objective in the penalized k-means algorithm is controlled by the penalty parameter λ. A high penalty λ implies that many variables will be considered uninformative, implying that many variables will be discarded. When all variables of a cluster are discarded, the cluster is discarded too. Thus, a high penalty λ results in high sparsity and a low cluster resolution, and vice versa [11] (see Methods, "Clustering with DEPECHE"). A key feature of penalized k-means is that the penalty addresses each variable in each cluster center independently, implying that different sparsity patterns will emerge for distinct clusters. This feature differentiates penalized k-means from sparse k-means [12] and regularized k-means [13], which only identify variables that are uninformative for all clusters. To illustrate this, a 10,000-point, 11-variate dataset with 10 normally distributed clusters was constructed. In this dataset,
10 of the variables contributed to the separation of two clusters (S1 Fig). In this example, sparse
k-means could only exclude the single variable, which did not define any clusters, whereas
DEPECHE correctly excluded all nine dispensable variables from each cluster individually.
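For readers who want to reproduce the flavor of this comparison, the base-R sketch below builds a dataset with the same overall structure: 10 Gaussian clusters in 11 dimensions, where each cluster is displaced from the origin in only two of the ten informative variables and the eleventh variable is pure noise. The specific means, spreads and seed are illustrative assumptions, not the exact parameters behind S1 Fig.

# Illustrative reconstruction (assumed parameters) of a dataset with the same
# structure: 10 clusters, 11 variables, where the 11th variable never separates
# any cluster and each informative variable contributes to separating two clusters.
set.seed(1)
n_per_cluster <- 1000
n_clusters <- 10
n_vars <- 11

synth <- do.call(rbind, lapply(seq_len(n_clusters), function(m) {
  center <- rep(0, n_vars)
  center[m] <- 3                 # cluster m deviates in variable m ...
  center[(m %% 10) + 1] <- 3     # ... and in the "next" informative variable
  sweep(matrix(rnorm(n_per_cluster * n_vars), ncol = n_vars), 2, center, "+")
}))
true_labels <- rep(seq_len(n_clusters), each = n_per_cluster)

# Standard k-means (stats package) keeps all 11 variables in every cluster
# definition, whereas DEPECHE would discard the non-contributing ones per cluster.
km <- kmeans(synth, centers = n_clusters, nstart = 10)
table(true_labels, km$cluster)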
Note that if k is chosen so large that at least one cluster is discarded, the resolution of the emerging clusters depends entirely on the magnitude of the penalty λ and not on k, since DEPECHE redistributes observations belonging to clusters that are pulled to the origin to the remaining clusters. In DEPECHE, the penalty λ is tuned to identify the most reproducible clustering resolution, here termed the "optimal resolution". To illustrate what we mean by
reproducible, we constructed an example, featuring a bi-variate dataset D (Fig 1A). Visually,
dataset D contains three clusters, where the centers of the two larger clusters are located close
to either axis. For these two clusters, one variable is sufficient to define their position. Now, if
multiple datasets were generated from the same data source as D, for example by repeated
experiments, we assume that they would contain the same clusters. Hence, the optimal penalty
λi (that corresponds to the optimal resolution of D) is defined as the penalty for which the resulting clusters have the highest similarity, where similarity is measured by the Adjusted Rand Index (ARI) [25] (see Methods, "Tuning the penalty"), when clustering the generated datasets independently. In contrast, when clustering the same datasets with a penalty λ that differs significantly from the optimal penalty, the stochastic differences between the datasets are likely to induce solutions that deviate in cluster number, number of defining variables, and cluster center positions. In practice, DEPECHE tests a range of penalty values (λ1 < … < λNλ), each on a collection of dataset pairs which are generated by sampling Nr data points from D (Fig 1B) with replacement. The optimal resolution is defined as the penalty λi, which yields the
lowest average variability within each dataset pair, as measured by the ARI. In our example, this corresponds to the penalty λi that yields 3 clusters, since 3 similar clusters are identified in each resampled dataset of D (Fig 1C). The penalties λ1 and λNλ are considered suboptimal, since with these penalties, the stochastic differences in the resampled datasets lead to less coherent clustering results compared to results obtained with the optimal penalty λi. From
here (Fig 1C), DEPECHE uses one of two alternative routes. If the dataset D has few data points, the full dataset D is clustered using the optimal penalty λi (Fig 1D). If instead the number of data points in the dataset D is high (default in DepecheR >10^4), the most generalizable cluster centers that were produced using the optimal penalty are chosen (see Methods, "Simultaneous Clustering and Parameter Tuning") and the data points of D are allocated directly to their closest cluster center (Fig 1E). This default setting makes rapid, iterated clustering of large datasets possible (see calculation times in S2 Fig).
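The allocation route in Fig 1E is a plain nearest-center assignment. As a minimal base-R sketch (not the DepecheR implementation), such an allocation step could be written as follows:

# Allocate each row of `data` to the closest row of `centers` (Euclidean
# distance), as in the Fig 1E allocation step. Sparse cluster centers simply
# carry zeros in the dimensions that do not define them.
allocate_to_centers <- function(data, centers) {
  # Squared distances between every data point and every cluster center.
  d2 <- outer(rowSums(data^2), rep(1, nrow(centers))) -
    2 * data %*% t(centers) +
    outer(rep(1, nrow(data)), rowSums(centers^2))
  max.col(-d2)  # index of the nearest center for each data point
}

Because the centers are sparse, dimensions that a cluster does not define enter the distance only through a center coordinate of zero.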
To evaluate how biologically accurate DEPECHE clustering is on mass cytometry data, a
32-variate mass cytometry bone marrow dataset [14] was clustered, and the overlap with 14
visually pre-defined cell populations was quantified using the ARI. With this dataset, DEPECHE
identified 7 clusters at the optimal resolution, corresponding to all large pre-defined cell
populations and to agglomerates of smaller cell populations, rendering an average ARI of 0.96,
where an ARI of 1 corresponds to exact reproduction and an ARI of 0 means that the produced
clusters are no more accurate than random allocation (Fig 2A and 2B, S2A Fig). Furthermore,
using the DEPECHE algorithm, the number of variables defining each cluster was reduced
from 32 to a range from 8 to 28, thereby enhancing interpretability (Fig 2C). When comparing
to other state-of-the-art clustering algorithms [15–19], DEPECHE obtained ARIs similar to those of the best-performing algorithms for both the 32-variate dataset and another 14-variate, 24-population mass cytometry dataset [20] (S3A Fig, Table 1).
Single-cell transcriptomic datasets feature tens of thousands of variables. Thus, compared
to cytometry datasets, the need to exclude variables of low interest is even more pressing. We
therefore evaluated DEPECHE's ability to cluster and extract the key transcripts defining clusters of a previously published single-cell transcriptomic dataset (Fig 2D–2F) [21]. In this dataset, a total of 648 ILC1, ILC2, ILC3 and NK cells from three donors' tonsils were index-sorted prior to RNA sequencing. Hence, these cell types, defined by expression of specific sets of markers using manual gating, can be compared to clusters unbiasedly determined by RNA
Fig 1. Illustration of the DEPECHE workflow. A) The original dataset D. B) Resampled datasets, with Nr data points per dataset, are generated by sampling data points from D with replacement. Each resampled dataset has a corresponding penalty λi (i = 1…Nλ). C) Each dataset in B is clustered with penalized k-means, using its corresponding penalty λi. The red frame highlights the clustering with the strongest attractors, i.e. the most generalizable solution (see Methods, "Simultaneous Clustering and Parameter Tuning"). D, E) Finally, the full dataset is clustered. In cases where it is computationally feasible, the full dataset is re-clustered with the optimal penalty (D). Otherwise, the final clustering is produced by allocating each data point to its closest cluster center, using the most generalizable cluster center solution produced with the optimal penalty in B (E).
Fig 2. DEPECHE performance with real datasets with 32 or 35177 variables. A–B) Bi-variate t-distributed stochastic neighbor embedding (tSNE) representation of the 32-variate mass cytometry data. A) Distribution of manually defined cell populations over the tSNE field. B) Distribution of DEPECHE clusters over the tSNE field. C) Heatmap of the expression pattern at each cluster center. Light colors show that the marker is highly expressed at the cluster center, whereas dark colors show the opposite. Grey color indicates that the variable in question does not contribute to defining the cluster. For Fig 2A–2C, all 104184 cells have been clustered. D–E) tSNE representation of the 137-variate data subset that could efficiently distinguish the clusters in the 35177-variate single-cell transcriptome dataset. D) Distribution of the cell types defined by index-sorting and manual gating on protein expression profiles, shown over the tSNE field. E) Distribution of DEPECHE clusters over the tSNE field. F) Violin plots illustrating the overlap between the original analysis by Björklund et al and the DEPECHE analysis. For each subplot, the left and right side illustrate the distribution of the transcripts defining the clusters, and all other transcripts, respectively. The y-axis shows the log10 of the p-values in the original analysis, adjusted for multiple comparisons. For Fig 2D–2F, all 648 cells have been included.
expression profiles [21]. In the DEPECHE analysis, no pre-selection of transcripts was performed, and hence, 35177 unique transcripts were included for each of the 648 cells. With the optimal penalty λ, four clusters were identified (Fig 2E). These corresponded well to the cell types as defined by protein expression; 84, 97, 91 and 97 percent of ILC1, ILC2, ILC3 and NK cells sorted into separate clusters, respectively (Fig 2D and 2E, S2B Fig), leading to an average ARI of 0.78, which was comparable to the clustering obtained in the original publication [21]. Notably, clusters 1–4, corresponding to NK cells, ILC1, ILC3 and ILC2, were defined by 10, 27, 108 and 27 transcripts, respectively (Figs 2F and 3), leading to a 99.9% average decrease in the number of variables. The transcripts identified to define the clusters in our analysis corresponded to those identified as most differentially expressed in the original study [21] (Fig 2F).
Thus, by identifying a finite number of variables, DEPECHE analysis can increase interpretability and aid downstream analyses. When DEPECHE clustering was compared to that of state-of-the-art algorithms [5–7] on the aforementioned dataset as well as six other datasets (see Table 1), it consistently performed well, as indicated by ARI (S3B Fig). Thus, when applied to megavariate single-cell transcriptomic data, DEPECHE identifies clusters corresponding to known cell types and reduces the complexity of the results thousand-fold by finding a subset of markers that define each cluster in a way that compares well with previous knowledge. Notably,
as DEPECHE selects the most robust identifiers of each cluster, some potentially interesting
transcripts might be excluded by DEPECHE clustering, e.g. transcripts that are highly
expressed in a minority, but lacking from a majority of cells in a cluster.
In conclusion, DEPECHE turns the penalized k-means methodology into a parameter-free analysis technique guided by efficient calculation of the optimal clustering resolution. By doing so, it can simultaneously address the problems of finding biologically relevant clusters and identifying specific variables that define these clusters. This is crucial in order to comprehend the noisy and often over-complicated data generated with current single-cell technologies.
Methods

Clustering with DEPECHE
Clustering in DEPECHE is performed using a penalized version of the standard k-means algorithm [10]. In this section, the k-means algorithm is outlined, followed by an explanation of how it is extended to penalized k-means.
The k-means algorithm clusters data by fitting a mixture of normal distributions to the data
with k equal mixture components and unit variance. Formally, k d-dimensional cluster centers,
denoted μm,j, where m = 1…k and j = 1…d, are fitted to the n d-dimensional data points xl,j, where l = 1…n, by maximizing the score function

Q = -\sum_{m=1}^{k} \sum_{l=1}^{n} z_{m,l} \sum_{j=1}^{d} (x_{l,j} - \mu_{m,j})^2   (1)

where zm,l is 1 if the lth data point belongs to the mth cluster and zero otherwise.

Fig 3. Transcripts defining clusters in the Björklund et al dataset. Blue background color indicates 10 transcripts associated with cluster 1/NK cells. Violet background color indicates 27 transcripts associated with cluster 2/ILC1. Green background color indicates 27 transcripts associated with cluster 4/ILC2. Yellow background color indicates 108 transcripts associated with cluster 3/ILC3.

The score Q is
optimized using an Expectation Maximization (EM) algorithm [22], i.e. so-called E- and M-steps are iterated alternately until the score Q stops improving. In the E-step, the allocation variables zm,l are updated so that each data point is allocated to its closest cluster according to the Euclidean norm. That means that zm,l = 1 if the mth cluster center is the cluster center closest to the lth data point and zm,l = 0 otherwise [23]. In the M-step, each cluster center μm,j is moved to the center of the data points allocated to it. When no more reallocation occurs in the E-step, the algorithm has converged.
In order to reduce the influence of uninformative dimensions that only contribute noise, penalized k-means adds an L1-penalty on each element of each cluster center μm,j to the optimization objective. Formally, the score function Q in Eq (1) is updated to

Q = -\sum_{m=1}^{k} \sum_{l=1}^{n} z_{m,l} \sum_{j=1}^{d} (x_{l,j} - \mu_{m,j})^2 - \lambda \sum_{m=1}^{k} \sum_{j=1}^{d} |\mu_{m,j}|   (2)

where λ is a positive penalty term. The additional term in the score function introduced in Eq (2) results in a change in the M-step of the original EM-algorithm of the k-means algorithm. Keeping zm,l for all l fixed and optimizing Q with respect to μm,j, the M-step becomes

\mu_{m,j} = \mathrm{sign}(\bar{x}_{m,j}) \, \max\left(0, \; |\bar{x}_{m,j}| - \frac{\lambda}{2 n_m}\right)

where x̄m,j denotes the mean, in dimension j, of the data points currently allocated to cluster m, and nm denotes the number of such points. Depending on the choice of the penalty parameter λ, some components of some cluster centers will be set to 0 in the M-step. Note that penalized k-means with penalty λ = 0 reduces to the original k-means algorithm.
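Written as code, the modified M-step is a soft-thresholding of the ordinary cluster mean. The base-R sketch below follows the update above; the exact scaling of the threshold inside DepecheR may differ.

# Penalized M-step for one cluster: soft-threshold the mean of the points
# currently allocated to the cluster. Coordinates whose mean is small relative
# to the penalty are set exactly to zero, i.e. left out of the cluster definition.
penalized_m_step <- function(cluster_points, lambda) {
  x_bar <- colMeans(cluster_points)   # the ordinary k-means M-step result
  n_m <- nrow(cluster_points)
  sign(x_bar) * pmax(0, abs(x_bar) - lambda / (2 * n_m))
}
# With lambda = 0 this reduces to the ordinary k-means M-step, and a cluster
# whose every coordinate is thresholded to zero is eliminated by DEPECHE.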
In DEPECHE, cluster centers that are moved to the origin in the M-step are eliminated and not assigned any data points in the E-step. Due to the elimination of clusters, the number of produced clusters is independent of k and dependent on the penalty λ, as long as at least one cluster is eliminated. In DEPECHE, k is always chosen to be so large that at least one cluster is eliminated. Eq (2) is a special case of the penalized model-based clustering algorithm by Pan and Shen with unit variance and equal mixture components [11]. By imposing the penalty on each dimension and each cluster, penalized k-means identifies the dimensions that do not distinguish a particular cluster from the rest of the data, thus leaving these dimensions out of the definition of that cluster.
Penalized k-means, as well as k-means, relies on a procedure for initializing the positions of the cluster centers. Therefore, in DepecheR, the initial cluster positions are generated using the well-performing seed generation algorithm of k-means++ by Arthur and Vassilvitskii [24]. The early iterations of the EM-algorithm are particularly delicate in DEPECHE, due to the elimination of clusters at the origin in the E-step. Poor initialization of the clusters, in combination with a high penalty λi, might lead to elimination of too many clusters in the early E-steps, yielding fewer clusters in the end result than necessary to optimize Q. To remedy this, DEPECHE starts each run of penalized k-means, regardless of the chosen λi, with penalty λ = 0. The penalty is then increased linearly over a number of E-steps until it reaches the predetermined value λi.
The EM-algorithm guarantees convergence to an optimum of the score Q, but not necessarily to the global optimum. In order to diminish the influence of the starting state, the EM-algorithm is run several times with random initialization, and the solution with the optimal score Q is chosen. For penalty optimization, the number of runs is determined according to a set of rules that are outlined in the section "Tuning the penalty". For producing the final results (Fig 1E), a fixed number of runs is performed (21 runs, i.e. 3 x (8-1) with 8 available processor cores, is the default in DepecheR). To further decrease stochastic variability in the computational results, k is set considerably higher than the expected number of final clusters, which also diminishes the sensitivity to the starting state. In the extreme case where k is set equal to the number of data points n, the outcome of penalized k-means is deterministic.
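For reference, a compact base-R sketch of k-means++ style seeding is given below: the first center is drawn uniformly at random and each subsequent center is drawn with probability proportional to the squared distance to the nearest center chosen so far. It illustrates the idea only; the internal seeding in DepecheR may differ in detail.

# k-means++ style seeding: returns k initial cluster centers drawn from the data.
kmeanspp_seeds <- function(data, k) {
  n <- nrow(data)
  centers <- matrix(NA_real_, nrow = k, ncol = ncol(data))
  centers[1, ] <- data[sample.int(n, 1), ]
  # Squared distance from every point to its closest already chosen center.
  min_d2 <- rowSums(sweep(data, 2, centers[1, ])^2)
  for (i in seq_len(k - 1) + 1) {
    prob <- min_d2 / sum(min_d2)
    centers[i, ] <- data[sample.int(n, 1, prob = prob), ]
    d2_new <- rowSums(sweep(data, 2, centers[i, ])^2)
    min_d2 <- pmin(min_d2, d2_new)
  }
  centers
}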
Tuning the penalty
In this section, we describe the optimization scheme used for tuning the linear penalty λ. The algorithm is outlined as follows (a code sketch of the loop follows the outline):
1. Choose a wide and exponentially distributed range of penalty terms λi, i = 1…Nλ, that are considered for clustering the dataset D.
2. Create 2 datasets per penalty term λi, called D1,i and D2,i, by sampling Nr data points from D with replacement.
3. Run the penalized k-means algorithm on the datasets D1,i and D2,i, yielding sets of cluster centers, denoted M1,i and M2,i.
4. Create the partitions P1,i and P2,i, by allocating all data points of the dataset D to their nearest cluster center of the sets M1,i and M2,i. The allocation is equivalent to one E-step of the penalized k-means algorithm.
5. Determine the Adjusted Rand Index (ARI), denoted r(λi), between P1,i and P2,i [25].
6. Repeat steps 2–5 and average the obtained ARIs r(λi) penalty-wise until a stopping criterion regarding the statistical certainty of the obtained ARIs r(λi) is met (remark 2, section "Tuning the penalty").
7. Choose the optimal penalty λi, which is the penalty with the largest ARI r(λi).
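To make the control flow of steps 2–7 concrete, the sketch below runs the resample–cluster–compare loop in base R. The function names are hypothetical, stats::kmeans is used only as a stand-in for the penalized k-means step, and the ARI here is computed exactly from the contingency table rather than approximated.

# Exact Adjusted Rand Index between two partitions (label vectors of equal length).
ari <- function(p1, p2) {
  tab <- table(p1, p2)
  comb2 <- function(x) sum(choose(x, 2))
  index <- comb2(tab)
  expected <- comb2(rowSums(tab)) * comb2(colSums(tab)) / choose(sum(tab), 2)
  max_index <- (comb2(rowSums(tab)) + comb2(colSums(tab))) / 2
  (index - expected) / (max_index - expected)
}

# Steps 2-5 for a grid of penalties: resample two datasets per penalty, cluster
# each, allocate the full dataset to both solutions and score their agreement.
tune_penalty_once <- function(data, penalties, cluster_fun,
                              n_resample = min(nrow(data), 1e4)) {
  nearest <- function(x, centers)
    apply(x, 1, function(p) which.min(colSums((t(centers) - p)^2)))
  sapply(penalties, function(lambda) {
    partitions <- lapply(1:2, function(i) {
      boot <- data[sample.int(nrow(data), n_resample, replace = TRUE), , drop = FALSE]
      nearest(data, cluster_fun(boot, lambda))  # cluster_fun returns cluster centers
    })
    ari(partitions[[1]], partitions[[2]])       # r(lambda) for this dataset pair
  })
}

# Example call, with stats::kmeans standing in for penalized k-means (k-means
# ignores the penalty, so this only illustrates the control flow of steps 2-7):
# r_lambda <- tune_penalty_once(as.matrix(iris[, 1:4]), 2^seq(0, 5, by = 0.5),
#                               function(d, lambda) kmeans(d, centers = 10)$centers)
# optimal_lambda <- 2^seq(0, 5, by = 0.5)[which.max(r_lambda)]

A few remarks on this scheme follow.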
1. To increase the likelihood of finding the optimal penalty value in the pre-defined range, datasets with very high kurtosis are log2-transformed, and all datasets are divided by their total standard deviation prior to sampling. With this setup, all tested datasets have their optimal penalty in the default range. However, new datasets might have other requirements for penalties. In cases where the most extreme low or high penalty value is selected by the optimization procedure, DepecheR will warn the user and suggest another range of penalties, or a larger sample size. The default penalty range in DepecheR is 2^0, 2^0.5, …, 2^5.
2. The repetition of step 6 is necessary, since the obtained ARI r(λi) is a random variable, due to the random procedure for creating the datasets D1,i and D2,i and the random procedure for initializing the penalized k-means algorithm. To determine the necessary number of runs, after an initial fixed number of runs (default 20 in DepecheR), DEPECHE uses three stopping criteria: the first criterion creates an interval of width 2 standard errors around the obtained mean of r(λi) and checks if the interval around the optimal ARI r(λi) has zero overlap with the other intervals. The second criterion checks whether the standard error of the mean of r(λi) for the optimal penalty λi is below a threshold. The third criterion terminates the process after a maximal number of runs (default 100 in DepecheR).
3. Step 2 requires a sample size Nr. A natural choice is to set Nr equal to the number of data points, n. However, in cases where n is very large, so that the computational load of the optimization scheme becomes limiting, it is preferable to choose a smaller Nr. In DepecheR, Nr = 10^4 by default when n > 10^4. Notice that when an optimal penalty λi is discovered using a sample size Nr ≠ n, the corresponding optimal penalty when clustering the full dataset D of size n is (approximately) λi·n/Nr, since the attraction force of a cluster is proportional to the number of data points in it.
4. In step 5, the ARI is chosen as the similarity measure between the partitions, since it corrects for chance. This means that it gives a zero similarity for trivial partitions, such as having all data points in a single cluster. Exact calculation of the ARI is computationally intractable for large datasets. Therefore, DEPECHE relies on an approximate ARI computation, based on 10^4 random pairs of data points.
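One way such a pair-based approximation can be written is to estimate, from randomly sampled pairs, the probabilities that a pair is co-clustered in each partition and in both, and plug these into the pair-probability form of the ARI. The base-R sketch below illustrates this idea; it is not necessarily the estimator implemented in DepecheR.

# Approximate ARI between two partitions p1 and p2 (label vectors of equal
# length), estimated from n_pairs random pairs of distinct data points.
approx_ari <- function(p1, p2, n_pairs = 1e4) {
  n <- length(p1)
  i <- sample.int(n, n_pairs, replace = TRUE)
  j <- sample.int(n, n_pairs, replace = TRUE)
  keep <- i != j                       # discard self-pairs
  same1 <- p1[i[keep]] == p1[j[keep]]  # pair co-clustered in partition 1
  same2 <- p2[i[keep]] == p2[j[keep]]  # pair co-clustered in partition 2
  p11 <- mean(same1 & same2)           # co-clustered in both partitions
  q1 <- mean(same1)
  q2 <- mean(same2)
  # Pair-probability form of the ARI: (observed - expected) / (maximum - expected).
  (p11 - q1 * q2) / ((q1 + q2) / 2 - q1 * q2)
}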
Simultaneous clustering and parameter tuning
For very large datasets (n > 10^8), not only the penalty optimization, but also the final clustering once the optimal penalty has been found, may be computationally intractable. However, increasing the size of the dataset does not necessarily lead to an increase in the number of clusters at the optimal resolution. In this case, it is feasible to cluster a subset of the full dataset D to obtain cluster centers M and then allocate the remaining data points of D to their closest clusters in M. This boosts computational efficiency, since allocation imposes a much smaller computational load than clustering. Since several subsets of D are produced and clustered during the tuning of the penalty parameter λ, it is computationally favorable to retrieve cluster centers M that were produced during the parameter tuning and use them to allocate the data points of the full dataset.
When picking a set of cluster centers M from the penalty tuning, the question arises which set of centers M to take, since several sets of centers, denoted Mi,p (p = 1…Np), are produced for the optimal penalty λi. In DEPECHE, the center set Mi,p that has the strongest similarity (on average) to the remaining Np−1 center sets is chosen and is referred to as the most generalizable cluster set. The level of similarity between the center sets Mi,p' and Mi,p'' is quantified using the ARI between the partitions Pi,p' and Pi,p'', induced by allocating each data point of D to its closest cluster center in Mi,p' and Mi,p'', respectively.
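In code, picking the most generalizable center set amounts to an argmax over mean pairwise agreement. The sketch below is a hypothetical helper that assumes the mclust package for the ARI computation; it mirrors the description above rather than the DepecheR internals.

library(mclust)  # provides adjustedRandIndex()

# Given a list of candidate center sets (one matrix per set, one row per cluster),
# pick the set whose induced partition of `data` agrees best, on average, with the
# partitions induced by the other candidate sets.
most_generalizable <- function(data, center_sets) {
  nearest <- function(x, centers)
    apply(x, 1, function(p) which.min(colSums((t(centers) - p)^2)))
  partitions <- lapply(center_sets, function(cs) nearest(data, cs))
  mean_agreement <- sapply(seq_along(partitions), function(a)
    mean(sapply(seq_along(partitions)[-a], function(b)
      adjustedRandIndex(partitions[[a]], partitions[[b]]))))
  center_sets[[which.max(mean_agreement)]]
}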
Empiric performance of the penalty tuning scheme. Roughly speaking, DEPECHE combines a flavored penalized k-means algorithm with a parameter tuning scheme, which identifies an optimal resolution. A naturally arising question is then whether the parameter tuning scheme is able to determine a biologically relevant resolution, or if other penalized k-means clustering resolutions outperform the resolution chosen by DEPECHE. Using a range of datasets (Table 2), the biological relevance (measured as ARI to the manually curated solution) of the optimized DEPECHE partitions was compared to the biologically optimal partition among all partitions generated with 20 repetitions on each of a range of 11 penalties per dataset. Overall, the DEPECHE resolution selection showed close to optimal performance, as the selected solutions only had a median of 0.02 lower ARI to the gold standard (range 0–0.065) than the best possible solution with any penalty (Table 2).
Transforming and centering the data
The clusters produced by DEPECHE, as well as their interpretation, depend on the transformation and centering of the data. The transformation determines the relative importance of the measured variables, where variables with a larger spread have a stronger influence on the clustering. The centering defines where zero occurs in each variable, thereby influencing the clustering results due to the linear penalty.

DEPECHE is applicable to a large range of datasets where the number of dimensions, d, and the number of data points, n, can vary by many orders of magnitude. The differing characteristics of these datasets require different treatments with respect to transformation and centering.
Transformation. Empirically, a majority of single-cell transcriptome datasets tend to have a few variables in which the variance is many orders of magnitude greater than in the other variables. In this case, the high-variance variables will de facto determine the clustering, implying that the clustering will fail to take the majority of the measured information into account. To even out the influence of these high-variance variables on the clustering outcome, the data is log-transformed when such variables are present. In DepecheR, this data behavior is detected automatically by concatenating all variables into a one-dimensional vector, for which the kurtosis is calculated. A high kurtosis indicates that the variables differ greatly in their variance. For datasets with low kurtosis, refraining from the log transform is preferable, to avoid unnecessary data distortion.
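As an illustration, such a check could look as follows in base R; the kurtosis threshold and the pseudo-count used before the log2 transform are assumptions, not DepecheR defaults.

# Detect whether a few variables dominate the variance, via the kurtosis of all
# values pooled into one vector, and log2-transform the data in that case.
log_if_high_kurtosis <- function(data, kurtosis_threshold = 100) {
  pooled <- as.vector(as.matrix(data))
  centered <- pooled - mean(pooled)
  kurt <- mean(centered^4) / mean(centered^2)^2   # classical (non-excess) kurtosis
  if (kurt > kurtosis_threshold) {
    log2(as.matrix(data) + 1)   # assumed pseudo-count of 1 before the log2
  } else {
    as.matrix(data)
  }
}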
Centering. Centering the origin to be close to the bulk of the data is preferable, in order to
have all biological clusters at approximately the same distance from the origin. Having some
biological clusters close to the origin and some far off is often unwanted, since the linear
penalty then imposes a preference for creating clusters close to the origin. Apart from influencing
the clustering, the centering also determines the interpretation of the obtained sparsity. Just as
for scaling, which centering scheme to apply depends on the dataset.
For low-dimensional datasets (d < 100), DEPECHE applies maximal density centering, which sets the zero in each dimension to coincide with the highest data density. The density is computed by collecting the data in equally spaced bins (the default number of bins in DepecheR is the number of data points n divided by 50), where the bin with the highest number of data points has the highest density. Using this scheme, sparsity (i.e. that a variable is
non-contributing to the definition of a cluster) is interpreted as non-deviance from the most common
outcome. It also ensures that the origin is relatively close to the bulk of the data, since it is located
at the most common outcome for each variable respectively. The benefit of this scheme is that
it boosts sparsity, by declaring the most common outcome non-defining. However, for high-dimensional datasets (d ≥ 100), maximal density centering can push the origin so far away from the center of mass of the dataset that the penalty starts to impose an unwanted, artificial influence on the clustering, hampering the biological relevance of the clusters. To avoid this, DEPECHE imposes a mean centering scheme for such datasets, which locates the origin at the center of mass of the dataset.
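A base-R sketch of the two centering schemes, following the bin-count rule and the dimensionality cut-off described above, could look as follows (again an illustration rather than the DepecheR code):

# Center a data matrix either at the per-variable density maximum (low d)
# or at the per-variable mean (high d), following the rules described above.
center_data <- function(data, d_cutoff = 100, points_per_bin = 50) {
  data <- as.matrix(data)
  if (ncol(data) < d_cutoff) {
    # Maximal density centering: subtract, per column, the midpoint of the
    # most populated of (n / points_per_bin) equally spaced bins.
    n_bins <- max(1, floor(nrow(data) / points_per_bin))
    mode_estimate <- apply(data, 2, function(x) {
      h <- hist(x, breaks = n_bins, plot = FALSE)
      h$mids[which.max(h$counts)]
    })
    sweep(data, 2, mode_estimate)
  } else {
    # Mean centering for high-dimensional data.
    sweep(data, 2, colMeans(data))
  }
}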
A potential complication, related to centering, occurs when a biologically relevant cluster is located very close to the origin, since DEPECHE creates no clusters at the origin and will then force the cluster to merge with other clusters. However, this scenario was never detected in any of the datasets analyzed here.
Generation and analysis of synthetic example for k-means algorithm comparisons.
The synthetic dataset was generated using base package functions. The standard deviation was identical for all clusters and dimensions. For classical k-means analysis, the kmeans function in the stats package was used. For sparse k-means, the sparcl package was used [26].
Preprocessing of mass cytometry data. The benchmark datasets from Levine et al [14] and Bendall et al [20] were transformed using the flowTrans package [27] before being used in any analysis.
Preprocessing of single-cell transcriptomic data. The dataset from Björklund et al [21] was normalized using the sva package [28], as in the original manuscript. For this dataset, non-expressed transcripts were removed, lowering the number of variables from 64443 to 35177. The gold-standard datasets used for benchmarking in the publication by Kiselev et al [5] were obtained in a pre-processed state. Before clustering with any algorithm, the gene filter function used in the sc3 package was applied [5], with settings removing the genes that were expressed in more than 90% of the cells. This resulted in the number of transcripts presented in Table 1 (range 13608–19571 transcripts).
All code necessary to generate the figures and tables in the manuscript is included in S1 File. The software package DepecheR is available for download at https://github.com/Theorell/DepecheR.
S1 Fig. A comparison of classical k-means, sparse k-means and DEPECHE. A: Overlap between true clusters and clusters generated with classical k-means, sparse k-means, or DEPECHE. Numbers in heatmaps denote the percentage of the true cluster present in the generated cluster in question. Red color indicates high overlap, blue color low overlap. B: Total loadings for all input vectors with sparse k-means. Light color indicates a strong contribution to separation of the clusters and vice versa. Grey color indicates that the variable has been excluded. C: Cluster center matrix for the DEPECHE analysis. Rows indicate the DEPECHE clusters, columns indicate the variables that contribute to separating at least one cluster. A light color indicates that the cluster center is located in the upper part of the distribution of values in the vector in question, and vice versa. Grey color indicates that the variable has been excluded. For DEPECHE, only 10 dimensions are shown, as the 11th did not contribute to separating any cluster.
S2 Fig. Heatmaps comparing the gold standard partitions to the DEPECHE partitions for A) the 32-variate Levine dataset and B) the 35166-variate Björklund dataset. Red color indicates large overlap, blue color indicates low overlap between a gold standard vs DEPECHE cluster pair. Numbers in heatmaps denote the percentage of the gold standard cluster present in the DEPECHE cluster in question.
S3 Fig. Algorithm comparisons. For all graphs, the x-axis shows the algorithms and the y-axis
shows the Adjusted Rand Index comparing the clustering result with the golden standard
clustering. Below each graph is the average computational time in seconds for the benchmarking
performed on a laptop computer with four 2.8 GHz Intel Core i7 processor cores. A) Subsamples with
20000 unique cells from two mass cytometry datasets published by Levine et al and Bendall
et al were clustered with DEPECHE and six previously published algorithms. For each dataset
and algorithm, clustering was performed on 20 unique subsamples. For flowClust, flowPeaks
and SamSPECTRAL, that do not perform internal parameter tuning, a range of parameter
values were evaluated and the parameter value sets generating the highest ARI values were
selected for display. B) The full Björklund dataset, as well as six other datasets previously used
for benchmarking by Kiselev et al were clustered 20 times with DEPECHE and three other
algorithms. The Björklund dataset was normalized to reduce batch effects, with the procedure
described in the original publication. These six datasets were also automatically
log2-transformed within DEPECHE, and thus, log2-transformation was applied also for Sincera and
pcaReduce, whereas sc3 was fed both log2- and untransformed data. The lower and upper
hinges of all boxplots extend to the 25th and 75th percentiles, whereas the line in the middle describes the median. The whiskers extend to the lowest and highest value no further than 1.5 times the distance between the 25th and 75th percentiles. Outside of this range, the
observations are considered outliers and are shown as dots.
S1 File. The code used to generate all figures.
S2 File. Information on how to retrieve the data used for this study.
The authors are grateful for all important input that has come from the initial users of the
DepecheR software, especially Dr Niklas Björkström, Dr Jakob Michaëlsson and Sigrun Stultz.
Other colleagues that have contributed seminally to the framework of ideas that have led to the
creation of DEPECHE are Dr Bruce Bagwell, Dr Ryan Ramanujam and Dr Geoffrey Hart.
Conceptualization: Axel Theorell, Jakob Theorell.
Data curation: Axel Theorell, Jakob Theorell.
Formal analysis: Axel Theorell, Jakob Theorell.
Investigation: Axel Theorell.
Methodology: Axel Theorell, Jakob Theorell.
Software: Axel Theorell, Jakob Theorell.
Supervision: Yenan Troi Bryceson, Jakob Theorell.
Validation: Axel Theorell, Jakob Theorell.
Visualization: Jakob Theorell.
Writing – original draft: Jakob Theorell.
Writing – review & editing: Axel Theorell, Yenan Troi Bryceson, Jakob Theorell.
1. Spitzer MH , Nolan GP . Mass Cytometry: Single Cells, Many Features . Cell . 2016 ; 165 : 780 - 791 . https://doi.org/10.1016/j.cell. 2016 . 04 .019 PMID: 27153492
2. Tanay A , Regev A . Scaling single-cell genomics from phenomenology to mechanism . Nature . 2017 ; 541 : 331 - 338 . https://doi.org/10.1038/nature21350 PMID: 28102262
3. Budnik B , Levy E , Slavov N . Mass-spectrometry of single mammalian cells quantifies proteome heterogeneity during cell differentiation . bioRxiv . 2017 ; 102681 . https://doi.org/10.1101/102681
4. Saeys Y , Gassen SV , Lambrecht BN . Computational flow cytometry: helping to make sense of highdimensional immunology data . Nat Rev Immunol . 2016 ; 16 : 449 - 462 . https://doi.org/10.1038/nri. 2016 . 56 PMID: 27320317
5. Kiselev VY , Kirschner K , Schaub MT , Andrews T , Yiu A , Chandra T , et al. SC3: consensus clustering of single-cell RNA-seq data . Nat Methods . 2017 ; 14 : 483 - 486 . https://doi.org/10.1038/nmeth.4236 PMID: 28346451
6. Guo M , Wang H , Potter SS , Whitsett JA , Xu Y. SINCERA: A Pipeline for Single-Cell RNA-Seq Profiling Analysis . PLoS Comput Biol . 2015 ; 11 : e1004575. https://doi.org/10.1371/journal.pcbi. 1004575 PMID: 26600239
7. Žurauskienė J, Yau C. pcaReduce: hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics. 2016; 17: 140. https://doi.org/10.1186/s12859-016-0984-y PMID: 27005807
8. Bashashati A , Brinkman RR . A Survey of Flow Cytometry Data Analysis Methods . Adv Bioinforma . 2009 ; 2009 . https://doi.org/10.1155/ 2009 /584603 PMID: 20049163
9. R Core Team. R: A Language and Environment for Statistical Computing [Internet] . Vienna, Austria: R Foundation for Statistical Computing; 2017 . https://www.R-project.org
10. MacQueen J. Some methods for classification and analysis of multivariate observations . Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability . Berkeley, California: University of California Press; 1967 . pp. 281 - 297 . https://projecteuclid.org/euclid.bsmsp/1200512992
11. Pan W. Penalized model -based clustering with application to variable selection . J Mach Learn Res . 2007 ; 8 : 1145 - 1164 .
12. Witten DM , Tibshirani R . A framework for feature selection in clustering . J Am Stat Assoc . 2010 ; 105 : 713 - 726 . https://doi.org/10.1198/jasa. 2010 .tm09415 PMID: 20811510
13. Sun W , Wang J , Fang Y . Regularized k-means clustering of high-dimensional data and its asymptotic consistency . Electron J Stat . 2012 ; 6 : 148 - 167 . https://doi.org/10.1214/12-EJS668
14. Levine JH , Simonds EF , Bendall SC , Davis KL , Amir ED , Tadmor MD , et al. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis . Cell . 2015 ; 162 : 184 - 197 . https://doi.org/10.1016/j.cell. 2015 . 05 .047 PMID: 26095251
15. Lo K , Hahne F , Brinkman RR , Gottardo R. flowClust: a Bioconductor package for automated gating of flow cytometry data . BMC Bioinformatics . 2009 ; 10 : 145 . https://doi.org/10.1186/ 1471 -2105-10-145 PMID: 19442304
16. Aghaeepour N. flowMeans: Non-parametric Flow Cytometry Data Gating . 2010 .
17. Sörensen T, Baumgart S, Durek P, Grützkau A, Häupl T. immunoClust-An automated analysis pipeline for the identification of immunophenotypic signatures in high-dimensional cytometric datasets. Cytom Part J Int Soc Anal Cytol. 2015; 87: 603-615. https://doi.org/10.1002/cyto.a.22626 PMID: 25850678
18. Ge Y , Sealfon SC . flowPeaks: a fast unsupervised clustering for flow cytometry data via K-means and density peak finding . Bioinforma Oxf Engl . 2012 ; 28 : 2052 - 2058 . https://doi.org/10.1093/bioinformatics/ bts300 PMID: 22595209
19. Zare H , Shooshtari P , Gupta A , Brinkman RR . Data reduction for spectral clustering to analyze high throughput flow cytometry data . BMC Bioinformatics . 2010 ; 11 : 403 . https://doi.org/10.1186/ 1471 - 2105-11-403 PMID: 20667133
20. Bendall SC , Simonds EF , Qiu P , Amir ED , Krutzik PO , Finck R , et al. Single-Cell Mass Cytometry of Differential Immune and Drug Responses Across a Human Hematopoietic Continuum . Science . 2011 ; 332 : 687 - 696 . https://doi.org/10.1126/science.1198704 PMID: 21551058
21. Björklund ÅK, Forkel M, Picelli S, Konya V, Theorell J, Friberg D, et al. The heterogeneity of human CD127(+) innate lymphoid cells revealed by single-cell RNA sequencing. Nat Immunol. 2016; 17: 451-460. https://doi.org/10.1038/ni.3368 PMID: 26878113
22. Dempster AP , Laird NM , Rubin DB . Maximum Likelihood from Incomplete Data via the EM Algorithm . J R Stat Soc Ser B Methodol . 1977 ; 39 : 1 - 38 .
23. Bishop C . Pattern Recognition and Machine Learning [Internet] . New York: Springer-Verlag; 2006 . www.springer.com/us/book/9780387310732
24. Arthur D , Vassilvitskii S . K-means++: The Advantages of Careful Seeding . Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms . Philadelphia, PA, USA: Society for Industrial and Applied Mathematics; 2007 . pp. 1027 - 1035 . http://dl.acm.org/citation.cfm?id= 1283383 . 1283494
25. Hubert L , Arabie P . Comparing partitions . J Classif . 1985 ; 2 : 193 - 218 . https://doi.org/10.1007/ BF01908075
26. Witten DM, Tibshirani R. sparcl: Perform Sparse Hierarchical Clustering and Sparse K-Means Clustering [Internet]. 2018. https://CRAN.R-project.org/package=sparcl
27. Finak G, Manuel-Perez J, Gottardo R. flowTrans: Parameter Optimization for Flow Cytometry Data Transformation. 2010.
28. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinforma Oxf Engl. 2012; 28: 882-883. https://doi.org/10.1093/bioinformatics/bts034 PMID: 22257669