Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters

Mar 2016

Four of the most common limitations of the many available clustering methods are: i) the lack of a proper strategy to deal with outliers; ii) the need for a good a priori estimate of the number of clusters to obtain reasonable results; iii) the lack of a method able to detect when partitioning of a specific dataset is not appropriate; and iv) the dependence of the result on the initialization. Here we propose Cross-clustering (CC), a partial clustering algorithm that overcomes these four limitations by combining the principles of two well-established hierarchical clustering algorithms: Ward’s minimum variance and Complete-linkage. We validated CC by comparing it with a number of existing clustering methods, including Ward’s and Complete-linkage. We show, on both simulated and real datasets, that CC performs better than the other methods in terms of the identification of the correct number of clusters, the identification of outliers, and the determination of real cluster memberships. We used CC to cluster samples, in order to identify disease subtypes, and gene profiles, in order to determine groups of genes with the same behavior. Results obtained on a non-biological dataset show that the method is general enough to be used successfully in such diverse applications. The algorithm has been implemented in the statistical language R and is freely available from the CRAN contributed packages repository.

RESEARCH ARTICLE

Paola Tellaroli (1)*, Marco Bazzi (1), Michele Donato (2), Alessandra R. Brazzale (1), Sorin Drăghici (2,3)

(1) Department of Statistical Sciences, University of Padova, Padova, Italy
(2) Department of Computer Science, Wayne State University, Detroit, MI, United States of America
(3) Department of Obstetrics and Gynecology, Wayne State University School of Medicine, Detroit, MI, United States of America

OPEN ACCESS

Citation: Tellaroli P, Bazzi M, Donato M, Brazzale AR, Drăghici S (2016) Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters. PLoS ONE 11(3): e0152333. doi:10.1371/journal.pone.0152333

Editor: Hans A Kestler, University of Ulm, GERMANY

Received: August 17, 2015
Accepted: March 11, 2016
Published: March 25, 2016

Copyright: © 2016 Tellaroli et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: The brain tumor dataset is available for free download at http://www.broadinstitute.org/MPR/CNS/, while the breast cancer dataset corresponds to the dataset GSE38888 from the Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE38888).

Funding: This work was supported by Fondazione Cassa di Risparmio di Padova e Rovigo (http://www.fondazionecariparo.net/) grant number PARO112419 (grant recipient: ARB), National Institute of Health (http://www.nih.gov/) grant number RO1 RDK089167 (grant recipient: SD), National Institute of Health (http://www.nih.gov/) grant number R42 GM087013 (grant recipient: SD), and National Science Foundation (http://www.nsf.gov/) grant number DBI0965741 (grant recipient: SD). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

Introduction

Clustering is the process of partitioning elements into a number of groups (clusters) such that elements in the same cluster are more similar to each other than to elements in different clusters. Clustering has been applied in a wide variety of fields, ranging from medical sciences, economics, computer sciences, engineering, and social sciences to earth sciences [1, 2], reflecting its important role in scientific research. With several hundred clustering methods in existence [3], there is clearly no shortage of clustering algorithms but, at the same time, satisfactory answers to some basic questions are still to come.

Clustering methods are nowadays essential tools for the analysis of gene expression data and are routinely used in many research projects [4]. Many papers have shown that genes or proteins of similar function cluster together [5–10], and clustering methods have been used to solve many problems of a biological nature. One of the most interesting of these problems is disease subtyping, i.e. the stratification of patients in terms of underlying disease characteristics. This is extremely important in the drug development process, in which the correct identification of the subgroup of patients who are most likely to respond to the drug may be needed in order to get the drug approved by the FDA. Ultimately, disease subtyping is also expected to be the key to personalized therapies.

A widely used type of clustering is K-means [11–13], the best known squared error-based clustering algorithm [14]. This method consists of initializing a number of random centroids, one for each cluster, and then associating each element with the nearest centroid.
This procedure is repeated until the locations of the centroids no longer change. A similar clustering algorithm is Partitioning Around Medoids (PAM) [15], which aims to find a sequence of elements, called medoids, that are centrally located in clusters, with the goal of minimizing the sum of the dissimilarities of all elements to their nearest medoid. Affinity Propagation [16] also starts from a similar idea, identifying exemplars among data points and building clusters around these exemplars. Another widely used clustering algorithm is Spectral clustering, which makes use of the eigenvalues of the similarity matrix of the data before clustering.

Many of the most widely used clustering methods, including K-means, PAM, and Spectral clustering, require the estimation of the most appropriate number of clusters for the data. Ideally, the resulting clusters should not only have good properties (compact, well-separated, and stable), but also give biologically meaningful results. This is an issue that derives from the more general problem of defining the term “cluster” [3] and has been extensively treated in the literature [17]. Furthermore, K-means is not a deterministic method, because the results depend on the initialization of the algorithm and can change between successive runs. The same non-deterministic property is shared by SOM [18], a neural network clustering method which, even though it does not need the number of clusters to be defined a priori, requires the user to specify the maximum number of clusters. Another similar clustering algorithm we consider is AutoSOME [19], which, (...truncated)
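
The K-means procedure described earlier in this introduction (random initial centroids, assignment of each element to its nearest centroid, repetition until the centroids stop moving) can be illustrated with a minimal Python sketch. This is not the code used in the paper or in its R package. It assumes squared Euclidean distance and draws the initial centroids from the data points themselves; the names `kmeans` and `squared_distance` and the parameters `seed` and `max_iter` are hypothetical and chosen only for this illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Illustrative K-means loop (an assumption-laden sketch, not the paper's code).

    Returns (centroids, labels), where labels[i] is the index of the
    centroid nearest to points[i].
    """
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: associate each element with its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Stop once the centroid locations no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2, seed=0)
```

On this toy data the three points near the origin end up in one cluster and the three points near (10, 10) in the other. On less clearly separated data, changing `seed` can change the resulting partition, which is the dependence on initialization noted above; the number of clusters `k` must also be supplied by the user.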


Article home page: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152333

Paola Tellaroli, Marco Bazzi, Michele Donato, Alessandra R. Brazzale, Sorin Drăghici. Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters. PLOS ONE, 2016, Volume 11, Issue 3, DOI: 10.1371/journal.pone.0152333