Improve consensus partitioning via a hierarchical procedure

Briefings in Bioinformatics, 2022, 23(3), 1–13
https://doi.org/10.1093/bib/bbac048
Problem Solving Protocol

Zuguang Gu and Daniel Hübschmann†

Corresponding author: Zuguang Gu, Molecular Precision Oncology Program, National Center for Tumor Diseases (NCT) Heidelberg, Im Neuenheimer Feld 280, Heidelberg 69120, Germany. Tel.: +49 6221 42 3607.

† Daniel Hübschmann, Molecular Precision Oncology Program, National Center for Tumor Diseases (NCT) Heidelberg, Im Neuenheimer Feld 280, 69120 Heidelberg, Germany. Heidelberg Institute of Stem Cell Technology and Experimental Medicine (HI-STEM), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany. German Cancer Consortium (DKTK), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany. Department of Pediatric Immunology, Hematology and Oncology, University Hospital Heidelberg, 69120 Heidelberg, Germany.

Abstract

Consensus partitioning is an unsupervised method widely used in high-throughput data analysis for revealing subgroups and assessing the stability of the classification. However, standard consensus partitioning procedures are weak at identifying large numbers of stable subgroups. There are two major issues. First, subgroups with small differences are difficult to separate when they are detected simultaneously with subgroups that have large differences. Second, the stability of the classification generally decreases as the number of subgroups increases. In this work, we propose a new strategy that addresses these two issues by applying consensus partitioning in a hierarchical procedure. We demonstrate that hierarchical consensus partitioning can efficiently reveal more meaningful subgroups. We also tested the performance of hierarchical consensus partitioning in revealing a large number of subgroups on a large deoxyribonucleic acid (DNA) methylation dataset.
Hierarchical consensus partitioning is implemented in the R package cola, which provides comprehensive functionality for analysis and visualization. The analysis can also be automated with as few as two lines of code, which generate a detailed HTML report containing the complete analysis. The cola package is available at https://bioconductor.org/packages/cola/.

Keywords: consensus partitioning, unsupervised classification, hierarchical method, Bioconductor, R package

Introduction

Consensus partitioning, or consensus clustering, is an unsupervised learning method that classifies samples into subgroups and evaluates the stability of the classification by resampling from the original data [1]. It has become an important tool in high-throughput data analysis, e.g. to reveal cancer subtypes [2] or to validate the agreement of a classification with known clinical factors. In our previous work [3], we developed an R/Bioconductor package named cola that provides a general framework for consensus partitioning. It allows multiple feature selection methods and partitioning methods to be run simultaneously, and it provides comprehensive visualization and reporting utilities for automatic and in-depth interpretation of the results. Cola provides a new and efficient method named ATC (ability to correlate to other rows) for extracting top features, and it recommends spherical k-means clustering [4] for subgroup classification. Through comprehensive benchmarks on public datasets, we demonstrated that cola was able to generate new, stable and biologically meaningful classifications.

Cola provides a convenient toolkit for performing consensus partitioning analysis. It performs well when the expected number of subgroups is relatively small, e.g. no larger than six, as demonstrated in our previous study [3]. However, when the number of expected subgroups increases, issues that are common to general consensus partitioning procedures [5, 6] arise and significantly affect the classification.
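To make the notion of stability under resampling concrete, the following is a minimal Python sketch of a consensus matrix. It is not cola's implementation: the helper names (kmeans_labels, consensus_matrix), the plain k-means partitioner, the choice to resample 80% of the features in each run and the synthetic data are all assumptions made for illustration. Each entry of the resulting matrix is the fraction of runs in which two samples were assigned to the same subgroup.

```python
import numpy as np


def kmeans_labels(x, k, rng, n_iter=20):
    """Tiny k-means on the rows of x (illustrative helper, not from cola)."""
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance of every row to every center.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster is empty
                centers[j] = x[labels == j].mean(axis=0)
    return labels


def consensus_matrix(mat, k, n_runs=50, p_sampling=0.8, seed=0):
    """mat is features x samples; returns samples x samples co-clustering rates.

    Assumption of this sketch: each run partitions the samples using a random
    subset of the features (rows), so every sample takes part in every run.
    """
    rng = np.random.default_rng(seed)
    n_features, n_samples = mat.shape
    n_keep = max(1, int(round(p_sampling * n_features)))
    together = np.zeros((n_samples, n_samples))
    for _ in range(n_runs):
        rows = rng.choice(n_features, size=n_keep, replace=False)
        labels = kmeans_labels(mat[rows].T, k, rng)
        together += labels[:, None] == labels[None, :]
    return together / n_runs


# Synthetic example: 50 features x 20 samples with two clearly separated subgroups.
rng = np.random.default_rng(42)
mat = rng.normal(size=(50, 20))
mat[:30, 10:] += 4.0  # samples 10-19 are shifted in the first 30 features

cons = consensus_matrix(mat, k=2)
group = np.repeat([0, 1], 10)
same = group[:, None] == group[None, :]
within = cons[same].mean()
between = cons[~same].mean()
print(f"mean consensus within subgroups:  {within:.2f}")
print(f"mean consensus between subgroups: {between:.2f}")
```

A stable classification shows consensus values close to 1 for pairs of samples in the same subgroup and close to 0 for pairs in different subgroups; values in between indicate that the assignment changes from run to run.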
Zuguang Gu is a senior scientist at the National Center for Tumor Diseases, Heidelberg, Germany. His research interests include statistical analysis on various types of high-throughput data. He is also active in R/Bioconductor package development for data visualization and analysis.
Daniel Huebschmann is a group leader of the Molecular Precision Oncology Program, National Center for Tumor Diseases, Heidelberg, Germany.
Received: November 5, 2021. Revised: January 20, 2022. Accepted: January 30, 2022
© The Author(s) 2022. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact

In consensus partitioning procedures, the top n features scored by a certain method, e.g. standard deviation (SD), are selected first. Sample classification is then applied only to these top features. A good classification is expected to select features that are able to separate all subgroups; in other words, consensus partitioning procedures treat all samples equally. However, in real-world datasets, this condition cannot always be met. Features that are good at separating major subgroups (i.e. subgroups with large differences) may be weak at separating secondary subgroups (i.e. subgroups with small differences) if the secondary subgroups have different sets of features that are efficient for classification. As the real number of subgroups becomes larger, it becomes highly likely that subgroups have different sets of efficient features for classification, which makes it difficult to reach a stable separation of secondary subgroups when they are classified together with major subgroups.
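The effect described above can be illustrated with a small Python sketch. It is a toy example and not part of cola: the function name top_features_by_sd, the synthetic matrix, the group sizes and the shift values are all assumptions chosen for illustration. The top features ranked by SD across all samples are dominated by those that separate the major subgroups, whereas ranking within one major subgroup brings out the features that separate its secondary subgroups.

```python
import numpy as np


def top_features_by_sd(mat, n):
    """Indices of the n rows (features) of mat with the largest SD."""
    sd = mat.std(axis=1, ddof=1)
    return np.argsort(sd)[::-1][:n]


# Synthetic matrix: 100 features x 60 samples.
# Samples 0-19 form major subgroup A; samples 20-59 form major subgroup B,
# which contains two secondary subgroups B1 (20-39) and B2 (40-59).
rng = np.random.default_rng(1)
mat = rng.normal(size=(100, 60))
mat[0:20, 0:20] += 6.0    # features 0-19: large difference, A versus B
mat[20:40, 40:60] += 3.0  # features 20-39: small difference, B1 versus B2

major = set(range(0, 20))
secondary = set(range(20, 40))

# Top 20 features when all samples are considered together.
top_all = set(top_features_by_sd(mat, 20).tolist())
# Top 20 features when only the samples of subgroup B are considered.
top_b = set(top_features_by_sd(mat[:, 20:60], 20).tolist())

print("all samples: major features selected    ", len(top_all & major))
print("all samples: secondary features selected", len(top_all & secondary))
print("subgroup B:  secondary features selected", len(top_b & secondary))
```

In this toy setting, a single global feature selection leaves little room for the features that distinguish B1 from B2, which is the situation that re-selecting features within a subset of samples is meant to address.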
The second issue is that, as the number of subgroups grows, the probability that two samples fall into different subgroups tends to increase, which reduces the stability of the classification. Both issues prevent the classification from reaching a large number of stable subgroups.

In this work, to solve the issues raised above, we propose a strategy named hierarchical consensus partitioning (HCP) that applies standard cola consensus partitioning (CP) in a hierarchical procedure. Simply put, samples are first classified into k groups, where k is a small number corresponding to the major subgroups. Then, for each subgroup of samples, CP is applied again with a new set of top features extracted only from that subset of samples. The hierarchical procedure continues until certain stopping criteria are met. In this way, small or secondary subgroups can, in theory, be detected in later steps of the hierarchical procedure. The process generates a hierarchy of subgroups in which subsets of samples are represented as nodes. The idea of executing CP hierarchically has also been applied when i (...truncated)