Improve consensus partitioning via a hierarchical procedure.
Briefings in Bioinformatics, 2022, 23(3), 1–13
https://doi.org/10.1093/bib/bbac048
Problem Solving Protocol
Zuguang Gu
and Daniel Hübschmann†
Corresponding author: Zuguang Gu, Molecular Precision Oncology Program, National Center for Tumor Diseases (NCT) Heidelberg, Im Neuenheimer Feld 280,
Heidelberg 69120, Germany. Tel.: +49 6221 42 3607; E-mail: .
† Daniel Hübschmann, Molecular Precision Oncology Program, National Center for Tumor Diseases (NCT) Heidelberg, Im Neuenheimer Feld 280, 69120 Heidelberg,
Germany. Heidelberg Institute of Stem Cell Technology and Experimental Medicine (HI-STEM), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany. German
Cancer Consortium (DKTK), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany. Department of Pediatric Immunology, Hematology and Oncology, University
Hospital Heidelberg, 69120 Heidelberg, Germany. E-mail:
Abstract
Consensus partitioning is an unsupervised method widely used in high-throughput data analysis for revealing subgroups and
assessing the stability of the classification. However, standard consensus partitioning procedures are weak at identifying large numbers
of stable subgroups. There are two major issues. First, subgroups with small differences are difficult to separate if they are
detected simultaneously with subgroups with large differences. Second, the stability of the classification generally decreases as the number
of subgroups increases. In this work, we proposed a new strategy to solve these two issues by applying consensus partitioning in a
hierarchical procedure. We demonstrated that hierarchical consensus partitioning can efficiently reveal more meaningful subgroups.
We also tested the performance of hierarchical consensus partitioning in revealing a large number of subgroups with a large
deoxyribonucleic acid methylation dataset. Hierarchical consensus partitioning is implemented in the R package cola with
comprehensive functionalities for analysis and visualization. The analysis can also be automated with as few as two lines of
code, which generate a detailed HTML report containing the complete analysis. The cola package is available at
https://bioconductor.org/packages/cola/.
Keywords: consensus partitioning, unsupervised classification, hierarchical method, Bioconductor, R package
Introduction
Consensus partitioning, or consensus clustering, is an
unsupervised learning method that classifies samples
into subgroups and evaluates the stability of the classification by resampling from the original data [1]. It has become
an important tool in high-throughput data analysis, e.g. to reveal cancer subtypes [2] or to validate the
agreement of the classification with known clinical factors.
In our previous work [3], we developed an R/Bioconductor
package named cola that provides a general framework
for consensus partitioning. It allows multiple feature selection methods and partitioning
methods to be run simultaneously, and it provides comprehensive visualization
and reporting utilities for automatic and deep interpretation of the results. Cola provides a new and efficient
method named ATC (ability to correlate to other rows)
for extracting top features and it recommends spherical k-means clustering [4] for subgroup classification.
Through comprehensive benchmarks on public datasets,
we demonstrated that cola was able to generate new, stable
and biologically meaningful classifications.
Cola provides a convenient toolkit for performing
consensus partitioning analysis. It performs well when
the expected number of subgroups is relatively small, e.g.
no larger than six, as demonstrated in our previous study
[3]. However, when the number of expected subgroups
increases, issues with general consensus partitioning
procedures [5, 6] arise and can significantly affect
the classification. In consensus partitioning procedures,
the top n features scored by a certain method,
e.g. standard deviation (SD), are selected first. Sample
classification is then applied only to these top features. A
good classification relies on the selected features
being able to separate all subgroups;
in other words, consensus partitioning procedures
take all samples into account equally. However, in
real-world datasets, this condition cannot always be
met. It is possible that features good at separating
major subgroups (i.e. subgroups with large differences)
are weak for secondary subgroups (i.e. subgroups
with small differences) if the secondary subgroups
have different sets of features that are efficient for
Zuguang Gu is a senior scientist at the National Center for Tumor Diseases, Heidelberg, Germany. His research interests include statistical analysis of various
types of high-throughput data. He is also active in R/Bioconductor package development for data visualization and analysis.
Daniel Huebschmann is a group leader of the Molecular Precision Oncology Program, National Center for Tumor Diseases, Heidelberg, Germany.
Received: November 5, 2021. Revised: January 20, 2022. Accepted: January 30, 2022
© The Author(s) 2022. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (http://creativecommons.org/licenses/
by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial
re-use, please contact
classification. When the real number of subgroups
becomes larger, it is highly likely that subgroups
have different sets of efficient features for classification,
so it may be difficult to
reach a stable separation of secondary subgroups when
classifying them with major subgroups at the same
time. The second issue is that, when the number of
subgroups gets larger, the probability that two samples
fall into different subgroups tends to increase, which
results in a loss of stability of the classification. Both
issues hinder the classification from reaching a large number
of stable subgroups.
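
The first issue can be illustrated with a minimal sketch. It is not the cola implementation: the toy matrix, the feature names and the helper top_features are all hypothetical, and SD is used as the scoring method only because it is the example named above. The sketch ranks features by SD once over all samples and once over the samples of a single major subgroup.

```python
from statistics import pstdev

# Hypothetical toy matrix: keys are features, values are measurements for 8 samples.
# Samples 0-3 form major subgroup A and samples 4-7 form major subgroup B.
# Within A, samples 0-1 and 2-3 form two secondary subgroups.
features = {
    "major_1":   [0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0],
    "major_2":   [0.0, 1.0, 0.0, 1.0, 9.0, 10.0, 9.0, 10.0],
    "secondary": [0.0, 0.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0],
    "noise":     [1.0, 1.2, 0.9, 1.1, 1.0, 0.9, 1.1, 1.0],
}


def top_features(features, sample_idx, n):
    """Return the n feature names with the largest SD over the given samples."""
    scores = {
        name: pstdev([row[i] for i in sample_idx])
        for name, row in features.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


all_samples = list(range(8))
group_a = [0, 1, 2, 3]

# Ranking over all samples is dominated by the A-versus-B difference.
global_top = top_features(features, all_samples, 2)

# Ranking restricted to subgroup A favours the feature that splits A further.
subset_top = top_features(features, group_a, 1)

print(global_top, subset_top)
```

In this toy case the feature that separates the secondary subgroups is not among the top features selected over all samples, but it ranks first once the selection is restricted to subgroup A.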
In this work, to solve the issues raised above,
we propose a strategy named hierarchical consensus
partitioning (HCP) that applies standard cola consensus
partitioning (CP) in a hierarchical procedure. Simply
speaking, one could first classify samples into k groups,
where k is a small number that corresponds to the
major subgroups. Then, for each subgroup of samples,
one could repeatedly apply CP with a new set of top
features extracted only from that subset of samples. The
hierarchical procedure stops when certain criteria are
met. In this way, theoretically, small subgroups
or secondary subgroups can be detected in later steps
of the hierarchical procedure. This process can generate
a hierarchy of subgroups where subsets of samples
are represented as nodes. The idea of executing CP
hierarchically has also been applied when i (...truncated)