ForestSubtype: a cancer subtype identifying approach based on high-dimensional genomic data and a parallel random forest
(2023) 24:289
Luo et al. BMC Bioinformatics
https://doi.org/10.1186/s12859-023-05412-y
RESEARCH
BMC Bioinformatics
Open Access
Junwei Luo1, Yading Feng1, Xuyang Wu1, Ruimin Li1, Jiawei Shi1, Wenjing Chang1 and Junfeng Wang1*
*Correspondence:
1 School of Software, Henan Polytechnic University, Jiaozuo, China
Abstract
Background: Cancer subtype classification is helpful for personalized cancer treatment. Although some approaches have been developed to classify cancer subtypes
based on high-dimensional gene expression data, it is difficult to obtain satisfactory
classification results. Meanwhile, some cancers have been well studied and classified
into subtypes that are adopted by most researchers. Hence, this prior knowledge
is significant for further identifying new meaningful subtypes.
Results: In this paper, we present ForestSubtype, an approach that combines parallel random forests and an autoencoder
for cancer subtype identification based on high-dimensional gene expression data. First, ForestSubtype adopts the parallel random forest (RF) and the prior
knowledge of cancer subtypes to train a model and extract significant candidate
features. Second, ForestSubtype uses a random forest as the base model and ten
parallel random forests to compute the weight of each feature and rank the features separately.
Then, the intersection of the features with the larger weights output by the ten parallel
random forests is taken as the subsequent candidate features. Third, ForestSubtype
uses an autoencoder to condense the selected features into two-dimensional data.
Fourth, ForestSubtype utilizes k-means++ to obtain new cancer subtype identification results. The breast cancer gene expression data obtained from The
Cancer Genome Atlas are used for training and validation, and an independent breast
cancer dataset from the Molecular Taxonomy of Breast Cancer International Consortium is used for testing. Additionally, we use two other cancer datasets to validate the generalizability of ForestSubtype. ForestSubtype outperforms the other two
methods in terms of the distribution of clusters and of internal and external metric results. The
open-source code is available at https://github.com/lffyd/ForestSubtype.
Conclusions: Our work shows that combining high-dimensional gene
expression data with parallel random forests and an autoencoder, guided by a priori
knowledge, can identify new subtypes more effectively than existing methods of cancer subtype classification.
Keywords: Cancer subtyping, Random forest, Gene expression data, Machine
learning, Auto Encoder
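The four steps summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that) and it rests on several assumptions: scikit-learn is used for the forests and clustering, the ten forests are trained one after another rather than in parallel, PCA stands in for the autoencoder as the two-dimensional reduction step, and the data are synthetic. The helper names `make_toy_data`, `select_features`, and `run_pipeline`, as well as the values of `top_k` and `n_trees`, are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier


def make_toy_data(n_per_class=40, n_features=60, seed=0):
    """Synthetic stand-in for expression data: 3 known subtypes, 3 informative features."""
    rng = np.random.default_rng(seed)
    y = np.repeat(np.arange(3), n_per_class)
    X = rng.normal(size=(y.size, n_features))
    for cls in range(3):
        X[y == cls, cls] += 4.0  # feature `cls` is shifted for subtype `cls`
    return X, y


def select_features(X, y, n_forests=10, top_k=10, n_trees=50):
    """Keep the features ranked in the top_k by every one of n_forests forests."""
    selected = None
    for seed in range(n_forests):
        # Each forest differs only in its random seed (an assumption of this sketch).
        forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        forest.fit(X, y)
        ranked = np.argsort(forest.feature_importances_)[::-1]
        top = set(ranked[:top_k].tolist())
        selected = top if selected is None else selected & top
    return sorted(selected)


def run_pipeline(X, y, n_clusters=3):
    """Feature intersection, 2-D reduction (PCA as a stand-in), then k-means++."""
    selected = select_features(X, y)
    embedding = PCA(n_components=2).fit_transform(X[:, selected])
    kmeans = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10, random_state=0)
    labels = kmeans.fit_predict(embedding)
    return selected, embedding, labels


X, y = make_toy_data()
selected, embedding, labels = run_pipeline(X, y)
print("selected feature indices:", selected)
print("cluster sizes:", np.bincount(labels))
```

Taking the intersection rather than the union makes the selection conservative: a feature survives only if every forest ranks it highly, which filters out features that score well in a single forest by chance.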
© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits
use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original
author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third
party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or
exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Introduction
Cancer is a disease closely associated with genetic predisposition, and primarily caused
by an imbalance between proliferation and growth-inhibiting apoptosis genes, resulting
in abnormal cell proliferation without death [1].
Modern medical research has established that cancer is not a single disease, but rather
a collection of hundreds of different diseases. Consequently, cancer can be divided into
heterozygous and homozygous cancers. Homozygous cancers can be staged not only
according to the stage of cancer development but also according to certain characteristics of the genes in the cancer cells, which allow cancer to be classified into different
subtypes [2]. Understanding these cancer subtypes is crucial for developing targeted
treatment plans and determining prognosis as cancer subtypes often include valuable
information about etiology, cancer biology, and personalized medicine research [3–5].
A single cancer may have many subtypes, which are significant for treatment.
For example, breast cancer is traditionally classified into five subtypes,
LumA, LumB, HER2, Basal and Normal, each with different treatment options [6].
Traditional cancer subtype classification may have limitations in implementing precise treatments for patients. Cancers with similar clinical and pathological manifestations may exhibit different behaviors, and identifying targeted and precise treatments
based on these different behaviors is the key to treating cancer [6, 7]. The
ability to identify cancer subtypes effectively is therefore crucial for guiding subsequent treatment
and improving patient prognosis.
High-dimensional gene expression data can be utilized to analyze changes in gene
expression, correlations between genes, and gene activity, among other things. Some
cancers have been studied to mark subtype categories, which have been used in many
areas of research [8, 9]. Consequently, many cancer subtyping methods use high-dimensional gene expression data to detect cancer subtypes.
Various methods for cancer subtyping have been presented, which fall into three categories.
(1) Methods based on supervised learning. Guo et al. [10] proposed BCDForest, which uses a multi-class granularity scanning method to train the
model while finding important features using a new enhancement strategy. Ahmed
et al. [11] proposed a cancer subtype classification method based on convolutional networks, which mainly uses the ResNext network model and a Transformer encoder
for feature extraction and classification.
(2) Methods based on unsupervised learning. Classifying unlabelled data is, by nature,
a clustering problem. Some cancer subtype
classification methods apply unsupervised learning to high-dimensional
gene expression data, but without a priori knowledge to guide them, they may identify cancer subtypes that have no clinical value. Witten et al. [12] proposed the method SparseK, which uses
a lasso pe (...truncated)