ForestSubtype: a cancer subtype identifying approach based on high-dimensional genomic data and a parallel random forest

Luo et al. BMC Bioinformatics (2023) 24:289
https://doi.org/10.1186/s12859-023-05412-y

RESEARCH, Open Access

Junwei Luo1, Yading Feng1, Xuyang Wu1, Ruimin Li1, Jiawei Shi1, Wenjing Chang1 and Junfeng Wang1*

*Correspondence: 1 School of Software, Henan Polytechnic University, Jiaozuo, China

Abstract

Background: Cancer subtype classification is helpful for personalized cancer treatment. Although some approaches have been developed to classify cancer subtypes based on high-dimensional gene expression data, it remains difficult to obtain satisfactory classification results. Meanwhile, some cancers have been well studied and classified into subtypes that are adopted by most researchers. Hence, this a priori knowledge is significant for further identifying new meaningful subtypes.

Results: In this paper, we present ForestSubtype, an approach that combines parallel random forests and an autoencoder for cancer subtype identification based on high-dimensional gene expression data. First, ForestSubtype adopts the parallel random forest (RF) and the a priori knowledge of cancer subtypes to train a model and extract significant candidate features. Second, ForestSubtype uses a random forest as the base module and ten parallel random forests to compute each feature weight and rank the features separately. Then, the intersection of the features with the larger weights output by the ten parallel random forests is taken as the set of candidate features. Third, ForestSubtype uses an autoencoder to condense the selected features into two-dimensional data. Fourth, ForestSubtype utilizes k-means++ to obtain new cancer subtype identification results.
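The feature-selection step described above (ranking features in each parallel random forest and keeping the intersection of the top-weighted ones) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the cutoff k, and the toy weight vectors are assumptions, and each forest is assumed to have already produced one importance value per feature.

```python
from typing import List, Sequence, Set


def top_k_features(weights: Sequence[float], k: int) -> Set[int]:
    """Return the indices of the k largest weights (ties go to the lower index)."""
    ranked = sorted(range(len(weights)), key=lambda i: (-weights[i], i))
    return set(ranked[:k])


def consensus_features(all_weights: Sequence[Sequence[float]], k: int) -> List[int]:
    """Keep only the features that every forest ranks among its top k."""
    if not all_weights:
        return []
    selected = top_k_features(all_weights[0], k)
    for weights in all_weights[1:]:
        selected &= top_k_features(weights, k)
    return sorted(selected)


# Hypothetical importance vectors from three forests over five features
# (the paper uses ten forests; three keep the example short).
forest_weights = [
    [0.40, 0.05, 0.30, 0.20, 0.05],
    [0.35, 0.10, 0.28, 0.05, 0.22],
    [0.30, 0.02, 0.38, 0.20, 0.10],
]

candidates = consensus_features(forest_weights, k=3)
print(candidates)  # [0, 2]
```

With these toy weights, features 0 and 2 are the only ones ranked in the top three by all three forests, so they form the candidate set passed on to the later steps.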
In this paper, breast cancer gene expression data obtained from The Cancer Genome Atlas are used for training and validation, and an independent breast cancer dataset from the Molecular Taxonomy of Breast Cancer International Consortium is used for testing. Additionally, we use two other cancer datasets to validate the generalizability of ForestSubtype. ForestSubtype outperforms the other two methods in terms of the distribution of clusters and the internal and external metric results. The open-source code is available at https://github.com/lffyd/ForestSubtype.

Conclusions: Our work shows that combining high-dimensional gene expression data with parallel random forests and an autoencoder, guided by a priori knowledge, can identify new subtypes more effectively than existing methods of cancer subtype classification.

Keywords: Cancer subtyping, Random forest, Gene expression data, Machine learning, Auto Encoder

© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Introduction

Cancer is a disease closely associated with genetic predisposition and primarily caused by an imbalance between proliferation and growth-inhibiting apoptosis genes, resulting in abnormal cell proliferation without death [1]. Modern medical research has established that cancer is not a single disease, but rather a collection of hundreds of different diseases. Consequently, cancer can be divided into heterozygous and homozygous cancers. Homozygous cancers can be staged not only according to the stage of cancer development but also according to certain characteristics of the genes in the cancer cells, which allow cancer to be classified into different subtypes [2]. Understanding these cancer subtypes is crucial for developing targeted treatment plans and determining prognosis, as cancer subtypes often include valuable information about etiology, cancer biology, and personalized medicine research [3–5].

A single cancer may have many subtypes, which are significant for treatment. For example, there are currently five traditionally classified subtypes of breast cancer, LumA, LumB, HER2, Basal and Normal, each with different treatment options [6]. Traditional cancer subtype classification may have limitations in implementing precise treatments for patients. Cancers with similar clinical and pathological manifestations may exhibit different behaviors, and identifying targeted and precise treatments based on these different behaviors is the key to treating cancer [6, 7]. Therefore, the ability to effectively identify cancer subtypes is crucial for guiding subsequent treatment and improving patient prognosis.
High-dimensional gene expression data can be utilized to analyze changes in gene expression, correlations between genes, and gene activity, among other things. For some cancers, established subtype categories have been defined and are used in many areas of research [8, 9]. Consequently, many cancer subtyping methods use high-dimensional gene expression data to detect cancer subtypes. Currently, various methods for cancer subtyping have been presented, which fall into three categories.

(1) Methods based on supervised learning. Guo et al. [10] proposed BCDForest, which introduces a multi-class granularity scanning method to train the model while finding important features using a new enhancement strategy. Ahmed et al. [11] proposed a cancer subtype classification method using convolutional networks, which mainly uses the ResNeXt network model and a Transformer encoder for feature extraction and classification.

(2) Methods based on unsupervised learning. Classifying unlabelled data falls within the scope of the clustering problem. Currently, some cancer subtype classification methods apply unsupervised learning to high-dimensional gene expression data, but subtypes with no clinical value may be identified when there is no a priori knowledge to guide them. Witten et al. [12] proposed the method SparseK, which uses a lasso pe (...truncated)