Comprehensive study of semi-supervised learning for DNA methylation-based supervised classification of central nervous system tumors

BMC Bioinformatics, Jun 2022

Precision medicine for cancer treatment relies on an accurate pathological diagnosis. The number of known tumor classes has increased rapidly, and relying on traditional histopathologic classification alone has become infeasible. To help reduce variability and validation costs and to standardize the histopathological diagnostic process, supervised machine learning models using DNA-methylation data have been developed for tumor classification. These methods require large labeled training data sets to reach clinically acceptable classification accuracy. While unlabeled epigenetic data are abundant across multiple databases, labeling pathology data for machine learning models is time-consuming and resource-intensive, especially for rare tumor types. Semi-supervised learning (SSL) approaches have been used to maximize the utility of labeled and unlabeled data for classification tasks and have been applied effectively in genomics. However, SSL methods have not yet been explored with epigenetic data, nor have they been shown to benefit central nervous system (CNS) tumor classification. This paper explores the application of semi-supervised machine learning to methylation data to improve the accuracy of supervised learning models in classifying CNS tumors. We comprehensively evaluated 11 SSL methods and developed a novel combination approach that included a self-training with editing using support vector machine (SETRED-SVM) model and an L2-penalized, multinomial logistic regression model to obtain high-confidence labels from a few labeled instances. Results across eight random forest and neural net models show that the pseudo-labels derived from our SSL method can significantly increase prediction accuracy for 82 CNS tumors and 9 normal controls. The proposed combination of a semi-supervised technique and multinomial logistic regression holds the potential to leverage the abundant publicly available unlabeled methylation data effectively. Such an approach is highly beneficial in providing additional training examples, especially for scarce tumor types, to boost the prediction accuracy of supervised models.
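
The idea of deriving pseudo-labels from a few labeled instances can be illustrated with a minimal self-training loop. The sketch below is not the authors' SETRED-SVM pipeline: it swaps in a nearest-centroid classifier for the support vector machine, omits the editing step that gives SETRED its name, and does not include the logistic regression stage. The function names, the `sharpness` parameter, the 0.9 confidence threshold, and the toy two-feature data are all assumptions made for illustration.

```python
import math


def class_centroids(X, y):
    """Mean feature vector per class, computed from the labeled samples."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def predict_proba(centroids, row, sharpness=10.0):
    """Softmax over negative squared distances to each class centroid.

    `sharpness` is a hypothetical scaling factor that controls how quickly
    confidence drops as a sample moves away from a centroid.
    """
    scores = {
        c: -sharpness * sum((a - b) ** 2 for a, b in zip(row, mean))
        for c, mean in centroids.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def self_train(X_lab, y_lab, X_unl, threshold=0.9, max_rounds=10):
    """Repeatedly pseudo-label unlabeled samples the model is confident about.

    Returns the accepted (sample, pseudo_label) pairs and the samples that
    never reached the confidence threshold.
    """
    X_lab, y_lab = list(X_lab), list(y_lab)
    remaining = list(X_unl)
    pseudo = []
    for _ in range(max_rounds):
        centroids = class_centroids(X_lab, y_lab)
        confident, uncertain = [], []
        for row in remaining:
            proba = predict_proba(centroids, row)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:
                confident.append((row, label))
            else:
                uncertain.append(row)
        if not confident:
            break  # nothing new to learn from; stop early
        for row, label in confident:
            X_lab.append(row)
            y_lab.append(label)
            pseudo.append((row, label))
        remaining = uncertain
    return pseudo, remaining


# Toy data: two features in [0, 1], loosely mimicking methylation beta values.
X_labeled = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y_labeled = ["A", "A", "B", "B"]
X_unlabeled = [[0.15, 0.15], [0.85, 0.85], [0.5, 0.5]]

pseudo_labels, still_unlabeled = self_train(X_labeled, y_labeled, X_unlabeled)
print(pseudo_labels)
print(still_unlabeled)
```

In this toy run the two unlabeled samples that sit close to a class centroid receive pseudo-labels, while the sample midway between the classes stays unlabeled because its confidence never reaches the threshold. The accepted pairs could then be appended to the training set of a downstream supervised model, which is the role the abstract describes for pseudo-labels.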

Tran et al. BMC Bioinformatics (2022) 23:223
https://doi.org/10.1186/s12859-022-04764-1

RESEARCH, Open Access

Quynh T. Tran, Md Zahangir Alom and Brent A. Orr*

*Correspondence: Department of Pathology, St. Jude Children’s Research Hospital, 262 Danny Thomas Place, MS 250, Memphis, TN 38105-3678, USA

Keywords: Semi-supervised learning, Neural network, Artificial intelligence, Supervised classifiers, DNA-methylation, Central nervous system tumor, Machine learning, Random forest

© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Background

Artificial intelligence (AI) technologies have been widely adopted in the diagnostic process of various biomedical disciplines [1–4]. Furthermore, with the advent of high-throughput technologies such as microarrays and nucleic acid sequencers, the use of machine learning and deep learning has become increasingly indispensable in the field of cancer genomics [5–7]. The introduction of these advanced computational methods has provided many opportunities to improve health care and increase the precision of oncologic diagnosis.

A key challenge in medical science is the precise classification of diseases and the development of optimal therapies. This is particularly challenging for brain tumors because of the developmental complexity of the brain. The World Health Organization has defined 82 central nervous system (CNS) tumor classes, encompassing a broad spectrum from benign neoplasms, which can be treated by surgery alone, to malignant tumors that respond poorly even to aggressive adjuvant therapy.

With the advancement in AI and the abundance of genomic and epigenomic data, methylation-based classification of human tumors has emerged as an essential diagnostic tool in the clinical laboratory. Supervised models have been implemented to assist in diagnosing CNS tumors and sarcomas [8, 9]. These initially deployed models are clinically useful but have inherent limitations. Constructing optimal supervised models for methylation-based classification in the clinical environment depends on having a comprehensive set of labeled “gold standard” data for training and validation. Unfortunately, the current reference sets are incomplete, leaving a significant proportion of tumors unclassifiable [8]. Furthermore, the reference cohorts suffer from considerable class imbalance because they lack sufficient examples of rare tumor types to train supervised classification models, thereby degrading model performance.

To fully leverage methylation profiling and machine learning for tumor classification, models should be improved over time by augmenting the training cohorts with additional labeled reference examples of rare tumors and by relabeling samples after additional molecular substructures have been identified within known tumor types. In addition, given the vast amount of publicly available methylation profiling data, model updates would benefit from combining well-characterized data with relevant tumor profiles acquired from large public repositories.

Obtaining additional labeled training data for improving CNS tumor classifiers can be challenging. Current “gold standard” approaches to sample labeling for methylation cohorts include histomorphologic assessment by expert pathologists, orthogonal molecular testing, and unsupervised methods such as dimensionality reduction or cluster analysis. However, establishing a ground truth methylation class is difficult for a subset of tumors because they lack defining gene abnormalities or copy number changes. Additionally, closely related mol (...truncated)


This is a preview of a remote PDF: https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-022-04764-1
Article home page: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04764-1

Tran, Quynh T., Alom, Md Zahangir, Orr, Brent A. Comprehensive study of semi-supervised learning for DNA methylation-based supervised classification of central nervous system tumors, BMC Bioinformatics, 2022, pp. 1-17, Volume 23, Issue 1, DOI: 10.1186/s12859-022-04764-1