Comprehensive study of semi-supervised learning for DNA methylation-based supervised classification of central nervous system tumors
Tran et al. BMC Bioinformatics (2022) 23:223
https://doi.org/10.1186/s12859-022-04764-1
Open Access
RESEARCH
Quynh T. Tran, Md Zahangir Alom and Brent A. Orr*
*Correspondence: Department of Pathology, St. Jude Children’s Research Hospital, 262 Danny Thomas Place, MS 250, Memphis, TN 38105‑3678, USA
Abstract
Background: Precision medicine for cancer treatment relies on an accurate pathological diagnosis. The number of known tumor classes has increased rapidly, and reliance
on traditional methods of histopathologic classification alone has become unfeasible.
To help reduce variability and validation costs and to standardize the histopathological diagnostic process, supervised machine learning models using DNA-methylation data have
been developed for tumor classification. These methods require large labeled training
data sets to obtain clinically acceptable classification accuracy. While there is abundant unlabeled epigenetic data across multiple databases, labeling pathology data for
machine learning models is time-consuming and resource-intensive, especially for rare
tumor types. Semi-supervised learning (SSL) approaches have been used to maximize
the utility of labeled and unlabeled data for classification tasks and have been applied
effectively in genomics. However, SSL methods have not yet been explored with epigenetic data,
nor have they been shown to benefit central nervous system (CNS) tumor classification.
Results: This paper explores the application of semi-supervised machine learning on
methylation data to improve the accuracy of supervised learning models in classifying
CNS tumors. We comprehensively evaluated 11 SSL methods and developed a novel
combination approach that included a self-training with editing using support vector
machine (SETRED-SVM) model and an L2-penalized, multinomial logistic regression
model to obtain high confidence labels from a few labeled instances. Results across
eight random forest and neural network models show that the pseudo-labels derived from
our SSL method can significantly increase prediction accuracy for 82 CNS tumor classes and 9
normal control classes.
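As a rough illustration of how two models might be combined to retain only high-confidence pseudo-labels, the Python sketch below keeps an unlabeled sample only when both models predict the same class with a probability at or above a cutoff. The agreement rule, the 0.8 threshold, the function names, and the class codes "MB" and "EPN" are all assumptions made for this example; the abstract does not state how the SETRED-SVM and logistic regression outputs are combined, and the probabilities are made-up numbers rather than model output.

```python
from typing import Dict, List, Optional

# Per-sample class probabilities, e.g. {"MB": 0.95, "EPN": 0.05}.
Probs = Dict[str, float]


def confident_label(probs: Probs, threshold: float) -> Optional[str]:
    """Return the most probable class if it reaches the threshold, else None."""
    label = max(probs, key=lambda name: probs[name])
    return label if probs[label] >= threshold else None


def select_pseudo_labels(
    svm_probs: List[Probs],
    lr_probs: List[Probs],
    threshold: float = 0.8,  # hypothetical cutoff, not taken from the paper
) -> Dict[int, str]:
    """Keep sample i only if both models confidently predict the same class.

    This agreement filter is an assumed stand-in for the paper's
    combination step, which is not specified in the abstract.
    """
    selected: Dict[int, str] = {}
    for i, (p_svm, p_lr) in enumerate(zip(svm_probs, lr_probs)):
        label_svm = confident_label(p_svm, threshold)
        label_lr = confident_label(p_lr, threshold)
        if label_svm is not None and label_svm == label_lr:
            selected[i] = label_svm
    return selected


# Toy probabilities for three unlabeled samples (illustrative values only).
svm_probs = [
    {"MB": 0.95, "EPN": 0.05},  # confident
    {"MB": 0.55, "EPN": 0.45},  # below the cutoff
    {"MB": 0.10, "EPN": 0.90},  # confident, but disagrees with the other model
]
lr_probs = [
    {"MB": 0.90, "EPN": 0.10},
    {"MB": 0.85, "EPN": 0.15},
    {"MB": 0.85, "EPN": 0.15},
]

pseudo = select_pseudo_labels(svm_probs, lr_probs)
print(pseudo)
```

With these toy inputs only the first sample passes the filter. A stricter threshold trades the number of pseudo-labeled samples for label purity, which matters most for rare classes where a few wrong labels can dominate the added training data.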
Conclusions: The proposed combination of a semi-supervised technique and multinomial logistic regression has the potential to leverage the abundant publicly available unlabeled methylation data effectively. Such an approach is highly beneficial in
providing additional training examples, especially for scarce tumor types, to boost the
prediction accuracy of supervised models.
Keywords: Semi-supervised learning, Neural network, Artificial intelligence,
Supervised classifiers, DNA-methylation, Central nervous system tumor, Machine
learning, Random forest
© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits
use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original
author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third
party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or
exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Background
Artificial intelligence (AI) technologies have been widely adopted in the diagnostic process of various biomedical disciplines [1–4]. Furthermore, with the advent of high-throughput technologies such as microarrays and nucleic acid sequencers, the use of
machine learning and deep learning has also become increasingly indispensable in the
field of cancer genomics [5–7]. The introduction of these advanced computational methods has provided many opportunities to improve health care and increase the precision
of oncologic diagnosis.
A key challenge in medical science is the precise classification of diseases and the
development of optimal therapies. This is particularly challenging for brain tumors because of the developmental complexity of the brain. The World Health
Organization has defined 82 central nervous system (CNS) tumor classes, encompassing a broad spectrum from benign neoplasms, which can be treated by surgery alone, to
malignant tumors that respond poorly even to aggressive adjuvant therapy. With
advances in AI and the abundance of genomic and epigenomic data, methylation-based classification of human tumors has emerged as an essential diagnostic tool in the
clinical laboratory. Supervised models have been implemented to assist in diagnosing
CNS tumors and sarcomas [8, 9].
These initially deployed models are clinically useful but have inherent limitations.
Constructing optimal supervised models for methylation-based classification in the clinical environment is dependent on having a comprehensive set of labeled “gold standard”
data for training and validation. Unfortunately, the current reference sets are incomplete,
leaving a significant proportion of tumors unclassifiable [8]. Furthermore, the
reference cohorts suffer from considerable class imbalance because rare tumor types have too few
examples to train supervised classification models, which degrades
model performance.
To fully leverage methylation profiling and machine learning for tumor classification,
models should be improved over time by augmenting the training cohorts with additional labeled reference examples of rare tumors and relabeling samples after additional
molecular substructures have been identified within known tumor types. In addition,
given the vast amount of publicly available methylation profiling data, model updates would benefit from combining well-characterized data with relevant tumor profiles acquired from
large public repositories.
Obtaining additional labeled training data for improving CNS tumor classifiers can
be challenging. Current “gold standard” approaches to sample labeling for methylation cohorts include a histomorphologic assessment by expert pathologists, orthogonal
molecular testing, and unsupervised methods such as dimensionality reduction or cluster analysis. However, establishing a ground truth methylation class is difficult for a subset of tumors because they lack defining gene abnormalities or copy number changes.
Additionally, closely related mol (...truncated)