A multi-representation deep-learning framework for accurate multicancer classification

He et al. Journal of Translational Medicine (2025) 23:1317
https://doi.org/10.1186/s12967-025-07325-1

RESEARCH
Open Access

Guojing He1,2, Xiao Yang2, Wang Yu2, Mingze Bai2, Dan Pu2* and Kunxian Shu2*

Abstract

Background: Accurate multicancer classification constitutes a cornerstone of modern oncology, offering critical insights into diagnosis, therapeutic decision-making, and prognostication. Numerous existing approaches, however, remain restricted to limited cancer types and typically encode genomic information into a single representational modality. The purpose of this study was to develop and evaluate a novel framework by integrating complementary, mutation-derived features to advance cancer classification.

Methods: We present GraphVar, a multi-representation deep learning framework that integrates mutation-derived imaging and numeric genomic features for multicancer classification. GraphVar generates a spatial variant map by encoding gene-level variant categories as pixel intensities. In parallel, it constructs a numeric feature matrix capturing population allele frequencies and mutation spectra. GraphVar employs a ResNet-18 backbone to extract image-level features, a Transformer encoder to model numeric profiles, and a fusion module to integrate both modalities. Model interpretability was assessed by gradient-weighted class activation mapping (Grad-CAM), and functional relevance was validated using Kyoto Encyclopedia of Genes and Genomes (KEGG)-based pathway enrichment analysis.

Results: In a cohort of 10,112 patients spanning 33 cancer types, GraphVar achieved a precision of 99.85%, a recall of 99.82%, an F1-score of 99.82%, and an accuracy of 99.82%. Grad-CAM highlighted the model’s ability to localize gene-level molecular patterns and prioritize biologically relevant candidates.
The KEGG-based pathway enrichment analysis of kidney renal clear cell carcinoma (KIRC) and breast invasive carcinoma (BRCA) samples supported the biological relevance of GraphVar-identified genes, demonstrating its capacity to capture functionally meaningful genomic signatures.

Conclusions: These findings demonstrate GraphVar as a robust and interpretable framework for multicancer classification. The model’s high accuracy and its ability to identify functionally meaningful genomic signatures indicate its potential as a tool to support precision diagnostics and therapeutic strategies, warranting further translational studies.

Keywords: Multi-representation, Deep learning, Transformer, Variant

*Correspondence: Dan Pu, Kunxian Shu
1 College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Chongqing 400065, China
2 Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Chongqing 400065, China

© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Background

Cancer remains the second leading cause of mortality worldwide, with over 8 million fatalities annually; it is anticipated that the incidence of cancer will rise by over 50% in the coming decades [1, 2]. Accurate determination of cancer types holds immense potential to enhance outcome predictions, guide therapy selection, and deepen our understanding of heterogeneity. Cancer classification has traditionally relied on molecular characteristics. However, these molecular-based classifications fail to fully account for cancer heterogeneity [3, 4]. Consequently, there is an urgent requirement to develop robust and scalable methodologies to advance cancer classification.

Advancements in next-generation sequencing (NGS) technologies have facilitated the comprehensive characterization of diverse genomic alterations in a substantial number of tumor cohorts [2, 5, 6]. Researchers have discovered that cancer development is predominantly driven by the progressive accumulation of somatic variants [7, 8]. Importantly, the mutational landscape exhibits pronounced heterogeneity across distinct cancer types. Specific cancer types are frequently characterized by variants, amplifications, or deletions of specific oncogenes or tumor suppressor genes that are infrequently or rarely observed in other cancer types [5, 9, 10]. For instance, lung tumors are enriched for G>T transversions attributable to exposure to polycyclic aromatic hydrocarbons from tobacco smoke [5], whereas melanomas are characterized by a predominance of C>T substitutions arising from UV-induced DNA damage and misrepair [11].
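To make the notion of a cancer-type-specific substitution pattern concrete, the following minimal sketch tallies single-nucleotide substitutions into the six strand-collapsed classes commonly used for mutation spectra. It is an illustration only and not the GraphVar feature pipeline: the function names `substitution_class` and `mutation_spectrum` and the toy input are hypothetical, and the pyrimidine-reference convention (under which a G>T change is counted as C>A) is an assumption rather than something stated in this article.

```python
from collections import Counter

# Watson-Crick complements, used to fold purine-reference changes
# onto the pyrimidine strand (assumed convention, see lead-in).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# The six strand-collapsed substitution classes.
CLASSES = ("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")


def substitution_class(ref: str, alt: str) -> str:
    """Map a single-base change to its pyrimidine-reference class."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    if ref in ("A", "G"):
        # Purine reference: report the change on the opposite strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def mutation_spectrum(snvs):
    """Return the fraction of substitutions falling in each of the six classes."""
    counts = Counter(substitution_class(ref, alt) for ref, alt in snvs)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CLASSES}


if __name__ == "__main__":
    # Hypothetical toy input: (reference base, alternate base) pairs.
    toy_snvs = [("G", "T"), ("C", "A"), ("C", "T"), ("G", "A")]
    for cls, frac in mutation_spectrum(toy_snvs).items():
        print(cls, round(frac, 2))
```

Under this convention, the tobacco-associated G>T transversions would accumulate in the C>A class and the UV-associated changes in the C>T class, so two tumors with different exposures would yield visibly different six-element vectors.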
Recent advances in machine learning (ML) have led to highly accurate, robust, and reproducible performance across a wide spectrum of diagnostic tasks in medicine, particularly in identifying characteristics that are not typically recognized by human experts [12]. Consequently, a number of ML-based approaches have been developed for cancer prediction and classification based on the analysis of somatic genomic alterations. Early work by Chen et al. introduced an ML-based framework for cancer site classification employing somatic variant profiles, achieving 62% accuracy across 17 tumor types. Their findings suggested that variant-derived signatures could support the refinement of molecularly targeted therapies [13]. Subsequent studies by Zelli et al. and Soh et al. investigated ML-based classifiers trained on somatic point mutations (SPMs) and copy number variations (CNVs) to improve cancer-type prediction. Their results demonstrated that the integrative modeling of SPMs and CNVs substantially enhanced the predictive accuracy of ML-based diagnostic frameworks [14, 15]. More recently, Nguyen et al. reported approximately 90% accuracy in the classification of 35 cancer subtypes by leveraging ML models trained on composite features derived from both d (...truncated)