A multi-representation deep-learning framework for accurate multicancer classification
Journal of Translational
Medicine
He et al. Journal of Translational Medicine
(2025) 23:1317
https://doi.org/10.1186/s12967-025-07325-1
Open Access
RESEARCH
Guojing He1,2, Xiao Yang2, Wang Yu2, Mingze Bai2, Dan Pu2*
and Kunxian Shu2*
Abstract
Background Accurate multicancer classification constitutes a cornerstone of modern oncology, offering critical
insights into diagnosis, therapeutic decision-making, and prognostication. Numerous existing approaches, however,
remain restricted to limited cancer types and typically encode genomic information into a single representational
modality. The purpose of this study was to develop and evaluate a novel framework by integrating complementary,
mutation-derived features to advance cancer classification.
Methods We present GraphVar, a multi-representation deep learning framework that integrates mutation-derived
imaging and numeric genomic features for multicancer classification. GraphVar generates a spatial variant map by
encoding gene-level variant categories as pixel intensities. In parallel, it constructs a numeric feature matrix capturing
population allele frequencies and mutation spectra. GraphVar employs a ResNet-18 backbone to extract image-level
features, a Transformer encoder to model numeric profiles, and a fusion module to integrate both modalities. Model
interpretability was assessed by gradient-weighted class activation mapping (Grad-CAM), and functional relevance
was validated using Kyoto Encyclopedia of Genes and Genomes (KEGG)-based pathway enrichment analysis.
Results In a cohort of 10,112 patients spanning 33 cancer types, GraphVar achieved a precision of 99.85%, a recall of
99.82%, an F1-score of 99.82%, and an accuracy of 99.82%. Grad-CAM highlighted the model’s ability to localize gene-level molecular patterns and prioritize biologically relevant candidates. The KEGG-based pathway enrichment analysis
of kidney renal clear cell carcinoma (KIRC) and breast invasive carcinoma (BRCA) samples supported the biological
relevance of GraphVar-identified genes, demonstrating its capacity to capture functionally meaningful genomic
signatures.
Conclusions These findings demonstrate GraphVar as a robust and interpretable framework for multicancer
classification. The model’s high accuracy and its ability to identify functionally meaningful genomic signatures
indicate its potential as a tool to support precision diagnostics and therapeutic strategies, warranting further
translational studies.
Keywords Multi-representation, Deep learning, Transformer, Variant
*Correspondence:
Dan Pu
Kunxian Shu
1 College of Computer Science and Technology, Chongqing University of
Posts and Telecommunications, No. 2 Chongwen Road,
Chongqing 400065, China
2 Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing
University of Posts and Telecommunications, No. 2 Chongwen Road,
Chongqing 400065, China
© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0
International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you
give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the
licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or
exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by-nc-nd/4.0/.
Background
Cancer remains the second leading cause of mortality worldwide, with over 8 million fatalities annually;
it is anticipated that the incidence of cancer will rise by
over 50% in the coming decades [1, 2]. Accurate determination of cancer types holds immense potential to
enhance outcome predictions, guide therapy selection,
and deepen our understanding of heterogeneity. Cancer classification has traditionally relied on molecular
characteristics. However, these molecular-based classifications fail to fully account for cancer heterogeneity
[3, 4]. Consequently, there is an urgent need to
develop robust and scalable methodologies to advance
cancer classification. Advancements in next-generation
sequencing (NGS) technologies have facilitated the comprehensive characterization of diverse genomic alterations in a substantial number of tumor cohorts [2, 5, 6].
Researchers have discovered that cancer development is
predominantly driven by the progressive accumulation
of somatic variants [7, 8]. Importantly, the mutational
landscape exhibits pronounced heterogeneity across distinct cancer types. Individual cancer types are frequently
characterized by variants, amplifications, or deletions of
particular oncogenes or tumor suppressor genes that are
rarely observed in other cancer types [5,
9, 10]. For instance, lung tumors are enriched for G > T
transversions attributable to exposure to polycyclic aromatic hydrocarbons from tobacco smoke [5], whereas
melanomas are characterized by a predominance of C > T
substitutions arising from UV-induced DNA damage and
misrepair [11].
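To make the notion of a cancer-type-specific substitution pattern concrete, the following minimal sketch computes a six-class single-base substitution spectrum for one sample. It is an illustration only and not the feature construction used by GraphVar: it assumes the common convention of folding purine-reference changes onto the pyrimidine strand (so G > T is counted as C>A), and the function names and the toy sample are hypothetical.

```python
from collections import Counter

# Complement map used to fold purine-reference substitutions onto the
# pyrimidine strand (for example, G>T is counted as C>A).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# The six pyrimidine-referenced substitution classes.
CLASSES = ("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")


def substitution_class(ref: str, alt: str) -> str:
    """Return the pyrimidine-referenced class of a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    if ref in ("A", "G"):
        # Report the change as seen on the opposite strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def mutation_spectrum(variants):
    """Return the fraction of substitutions falling in each of the six classes."""
    counts = Counter(substitution_class(ref, alt) for ref, alt in variants)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CLASSES}


# Hypothetical toy sample: (reference base, alternate base) pairs.
toy_sample = [("G", "T"), ("C", "A"), ("C", "T"), ("G", "A")]
spectrum = mutation_spectrum(toy_sample)
```

Under this convention, a sample dominated by tobacco-associated G > T changes would show an elevated C>A fraction, whereas a UV-exposed melanoma sample would show an elevated C>T fraction.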
Recent advances in machine learning (ML) have led to
highly accurate, robust, and reproducible performance
across a wide spectrum of diagnostic tasks in medicine,
particularly in identifying characteristics that are not typically recognized by human experts [12]. Consequently,
a number of ML-based approaches have been developed
for cancer prediction and classification based on the
analysis of somatic genomic alterations. Early work by
Chen et al. introduced an ML-based framework for cancer site classification employing somatic variant profiles,
achieving 62% accuracy across 17 tumor types. Their
findings suggested that variant-derived signatures could
support the refinement of molecularly targeted therapies [13]. Subsequent studies by Zelli et al. and Soh et
al. investigated ML-based classifiers trained on somatic
point mutations (SPMs) and copy number variations
(CNVs) to improve cancer-type prediction. Their results
demonstrated that the integrative modeling of SPMs and
CNVs substantially enhanced the predictive accuracy of
ML-based diagnostic frameworks [14, 15]. More recently,
Nguyen et al. reported approximately 90% accuracy in
the classification of 35 cancer subtypes by leveraging ML
models trained on composite features derived from both
d (...truncated)