Gene expression prediction based on neighbour connection neural network utilizing gene interaction graphs
PLOS ONE
RESEARCH ARTICLE
Xuanyu Li1,2, Xuan Zhang ID3,4*, Wenduo He3,4, Deliang Bu5, Sanguo Zhang1,2
1 School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China, 2 Key
Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, China,
3 Institute for Network Sciences and Cyberspace (INSC), Tsinghua University, Beijing, China,
4 Zhongguancun Laboratory, Beijing, China, 5 School of Statistics, Capital University of Economics and
Business, Beijing, China
Abstract
OPEN ACCESS
Citation: Li X, Zhang X, He W, Bu D, Zhang S
(2023) Gene expression prediction based on
neighbour connection neural network utilizing gene
interaction graphs. PLoS ONE 18(2): e0281286.
https://doi.org/10.1371/journal.pone.0281286
Editor: Sathishkumar V E, Hanyang University,
KOREA, REPUBLIC OF
Having observed that gene expression levels are correlated, the Library of Integrated Network-based Cellular Signatures (LINCS) program selects 1,000 landmark genes to predict the expression values of the remaining
genes. Subsequent work has improved the prediction results by using deep
learning models. However, these models ignore the latent structure of genes, which limits
their accuracy. We therefore propose a novel neural network, the
Neighbour Connection Neural Network (NCNN), to utilize the information in gene interaction graphs. Compared with the popular GCN model, our model incorporates the graph information
more effectively. We validate our model under two different settings and show that it
improves prediction accuracy compared with the other models.
Received: October 17, 2022
Accepted: January 19, 2023
Published: February 6, 2023
Copyright: © 2023 Li et al. This is an open access
article distributed under the terms of the Creative
Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in
any medium, provided the original author and
source are credited.
Data Availability Statement: The third-party data
used for the training are publicly available at https://
cbcl.ics.uci.edu/public_data/D-GEX/. The authors
had no special access privileges, and other
researchers would be able to access this data in the
same manner. The data-preprocessing and python
implementation are publicly available at https://
github.com/Xuanyu-Li/NCNN.
Funding: This work was supported by the National
Natural Science Foundation of China
(12171454), and the Key R&D Program of Guangxi
(2020AB10023). The funders had no role in study
Introduction
Gene expression data, which describe the process of converting DNA materials into functional
products [1], have been an important tool for medical diagnosis and for gaining insights into complex diseases [2, 3]. With advances in DNA microarray [4] and RNA-seq technologies [5, 6],
the cellular response can be studied through thousands of expression measurements under a wide variety
of conditions, such as diseases, genetic mutations, and the intake of medicines and drugs. This kind of study is called gene expression profiling.
Although a large amount of gene expression data has been collected and deposited [7, 8], whole-genome
profiling is still too expensive for broad use, since it requires measuring
a large number of genes under many conditions. For example, the initial phase of the
CMap project produced only 564 genome-wide gene expression profiles [9]. One way to reduce the expense of whole-genome profiling is to exploit the high correlation among
different genes [10] and select a group of genes to represent overall genome expression.
Researchers from the LINCS program performed principal component analysis (PCA) and
found that 1,000 carefully chosen genes (named landmark genes) were sufficient to recover
PLOS ONE | https://doi.org/10.1371/journal.pone.0281286 February 6, 2023
design, data collection and analysis, decision to
publish, or preparation of the manuscript.
Competing interests: The authors have declared
that no competing interests exist.
Gene expression prediction utilizing gene interaction graphs
80% of the information in the whole genome [11]. They then developed the L1000 Luminex
bead technology to measure the expression profiles of these 1,000 genes at a much lower cost.
Many studies have since built on this cost-effective strategy [12, 13].
Despite the low cost of the L1000 program, a natural question is how to infer the expression of
the other genes, named target genes, from these landmark genes. The original approach
of the LINCS program adopts simple linear regression. Although classic and computationally
efficient, linear regression cannot capture the nonlinear relationships between landmark genes
and target genes. With the development of deep learning methods, Li et al. [10] proposed
D-GEX, a method based on a fully connected neural network, and achieved better results than linear
regression on both DNA microarray and RNA sequencing data.
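The limitation of linear regression noted above can be illustrated with a toy sketch. The numbers below are synthetic and purely hypothetical (one made-up landmark gene and one made-up target gene related by y = x^2); they are not LINCS data, and the helper names `fit_line` and `mean_abs_error` are introduced only for this illustration. The sketch fits ordinary least squares twice, once on the raw landmark value and once after a nonlinear feature map, to show why a model that can represent nonlinear relationships may help.

```python
# Toy illustration with synthetic numbers (assumption: not LINCS data).
# One hypothetical landmark gene x and one hypothetical target gene y = x**2.

def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(preds, ys):
    """Average absolute difference between predictions and true values."""
    return sum(abs(p - y) for p, y in zip(preds, ys)) / len(ys)

landmark = [-2.0, -1.0, 0.0, 1.0, 2.0]
target = [x ** 2 for x in landmark]

# Linear fit on the raw landmark value: the best line is flat.
slope, intercept = fit_line(landmark, target)
linear_mae = mean_abs_error([slope * x + intercept for x in landmark], target)

# Same least-squares fit after a nonlinear feature map x -> x**2.
squared = [x ** 2 for x in landmark]
slope_sq, intercept_sq = fit_line(squared, target)
nonlinear_mae = mean_abs_error(
    [slope_sq * s + intercept_sq for s in squared], target
)

print(f"linear fit:    slope={slope:.2f}, MAE={linear_mae:.2f}")
print(f"nonlinear fit: slope={slope_sq:.2f}, MAE={nonlinear_mae:.2f}")
```

In this contrived case the linear fit has zero slope and cannot track the target at all, while the same estimator recovers it exactly once a nonlinear feature is available. A neural network learns such features from data instead of having them specified by hand.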
Although D-GEX performs much better than traditional methods, it may be further
improved. D-GEX uses a fully connected neural network, which implicitly assumes that
landmark genes are interchangeable. In other words, the landmark gene expression data can
be fed into the fully connected neural network in any order without affecting the final result.
The motivation of this paper comes from asking whether this implicit assumption holds
for gene expression data. As shown in many biological studies [14, 15], genes have an
inherent structure, and cells can coordinate the regulation of many genes at once.
The D-GEX model thus neglects the latent structure of the landmark genes, and it is beneficial
to incorporate external information describing the structure of genes into the deep learning
method. The gene interaction graph, which depicts such coordination by recording functional
biological interactions between pairs of genes, is a natural candidate. In the gene interaction graph,
nodes represent genes, and edges represent functional biological interactions between two
genes. Many gene interaction graphs have been constructed from different molecular
levels (Szklarczyk et al. [16]; Warde-Farley et al. [17]): protein-protein interactions, transcription factors, and gene co-expression are common sources for constructing gene interaction
graphs. In addition, the processing of graph data has
recently drawn major interest in the deep learning literature [18]. Any neural network that operates on graph data can be
categorized as a graph neural network (GNN). In particular, the graph convolutional network
(GCN) has been a predominant approach [19] among graph neural networks.
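The interchangeability of inputs in a fully connected layer can be checked with a short sketch. The code below is an illustration only, not the D-GEX implementation: the layer sizes, the random weights, and the helper name `dense` are assumptions made for the example. It shows that reordering the landmark genes, together with the matching reordering of the first-layer weight columns, leaves the layer output unchanged, so the architecture itself carries no information about which genes are related.

```python
import random

def dense(weights, bias, x):
    """One fully connected layer with ReLU activation: relu(W x + b)."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]

random.seed(0)
n_landmark, n_hidden = 6, 4  # hypothetical sizes, chosen for readability

# Random weights and a random expression vector (synthetic values).
W = [[random.uniform(-1, 1) for _ in range(n_landmark)] for _ in range(n_hidden)]
b = [random.uniform(-1, 1) for _ in range(n_hidden)]
x = [random.uniform(0, 1) for _ in range(n_landmark)]

# Shuffle the gene order, and permute the weight columns in the same way.
perm = list(range(n_landmark))
random.shuffle(perm)
x_perm = [x[j] for j in perm]
W_perm = [[row[j] for j in perm] for row in W]

out = dense(W, b, x)
out_perm = dense(W_perm, b, x_perm)

# The two outputs agree up to floating-point rounding.
max_diff = max(abs(a - c) for a, c in zip(out, out_perm))
print(f"largest difference between outputs: {max_diff:.2e}")
```

Because any gene ordering corresponds to an equivalent fully connected network, the model has no built-in notion of neighbouring genes. A graph-based layer differs in that its computation depends on which genes are connected by edges.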
In this paper, we will briefly introduce the classical graph convolutional network structure
and then compare the GCN w (...truncated)