Gene expression prediction based on neighbour connection neural network utilizing gene interaction graphs


PLOS ONE
RESEARCH ARTICLE

Xuanyu Li1,2, Xuan Zhang3,4*, Wenduo He3,4, Deliang Bu5, Sanguo Zhang1,2

1 School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
2 Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, China
3 Institute for Network Sciences and Cyberspace (INSC), Tsinghua University, Beijing, China
4 Zhongguancun Laboratory, Beijing, China
5 School of Statistics, Capital University of Economics and Business, Beijing, China

Abstract

Having observed that gene expression levels are correlated, the Library of Integrated Network-based Cell-Signature (LINCS) program selects 1000 landmark genes to predict the remaining gene expression values. Subsequent work has improved the prediction results by using deep learning models. However, these models ignore the latent structure of genes, limiting the accuracy of the experimental results. We therefore propose a novel neural network named Neighbour Connection Neural Network (NCNN) to utilize the gene interaction graph information. Compared to the popular GCN model, our model incorporates the graph information in a better manner. We validate our model under two different settings and show that our model improves prediction accuracy compared to the other models.

OPEN ACCESS

Citation: Li X, Zhang X, He W, Bu D, Zhang S (2023) Gene expression prediction based on neighbour connection neural network utilizing gene interaction graphs. PLoS ONE 18(2): e0281286. https://doi.org/10.1371/journal.pone.0281286

Editor: Sathishkumar V E, Hanyang University, KOREA, REPUBLIC OF

Received: October 17, 2022
Accepted: January 19, 2023
Published: February 6, 2023

Copyright: © 2023 Li et al.
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: The third-party data used for the training are publicly available at https://cbcl.ics.uci.edu/public_data/D-GEX/. The authors had no special access privileges, and other researchers would be able to access this data in the same manner. The data-preprocessing and python implementation are publicly available at https://github.com/Xuanyu-Li/NCNN.

Funding: This work was supported by the National Natural Science Foundation of China (12171454), and the Key R&D Program of Guangxi (2020AB10023). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Gene expression data, which describe the process of converting DNA materials into functional products [1], have been an important tool for medical diagnosis and for gaining insights into complex diseases [2, 3]. With advances in DNA microarray [4] and RNA-seq technologies [5, 6], the cellular response can be studied through thousands of expression measurements under a wide variety of conditions such as diseases, genetic mutations and intake of medicines and drugs. The corresponding study is called gene expression profiling. Although a large amount of gene expression data has been collected and deposited [7, 8], whole genome profiling is still too expensive for broad use, since it requires collecting data on a large number of genes under various conditions. For example, the initial phase of the CMap project produced only 564 genome-wide gene expression profiles [9]. One solution to reduce the expense of whole genome profiling is to exploit the high correlation among different genes [10] and select a group of genes to represent overall genome expression.

Researchers from the LINCS program performed principal component analysis (PCA) and found that 1,000 carefully chosen genes (named landmark genes) were sufficient to recover 80% of the information in the whole genome [11]. They then developed the L1000 Luminex bead technology to measure the expression profiles of these 1,000 genes at a much lower cost. Many studies have since built on this cost-effective strategy [12, 13].

Despite the low cost of the L1000 program, a natural question is how to infer the other genes, named target genes, from these landmark genes. The original paper from the LINCS program adopts simple linear regression. Although classic and computationally efficient, linear regression cannot capture the nonlinear relationship between landmark genes and target genes. With the development of deep learning methods, Li et al. [10] proposed D-GEX, a method based on a fully connected neural network, and achieved better results than linear regression on both DNA microarray and RNA sequencing data.

Although D-GEX performs much better than traditional methods, it may be further improved. D-GEX uses a fully connected neural network, which implicitly assumes that landmark genes are interchangeable. In other words, the landmark gene expression data can be fed into the fully connected neural network in any order without affecting the final result. The motivation of this paper comes from considering whether this implicit assumption holds for gene expression data.
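The interchangeability point can be checked numerically. The following is a minimal sketch, assuming synthetic data and randomly initialised weights; it is not the D-GEX or LINCS implementation, and every name in it (fit_linear, dense_forward, the array sizes) is hypothetical. It shows that both a least-squares linear model and a one-hidden-layer fully connected network define the same function when the landmark genes are reordered, provided the first-layer weights are reordered in the same way, so the input order itself carries no structural information.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 200 samples, 8 "landmark" genes, 3 "target" genes.
n_samples, n_landmark, n_target = 200, 8, 3
X = rng.normal(size=(n_samples, n_landmark))
true_W = rng.normal(size=(n_landmark, n_target))
Y = np.tanh(X @ true_W) + 0.05 * rng.normal(size=(n_samples, n_target))


def add_intercept(X):
    """Append a column of ones so the linear model has an intercept term."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_linear(X, Y):
    """Least-squares fit of one linear model per target gene."""
    coef, *_ = np.linalg.lstsq(add_intercept(X), Y, rcond=None)
    return coef


def predict_linear(coef, X):
    return add_intercept(X) @ coef


def dense_forward(X, W1, b1, W2, b2):
    """One-hidden-layer fully connected network with tanh activation."""
    return np.tanh(X @ W1 + b1) @ W2 + b2


# Randomly initialised network weights (no training is needed for this check).
hidden = 16
W1 = rng.normal(size=(n_landmark, hidden))
b1 = rng.normal(size=hidden)
W2 = rng.normal(size=(hidden, n_target))
b2 = rng.normal(size=n_target)

# Shuffle the landmark genes and shuffle the first-layer weight rows to match.
perm = rng.permutation(n_landmark)
out_original = dense_forward(X, W1, b1, W2, b2)
out_permuted = dense_forward(X[:, perm], W1[perm, :], b1, W2, b2)
same_function = np.allclose(out_original, out_permuted)

# The linear baseline behaves the same way: refitting on shuffled columns
# yields the same predictions.
coef = fit_linear(X, Y)
coef_perm = fit_linear(X[:, perm], Y)
linear_same = np.allclose(predict_linear(coef, X),
                          predict_linear(coef_perm, X[:, perm]))

print("dense network unchanged under gene reordering:", same_function)
print("linear baseline unchanged under gene reordering:", linear_same)
```

Because any reordering of the inputs can be absorbed into the weights, neither model can express that two particular genes are related; that relational information has to come from an outside source.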
As shown in many biological studies [14, 15], genes have an inherent structure, and cells can coordinate the regulation of many genes at once. The D-GEX model thus neglects the latent structure of the landmark genes, and it is beneficial to incorporate external information that describes the structure of genes into the deep learning method. The gene interaction graph, which depicts such coordination through functional biological interactions between pairs of genes, is a natural candidate. In the gene interaction graph, nodes represent genes, and edges represent the functional biological interaction between two genes. Many gene interaction graphs have been constructed from different molecular levels (Szklarczyk et al. [16]; Warde-Farley et al. [17]): protein-protein interaction, transcription factors, and gene co-expression are the common materials used to construct gene interaction graphs.

Another aspect is that, in the deep learning literature, the processing of graph data has recently drawn major interest [18]. Any neural network working on graph data can be categorized as a graph neural network (GNN). In particular, the graph convolutional network (GCN) has been a predominant approach [19] among graph neural networks. In this paper, we will briefly introduce the classical graph convolutional network structure and then compare the GCN w (...truncated)