A Computational Study Identifies HIV Progression-Related Genes Using mRMR and Shortest Path Tracing (pdf)

Article PDF:

https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0078057&type=printable

Citation: Ma C, Dong X, Li R, Liu L ( A Computational Study Identifies HIV Progression-Related Genes Using mRMR and Shortest Path Tracing

Chengcheng Ma, Xiao Dong, Rudong Li, Lei Liu

1 Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, P.R. China
2 University of Chinese Academy of Sciences, Beijing, P.R. China
3 Shanghai Center for Bioinformation Technology, Shanghai, P.R. China
4 Institutes for Biomedical Sciences, Fudan University, Shanghai, P.R. China

Editor: Yuntao Wu, George Mason University, United States of America

Since statistical relationships between HIV load and CD4+ T cell loss have been shown to be weak, searching for host factors that contribute to the pathogenesis of HIV infection has become key both to understanding the disease pathology and to developing treatments. We applied the Maximum Relevance Minimum Redundancy (mRMR) algorithm to a set of microarray data generated from the CD4+ T cells of viremic non-progressors (VNPs) and rapid progressors (RPs) to identify host factors associated with the different responses to HIV infection. Using the mRMR algorithm, 147 genes were identified. Furthermore, we constructed a weighted molecular interaction network from the existing protein-protein interaction data in the STRING database and identified 1331 genes on the shortest paths among the genes identified with mRMR. Functional analysis shows that functions related to apoptosis play important roles in the pathogenesis of HIV infection. These results provide new insights into HIV progression.

Funding: This work was funded by a Ministry of Science and Technology Grant of P.R. China, No. 2012AA02A602. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

These authors contributed equally to this work.
Many efforts have been devoted to better understanding the mechanisms governing disease progression and non-progression during HIV infection. Besides the direct cytotoxic effect of HIV on CD4+ T cells, immune activation is widely accepted as a good predictor of disease progression [1,2,3,4,5]. Furthermore, in some clinical studies, immunosuppressive drugs used to suppress immune activation slowed disease progression [6,7]. However, the molecular mechanism underlying the immunopathogenesis remains obscure.

To identify the host factors important for HIV-1 pathogenesis and disease progression, high-throughput techniques have been employed. Genome-wide association studies revealed the protective effect against the virus of human leukocyte antigens (HLAs) including HLA-B*57:01 and B*27:05, and risk alleles including HLA-B*35 and Cw*07 [8,9]. These studies further led to the finding of the protective effect of HLA-C [10]. Transcriptome studies have also yielded important insights regarding interferon-stimulated genes (ISGs), immune activation, cell cycle and cell death during the infection. Transcriptome studies have been conducted to identify factors that affect viral control and the speed of CD4+ T cell loss [11,12]. One study with 137 HIV seroconverters, 16 elite controllers and 3 healthy blood donors attempted to identify molecular factors associated with viral control. Surprisingly, successful treatment made the transcriptome states of patients similar to those of the elite controllers and the HIV-negative donors [11]. Another study compared the transcriptomes of 6 viremic non-progressors (VNPs) and more than 20 rapid progressors (RPs). No significant result was found. Instead, genes identified from monkey data (CASP1, CD38, LAG3, SOCS1, EEF1D, and TNFSF13B) were deemed to be the factors affecting the speed of disease progression [12].
Machine learning has recently proved to be an effective strategy for accurate classification of phenotypes based on transcriptome data (gene expression microarrays) [13,14,15,16,17]. Among such methods, the minimum redundancy maximum relevance (mRMR) method is robust and represents a broad spectrum of characteristics [18,19]. It was also developed to identify disease-related genes from expression profiles [18,19]. Another useful informatics strategy for identifying disease candidate genes relies on known protein-protein interactions (PPIs). Proteins function not only individually but also together with their interaction partners; the interaction partners of disease-related genes are therefore also important candidates for further causal studies of disease. The STRING (Search Tool for the Retrieval of Interacting Genes) database is an online resource that provides PPI information derived from both predictions and experimental observations [20]. Here, we present a comprehensive informatics study based on transcriptional profiling of three different groups of HIV patients: rapid progressors (RPs), elite controllers (ECs) and viremic non-progressors (VNPs). We attempted to i) identify a gene set that can classify the three groups well, by using mRMR feature selection; and ii) provide candidate causal genes for further experimental studies, by using shortest-path analysis of the above gene set.

Materials and Methods

Gene expression profiling dataset of HIV patients

The dataset came from a study of HIV infection by Rotger et al. [12]. In total, 78 chips were used in that study. We adopted the data generated from CD4+ T cells, which comprise 40 microarrays (8 elite controllers (ECs), 27 rapid progressors (RPs), 5 viremic non-progressors (VNPs)). Using this dataset alone, Rotger et al. did not observe any differentially expressed genes. The data were downloaded from the NCBI Gene Expression Omnibus (GEO) under accession number GSE28128.
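The shortest-path idea named in aim ii) can be illustrated with a small sketch. This is not the authors' implementation: the node names, the toy confidence scores, and the conversion of a score into an edge weight (`max_score - score`, so that high-confidence interactions are "short") are all assumptions made for illustration. The sketch runs Dijkstra's algorithm between every pair of seed genes and collects the non-seed genes lying on those paths.

```python
import heapq
from itertools import combinations


def build_graph(scored_edges, max_score=1000):
    """Build an undirected weighted graph from (a, b, score) triples.

    Assumption: weight = max_score - score, so stronger interactions
    correspond to shorter edges.
    """
    graph = {}
    for a, b, score in scored_edges:
        weight = max_score - score
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def dijkstra_path(graph, source, target):
    """Return one minimum-weight path from source to target, or None."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return None
    # Walk the predecessor links back from target to source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def genes_on_shortest_paths(graph, seeds):
    """Collect non-seed nodes on shortest paths between all seed pairs."""
    found = set()
    for a, b in combinations(sorted(seeds), 2):
        path = dijkstra_path(graph, a, b)
        if path:
            found.update(path[1:-1])  # interior nodes only
    return found - set(seeds)


# Hypothetical toy network; names and scores are placeholders, not STRING data.
toy_edges = [
    ("SEED_A", "G1", 900), ("G1", "SEED_B", 900),
    ("SEED_A", "G2", 400), ("G2", "SEED_B", 400),
    ("SEED_B", "G3", 950), ("G3", "SEED_C", 950),
    ("SEED_A", "SEED_C", 200),
]
graph = build_graph(toy_edges)
seeds = {"SEED_A", "SEED_B", "SEED_C"}
bridge_genes = genes_on_shortest_paths(graph, seeds)
print(sorted(bridge_genes))
```

In this toy network the low-confidence route through `G2` is never chosen, so `G2` is not reported. On a real interaction network the same loop would be run over all pairs of mRMR-selected genes.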
The expression profile was generated using the Illumina HumanWG-6 v3.0 expression BeadChip microarray. Bead summary data were the output of Illumina's BeadStudio software without background correction. Genes declared as non-expressed (P > 0.01) were excluded from further analysis. Data preprocessing, including quantile normalization and log2 transformation, was completed in the Partek Genomics Suite package (Partek Inc.).

Minimum redundancy maximum relevance algorithm

The minimum redundancy maximum relevance algorithm for selecting features (genes) was developed to balance the relevance of each feature to the target (phenotype) against the redundancy between features [18]. Both relevance and redundancy are quantified using mutual information (MI). In this study, mRMR was implemented using the R package mRMRe [21], in which MI is estimated as

$$I(x, y) = -\frac{1}{2} \ln\left(1 - \rho(x, y)^2\right)$$

where I and ρ represent the MI and the correlation coefficient between variables x and y, respectively. Let y be the output variable (phenotype) and X = {x1, ..., xn} the set of input features (genes). Given xi as the feature with the highest MI with the output variable, so the (...truncated)