A Computational Study Identifies HIV Progression-Related Genes Using mRMR and Shortest Path Tracing
Citation: Ma C, Dong X, Li R, Liu L (
A Computational Study Identifies HIV Progression-Related Genes Using mRMR and Shortest Path Tracing
Chengcheng Ma 0
Xiao Dong 0
Rudong Li 0
Lei Liu 0
Yuntao Wu, George Mason University, United States of America
0 1 Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, P.R. China, 2 University of Chinese Academy of Sciences, Beijing, P.R. China, 3 Shanghai Center for Bioinformation Technology, Shanghai, P.R. China, 4 Institutes for Biomedical Sciences, Fudan University, Shanghai, P.R. China
Because the statistical relationship between HIV load and CD4+ T cell loss has been shown to be weak, the search for host factors contributing to the pathogenesis of HIV infection is key both to understanding the disease pathology and to developing treatments. We applied the Maximum Relevance Minimum Redundancy (mRMR) algorithm to a set of microarray data generated from the CD4+ T cells of viremic non-progressors (VNPs) and rapid progressors (RPs) to identify host factors associated with the different responses to HIV infection. Using the mRMR algorithm, 147 genes were identified. Furthermore, we constructed a weighted molecular interaction network from the existing protein-protein interaction data in the STRING database and identified 1331 genes on the shortest paths among the genes identified with mRMR. Functional analysis shows that functions related to apoptosis play important roles in the pathogenesis of HIV infection. These results provide new insights into HIV progression.
Funding: This work was funded by a Ministry of Science and Technology Grant of P.R. China. No. 2012AA02A602. The funders had no role in study design, data
collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
These authors contributed equally to this work.
Many efforts have been devoted to better understanding the
mechanisms governing disease progression and non-progression
during HIV infection. Besides the direct cytotoxic effect of HIV
on CD4+ T cells, immune activation is widely
accepted as a good predictor of disease progression [1,2,3,4,5].
Furthermore, in some clinical studies, immunosuppressive
drugs used to suppress immune activation slowed disease
progression [6,7]. However, the molecular mechanism underlying
the immunopathogenesis remains obscure. To identify the host
factors important for HIV-1 pathogenesis and disease
progression, high-throughput techniques have been employed.
Genome-wide association studies revealed the protective effect
against the virus of human leukocyte antigen (HLA) alleles including
HLA-B*57:01 and B*27:05, as well as risk alleles including HLA-B*35
and Cw*07 [8,9]. These studies further led to the finding of the
protective effect of HLA-C [10]. Transcriptome studies have also yielded
important insights regarding interferon-stimulated genes (ISGs),
immune activation, cell cycle and cell death during the infection.
Transcriptome studies have also been conducted to identify
factors that affect viral control and the speed of CD4+ T cell loss
[11,12]. A study of 137 HIV seroconverters, 16 elite
controllers and 3 healthy blood donors attempted to identify
molecular factors associated with viral control.
Surprisingly, successful treatment made the transcriptome states of
patients similar to those of the elite controllers and the HIV-negative
donors [11]. Another study compared the transcriptomes of 6
viremic non-progressors (VNPs) and more than 20 rapid
progressors (RPs). No significant result was found. Genes identified
from the data of monkeys (CASP1, CD38, LAG3, SOCS1, EEF1D,
and TNFSF13B) were deemed to be factors affecting the speed of
disease progression [12].
Machine learning has recently proved to be an effective strategy
for accurate classification of phenotypes based on transcriptome
data (gene expression microarrays) [13,14,15,16,17]. Among such methods,
the minimum redundancy maximum relevance (mRMR) method is
robust and represents a broad spectrum of characteristics [18,19].
It was also developed to identify disease-related genes from
expression profiles [18,19].
Another useful informatics strategy for identifying disease candidate
genes relies on known protein-protein interactions (PPIs).
Because proteins function not only individually but
also together with their interaction partners, the interaction
partners of disease-related genes are also important candidates for
further disease causal studies. The STRING (Search Tool for the
Retrieval of Interacting Genes) database is an online resource that
provides PPI information from both predictions and
experimental observations [20].
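As an illustration of how known PPIs can be used to propose candidate genes, the following minimal sketch collects the genes lying on shortest paths between pairs of seed genes in a weighted interaction graph. It is a sketch under stated assumptions, not the pipeline used in this study: the edge weight `1000 - score` (so that high-confidence interactions count as short edges), the function names, and the gene labels in the usage note are all hypothetical.

```python
import heapq
from itertools import combinations


def build_graph(edges):
    """Build an undirected weighted graph from (gene_a, gene_b, score) triples.

    Assumption: `score` is a confidence score on a 0-1000 scale and the
    edge weight is 1000 - score, so confident interactions are "short".
    """
    graph = {}
    for a, b, score in edges:
        weight = 1000 - score
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the node list of one shortest path or None."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbour, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                prev[neighbour] = node
                heapq.heappush(heap, (nd, neighbour))
    if target not in dist:
        return None  # target unreachable from source
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def genes_on_shortest_paths(graph, seeds):
    """Collect intermediate genes on shortest paths between all seed pairs."""
    found = set()
    for a, b in combinations(seeds, 2):
        path = shortest_path(graph, a, b)
        if path:
            found.update(path[1:-1])  # drop the two endpoints
    return found - set(seeds)
```

For example, with the hypothetical edges `[("A", "B", 900), ("B", "C", 900), ("A", "C", 100), ("C", "D", 950)]` and seeds `["A", "D"]`, the path A-B-C-D is preferred over A-C-D because the direct A-C interaction has low confidence, so B and C are returned as candidates.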
Here, we present a comprehensive informatics study based on
transcriptional profiling of three different groups of HIV patients:
rapid progressors (RPs), elite controllers (ECs) and viremic
non-progressors (VNPs). We attempted to i) identify a gene set
that classifies the three groups well, by using mRMR feature
selection; ii) provide candidate causal genes for further
experimental studies, by using shortest-path analysis of the above
Materials and Methods
Gene expression profiling dataset of HIV patients
The dataset came from a study of HIV infection by
Rotger et al. [12]. In total, 78 chips were used in that study. We
adopted the data generated from CD4+ T cells, which comprise 40
microarrays (8 elite controllers (ECs), 27 rapid progressors (RPs), 5
viremic non-progressors (VNPs)). Using this dataset alone, Rotger
et al. did not observe any differentially expressed genes. The data
were downloaded from the NCBI Gene Expression Omnibus (GEO)
under accession number GSE28128. The expression profiles
were generated using the Illumina HumanWG-6 v3.0
expression BeadChip. Bead summary data were output from
Illumina's BeadStudio software without background correction.
Genes declared as non-expressed (P > 0.01) were excluded from
further analysis. Data preprocessing, including quantile
normalization and log2 transformation, was completed in the Partek
Genomics Suite package (Partek Inc.).
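The preprocessing steps named above (detection filtering, quantile normalization, log2 transformation) can be illustrated with a minimal sketch. The study itself used Partek Genomics Suite, so this is not that software's implementation. The function names are hypothetical, the rule that a gene is kept if its detection P-value is at most 0.01 in at least one sample is an assumption (the text does not say how the filter was applied across samples), and the order of the steps simply follows the order in which they are listed.

```python
import math


def drop_non_expressed(matrix, detection_p, threshold=0.01):
    """Keep genes detected in at least one sample (assumed rule).

    matrix and detection_p are lists of rows (genes) by columns (samples).
    Returns the filtered matrix and the indices of the genes that were kept.
    """
    keep = [g for g, pvals in enumerate(detection_p) if min(pvals) <= threshold]
    return [matrix[g] for g in keep], keep


def quantile_normalize(matrix):
    """Force every sample (column) onto the same empirical distribution.

    The reference distribution is the mean, across samples, of the values
    at each rank. Ties are broken by row order, a simplification.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    reference = [
        sum(col[rank] for col in sorted_cols) / n_samples
        for rank in range(n_genes)
    ]
    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            normalized[g][s] = reference[rank]
    return normalized


def log2_transform(matrix):
    """Apply log2 to every (strictly positive) intensity."""
    return [[math.log2(value) for value in row] for row in matrix]
```

A typical call sequence under these assumptions would be `kept, idx = drop_non_expressed(raw, pvals)` followed by `log2_transform(quantile_normalize(kept))`.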
Minimum redundancy maximum relevance algorithm
The minimum redundancy maximum relevance algorithm for
selecting features (genes) was developed to
balance the features' relevance to the target (phenotype) against the
redundancy between features [18]. Both relevance and redundancy are
quantified using mutual information (MI). In this study, mRMR
was implemented using the R package mRMRe [21], in which MI is
estimated as,
$$I(x, y) = -\frac{1}{2} \ln\left(1 - \rho(x, y)^{2}\right)$$
where I and ρ represent the MI and the correlation coefficient
between variables x and y, respectively.
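To make the relevance/redundancy trade-off concrete, here is a minimal sketch in Python (the study itself used the R package mRMRe). It assumes the correlation-based MI estimate I(x, y) = -1/2 ln(1 - ρ(x, y)^2), with Pearson correlation standing in for ρ, a numerically coded phenotype, and the classic greedy criterion of relevance minus mean redundancy with the features already selected. The function names and the small clamp on ρ² are illustrative choices, not part of mRMRe.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mutual_information(x, y):
    """MI estimate from correlation: -0.5 * ln(1 - rho^2)."""
    rho = pearson(x, y)
    # Clamp so that perfectly correlated inputs give a large finite value
    # instead of log(0); this safeguard is an illustrative assumption.
    rho_sq = min(rho * rho, 1.0 - 1e-12)
    return -0.5 * math.log(1.0 - rho_sq)


def mrmr_select(features, target, k):
    """Greedy mRMR: pick k feature names from a {name: values} dict.

    The first pick maximizes relevance to the target; each later pick
    maximizes relevance minus mean redundancy with the selected set.
    """
    relevance = {
        name: mutual_information(values, target)
        for name, values in features.items()
    }
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, None
        for name, values in features.items():
            if name in selected:
                continue
            redundancy = sum(
                mutual_information(values, features[s]) for s in selected
            ) / len(selected)
            score = relevance[name] - redundancy
            if best_score is None or score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected
```

In a toy case where two features are both strongly related to the phenotype but nearly duplicate each other, the second pick goes to a less relevant but non-redundant feature, which is the behaviour the criterion is designed to produce.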
Let y and X = {x1, ..., xn} be the output variable (phenotype) and the
set of input features (genes), respectively. Given xi as the feature
with highest MI with the output variable, so the (...truncated)