A genome-wide MeSH-based literature mining system predicts implicit gene-to-gene relationships and networks
Xiang et al. BMC Systems Biology 2013, 7(Suppl 3):S9
http://www.biomedcentral.com/1752-0509/7/S3/S9
RESEARCH
Open Access
Zuoshuang Xiang1,2,3,4, Tingting Qin5, Zhaohui S Qin6,7, Yongqun He1,2,3,4*
From Asia Pacific Bioinformatics Network (APBioNet) Twelfth International Conference on Bioinformatics
(InCoB2013)
Taicang, China. 20-22 September 2013
Abstract
Background: The large amount of literature in the post-genomics era enables the study of gene interactions and
networks using all available articles published for a specific organism. MeSH is a controlled vocabulary of medical and
scientific terms that is used by biomedical scientists to manually index articles in the PubMed literature database.
We hypothesized that genome-wide gene-MeSH term associations from the PubMed literature database could be used
to predict implicit gene-to-gene relationships and networks. While the gene-MeSH associations have been used to
detect gene-gene interactions in some studies, different methods have not been well compared, and such a strategy has
not been evaluated for a genome-wide literature analysis. Genome-wide literature mining of gene-to-gene interactions
allows ranking of the best gene interactions and investigation of comprehensive biological networks at a genome level.
Results: The genome-wide GenoMesh literature mining algorithm was developed by sequentially generating a
gene-article matrix, a normalized gene-MeSH term matrix, and a gene-gene matrix. The gene-gene matrix relies on
the calculation of pairwise gene dissimilarities based on gene-MeSH relationships. An optimized dissimilarity score
was identified from six well-studied functions based on a receiver operating characteristic (ROC) analysis. In
studies of the well-studied Escherichia coli and the less-studied Brucella spp., GenoMesh was found to accurately
identify gene functions using weighted MeSH terms, predict gene-gene interactions not reported in the literature,
and cluster all the genes studied from an organism using the MeSH-based gene-gene matrix. A web-based
GenoMesh literature mining program is also available at: http://genomesh.hegroup.org. GenoMesh also predicts
gene interactions and networks among genes associated with specific MeSH terms or user-selected gene lists.
Conclusions: The GenoMesh algorithm and web program provide the first genome-wide, MeSH-based literature
mining system that effectively predicts implicit gene-gene interaction relationships and networks on a genome-wide scale.
* Correspondence:
1 Unit for Laboratory Animal Medicine, University of Michigan, Ann Arbor, MI, USA
Full list of author information is available at the end of the article
© 2013 Xiang et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Background
Biological systems are complex and involve various
interactions and pathways among genes and gene products. To understand the underlying
mechanism(s), it is essential to explore and define the complex relationships
among genes in a genome. Many types
of relationships exist, such as physical interactions
between two proteins and regulatory interactions
between multiple genes. Such gene-to-gene relationships
can be found in the biomedical literature. The bibliographic database MEDLINE, which can be queried through
PubMed [1], contains over 20 million references to journal articles in the life sciences. Between 2,000 and 4,000 new
entries are added daily. Each indexed article in MEDLINE is summarized in the form of manually curated
Medical Subject Headings (MeSH) terms [2]. MeSH is a
controlled vocabulary of medical and scientific terms for
indexing articles in the PubMed literature database. The
2013 MeSH contains 26,853 MeSH descriptors organized in a hierarchical fashion based on 16 top-level categories. Over 213,000 MeSH entry terms also exist to
assist in finding the most appropriate MeSH headings
[3]. All the MeSH terms are assigned to individual
PubMed articles manually by knowledgeable biomedical
scientists. The terminology used in MeSH provides a
unique and consistent approach to retrieving information
from articles that use different terminologies to describe similar biological and/or medical concepts.
Several approaches have been used to explore the
gene-to-gene relationships and pathways reported in the
literature. A common and direct strategy is to check
gene co-occurrence [4,5]. Two genes may be related if
they are listed in the same publication, particularly if
listed in the same title, abstract, or sentence. For example, the PubGene system extracts gene relationships
based on co-occurrence of gene symbols in MEDLINE
titles and abstracts [5]. The PubGene co-occurrence networks display possible relationships between terms and
facilitate medical literature retrieval for relevant articles
implied by the network display. However, one limitation
of this method is its inability to reveal direct unknown
relationships among genes. Another strategy for identifying related gene pairs from the literature is to infer gene
relatedness based on a common linkage to keywords. Classifications and relatedness from the co-occurrence matrix
of gene names by key terms (e.g. MeSH or Gene Ontology
terms) can be used to identify related gene pairs that have
not been described in the title(s) or abstract(s) of any publication. This approach may be used to study co-citation
and non-co-citation relationships. For instance, Masys
et al. [6] developed the HAPI system to compare sets of
genes associated with medical conditions based on the
(gene names × MeSH terms) matrix. Similar methods
include ARROWSMITH [7], MeSHmap [8], PubMatrix
[9], and vector space modeling [10,11]. The ability to predict indirect associations among biological entities is a key
feature in the linking of gene names to key terms [12,13].
However, the MeSH-based indirect approaches to infer
gene-gene interactions have not been used previously for a
genome-wide literature analysis. Furthermore, different
methods have not been well compared. Genome-wide
literature mining of gene-to-gene interactions allows
ranking of the best gene interactions and investigation of
comprehensive biological networks at a genome level.
The advantages of a genome-wide approach in gene network
analysis have been demonstrated by numerous high-throughput
microarray experiments and data modeling [14].
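The contrast between direct co-occurrence and keyword-linked relatedness described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy corpus, the gene names, the MeSH terms, and the choice of cosine dissimilarity, which is not necessarily the score used by any of the cited systems. The sketch builds a (gene × MeSH term) count profile for each gene and ranks gene pairs that are never named in the same article by how similar their profiles are.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy corpus: article ID -> (genes mentioned, MeSH terms assigned).
ARTICLES = {
    "pmid1": ({"geneA", "geneB"}, {"Virulence", "Macrophages"}),
    "pmid2": ({"geneA"}, {"Virulence", "Iron"}),
    "pmid3": ({"geneC"}, {"Virulence", "Iron", "Macrophages"}),
    "pmid4": ({"geneD"}, {"DNA Repair"}),
}


def cooccurring_pairs(articles):
    """Gene pairs named together in at least one article (direct co-citation)."""
    pairs = set()
    for genes, _ in articles.values():
        pairs.update(frozenset(p) for p in combinations(sorted(genes), 2))
    return pairs


def gene_mesh_profiles(articles):
    """For each gene, count how many of its articles carry each MeSH term."""
    profiles = {}
    for genes, terms in articles.values():
        for gene in genes:
            counts = profiles.setdefault(gene, {})
            for term in terms:
                counts[term] = counts.get(term, 0) + 1
    return profiles


def cosine_dissimilarity(p, q):
    """1 - cosine similarity between two sparse term-count vectors."""
    dot = sum(v * q.get(t, 0) for t, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / norm if norm else 1.0


def rank_indirect_pairs(articles):
    """Rank gene pairs that are never co-cited by MeSH-profile dissimilarity."""
    direct = cooccurring_pairs(articles)
    profiles = gene_mesh_profiles(articles)
    ranked = []
    for a, b in combinations(sorted(profiles), 2):
        if frozenset((a, b)) not in direct:
            ranked.append((cosine_dissimilarity(profiles[a], profiles[b]), a, b))
    return sorted(ranked)  # smallest dissimilarity (most related) first


if __name__ == "__main__":
    for score, a, b in rank_indirect_pairs(ARTICLES):
        print(f"{a} - {b}: {score:.3f}")
```

In this toy corpus geneA and geneC never share an article, yet they rank as the closest pair because their MeSH profiles overlap heavily. A pure co-occurrence method would report only the geneA-geneB pair.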
Recently, a genome-level literature mining method has
been developed by Tsoi et al. [15] to characterize
human genes by Gene Ontology (GO) terms [16], i.e.,
the Ontology Fingerprint. The Ontolo (...truncated)