A genome-wide MeSH-based literature mining system predicts implicit gene-to-gene relationships and networks

Oct 2013

Background: The large amount of literature in the post-genomics era enables the study of gene interactions and networks using all available articles published for a specific organism. MeSH is a controlled vocabulary of medical and scientific terms that is used by biomedical scientists to manually index articles in the PubMed literature database. We hypothesized that genome-wide gene-MeSH term associations from the PubMed literature database could be used to predict implicit gene-to-gene relationships and networks. While the gene-MeSH associations have been used to detect gene-gene interactions in some studies, different methods have not been well compared, and such a strategy has not been evaluated for a genome-wide literature analysis. Genome-wide literature mining of gene-to-gene interactions allows ranking of the best gene interactions and investigation of comprehensive biological networks at a genome level.

Results: The genome-wide GenoMesh literature mining algorithm was developed by sequentially generating a gene-article matrix, a normalized gene-MeSH term matrix, and a gene-gene matrix. The gene-gene matrix relies on the calculation of pairwise gene dissimilarities based on gene-MeSH relationships. An optimized dissimilarity score was identified from six well-studied functions based on a receiver operating characteristic (ROC) analysis. Based on the studies with well-studied Escherichia coli and less-studied Brucella spp., GenoMesh was found to accurately identify gene functions using weighted MeSH terms, predict gene-gene interactions not reported in the literature, and cluster all the genes studied from an organism using the MeSH-based gene-gene matrix. A web-based GenoMesh literature mining program is also available at: http://genomesh.hegroup.org. GenoMesh also predicts gene interactions and networks among genes associated with specific MeSH terms or user-selected gene lists.

Conclusions: The GenoMesh algorithm and web program provide the first genome-wide, MeSH-based literature mining system that effectively predicts implicit gene-gene interaction relationships and networks in a genome-wide scope.
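
The three-matrix sequence described in the abstract (gene-article, normalized gene-MeSH, gene-gene) can be illustrated with a small Python sketch. This is not the GenoMesh implementation: the abstract does not state the normalization or which of the six dissimilarity functions was selected, so the sketch assumes L2 row normalization and cosine dissimilarity purely for illustration. The gene names, article identifiers, and the helper names `indicator_matrix` and `gene_gene_dissimilarity` are hypothetical.

```python
import numpy as np


def indicator_matrix(row_to_cols, rows, cols):
    """Build a 0/1 matrix with M[i, j] = 1 when cols[j] is listed for rows[i]."""
    col_index = {c: j for j, c in enumerate(cols)}
    M = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for c in row_to_cols[r]:
            M[i, col_index[c]] = 1.0
    return M


def gene_gene_dissimilarity(gene_articles, article_mesh):
    """Gene-article -> normalized gene-MeSH -> gene-gene dissimilarity matrix."""
    genes = sorted(gene_articles)
    articles = sorted(article_mesh)
    terms = sorted({t for ts in article_mesh.values() for t in ts})

    # Step 1: gene-article matrix (which articles mention each gene).
    GA = indicator_matrix(gene_articles, genes, articles)

    # Step 2: gene-MeSH counts via the article-MeSH indexing, then
    # L2 row normalization (an assumption, not the paper's stated scheme).
    AM = indicator_matrix(article_mesh, articles, terms)
    GM = GA @ AM
    norms = np.linalg.norm(GM, axis=1, keepdims=True)
    GM_norm = GM / np.where(norms == 0, 1.0, norms)

    # Step 3: pairwise gene dissimilarity (cosine dissimilarity assumed).
    D = 1.0 - GM_norm @ GM_norm.T
    return genes, GA, D


# Toy data (hypothetical): geneA and geneB never share an article,
# but both are indexed under overlapping MeSH terms.
gene_articles = {
    "geneA": ["pmid1", "pmid2"],
    "geneB": ["pmid3"],
    "geneC": ["pmid4"],
}
article_mesh = {
    "pmid1": ["Iron", "Virulence"],
    "pmid2": ["Virulence"],
    "pmid3": ["Iron", "Virulence"],
    "pmid4": ["DNA Repair"],
}

genes, GA, D = gene_gene_dissimilarity(gene_articles, article_mesh)
print(genes)
print(np.round(D, 3))
```

In this toy input geneA and geneB are never co-cited, yet their MeSH profiles overlap, so their dissimilarity is far lower than that of either gene with geneC. This is the kind of implicit relationship that a co-occurrence search alone would not surface. Swapping in a different dissimilarity function only requires changing the last step, which is where a ROC-based comparison of candidate functions would plug in.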

Xiang et al. BMC Systems Biology 2013, 7(Suppl 3):S9
http://www.biomedcentral.com/1752-0509/7/S3/S9

RESEARCH Open Access

Zuoshuang Xiang1,2,3,4, Tingting Qin5, Zhaohui S Qin6,7, Yongqun He1,2,3,4*

From Asia Pacific Bioinformatics Network (APBioNet) Twelfth International Conference on Bioinformatics (InCoB2013), Taicang, China. 20-22 September 2013

* Correspondence: 1 Unit for Laboratory Animal Medicine, University of Michigan, Ann Arbor, MI, USA. Full list of author information is available at the end of the article.

© 2013 Xiang et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Biological systems are complex and involve various interactions and pathways among genes and gene products. To understand the underlying mechanisms, it is essential to explore and define the complex relationships among genes in a genome. Many types of relationships exist, such as physical interactions between two proteins and regulatory interactions between multiple genes. Such gene-to-gene relationships can be found in the biomedical literature.

The bibliographic database MEDLINE, which can be queried through PubMed [1], contains over 20 million references to journal articles in the life sciences. Between 2,000 and 4,000 new entries are added daily. Each indexed article in MEDLINE is summarized in the form of manually curated Medical Subject Headings (MeSH) terms [2]. MeSH is a controlled vocabulary of medical and scientific terms for indexing articles in the PubMed literature database. The 2013 MeSH contains 26,853 MeSH descriptors organized in a hierarchical fashion based on 16 top-level categories. Over 213,000 MeSH entry terms also exist to assist in finding the most appropriate MeSH Headings [3]. All the MeSH terms are assigned to individual PubMed articles manually by knowledgeable biomedical scientists. The terminology used in MeSH provides a unique and consistent approach to retrieve information that uses different terminologies to describe similar biological and/or medical concepts.

Several approaches have been used to explore the gene-to-gene relationships and pathways reported in the literature. A common and direct strategy is to check gene co-occurrence [4,5]. Two genes may be related if they are listed in the same publication, particularly if listed in the same title, abstract, or sentence. For example, the PubGene system extracts gene relationships based on co-occurrence of gene symbols in MEDLINE titles and abstracts [5]. The PubGene co-occurrence networks display possible relationships between terms and facilitate medical literature retrieval for relevant articles implied by the network display. However, one limitation of this method is its inability to reveal direct unknown relationships among genes.

Another strategy for identifying related gene pairs from the literature is to infer gene relatedness based on a common linkage to keywords. Classifications and relatedness from the co-occurrence matrix of gene names by key terms (e.g. MeSH or Gene Ontology terms) can be used to identify related gene pairs that have not been described in the title(s) or abstract(s) of any publication. This approach may be used to study co-citation and non-co-citation relationships. For instance, Masys et al. [6] developed a HAPI system to compare sets of genes associated with medical conditions based on the (gene names × MeSH terms) matrix. Similar methods include ARROWSMITH [7], MeSHmap [8], PubMatrix [9], and vector space modeling [10,11]. The ability to predict indirect associations among biological entities is a key feature in the linking of gene names to key terms [12,13]. However, the MeSH-based indirect approaches to infer gene-gene interactions have not been used previously for a genome-wide literature analysis. Furthermore, different methods have not been well compared.

Genome-wide literature mining of gene-to-gene interactions allows ranking of the best gene interactions and investigation of comprehensive biological networks at a genome level. Advantages of a genome-wide approach in gene network analysis have been proven by numerous high-throughput microarray experiments and data modeling [14]. Recently, a genome-level literature mining method has been developed by Tsoi et al. [15] to characterize human genes by Gene Ontology (GO) terms [16], i.e., the Ontology Fingerprint. The Ontolo (...truncated)


This is a preview of a remote PDF: https://bmcsystbiol.biomedcentral.com/track/pdf/10.1186/1752-0509-7-S3-S9
Article home page: https://bmcsystbiol.biomedcentral.com/articles/10.1186/1752-0509-7-S3-S9

Zuoshuang Xiang, Tingting Qin, Zhaohui S Qin, Yongqun He. A genome-wide MeSH-based literature mining system predicts implicit gene-to-gene relationships and networks. BMC Systems Biology, 2013, Volume 7, Suppl 3, S9. DOI: 10.1186/1752-0509-7-S3-S9