Using cited references to improve the retrieval of related biomedical documents

BMC Bioinformatics, Mar 2013

Background: A popular query from scientists reading a biomedical abstract is to search for topic-related documents in bibliographic databases. Such a query is challenging because the amount of information attached to a single abstract is small, whereas classification-based retrieval algorithms are optimally trained with large sets of relevant documents. As a solution to this problem, we propose a query expansion method that extends the information related to a manuscript using its cited references.

Results: Data on cited references and text sections in 249,108 full-text biomedical articles were extracted from the Open Access subset of the PubMed Central® database (PMC-OA). Of the five standard sections of a scientific article, the Introduction and Discussion sections contained most of the citations (mean = 10.2 and 9.9 citations, respectively). A large proportion of articles (98.4%) and their cited references (79.5%) were indexed in the PubMed® database. Using the MedlineRanker abstract classification tool, cited references allowed accurate retrieval of the citing document in a test set of 10,000 documents, and also of documents related to six biomedical topics defined by particular MeSH® terms from the entire PMC-OA (p-value < 0.01). Classification performance was sensitive to the topic and also to the text sections from which the references were selected. Classifiers trained on the baseline (i.e., only text from the query document and not from the references) were outperformed in almost all cases. The best performance was often obtained when using all cited references, though using the references from the Introduction and Discussion sections led to similarly good results. This query expansion method performed significantly better than pseudo relevance feedback in 4 out of 6 topics.

Conclusions: The retrieval of documents related to a single document can be significantly improved by using the references cited by this document (p-value < 0.01). Using references from the Introduction and Discussion performs almost as well as using all references, which might be useful for methods that require reduced datasets due to computational limitations. Cited references from particular sections might not be appropriate for all topics. Our method could be a better alternative to pseudo relevance feedback, though it is limited by full-text availability.
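To make the query expansion idea concrete, the following is a minimal sketch of the approach described in the abstract: the training set for a single query document is expanded with the abstracts of its cited references, and candidate abstracts are ranked by a word-based classifier. This is not the authors' MedlineRanker pipeline; the use of scikit-learn, TF-IDF features with naive Bayes, and all function and variable names are assumptions made for illustration only.

```python
# Illustrative sketch (not the paper's MedlineRanker pipeline): expand the
# positive training set for one query document with the abstracts of its
# cited references, then rank candidate documents with a word-based classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def rank_related(query_abstract, cited_ref_abstracts,
                 background_abstracts, candidate_abstracts):
    """Rank candidates by similarity to the query document expanded
    with the abstracts of its cited references."""
    # Positive class: query document plus cited references (the expansion step).
    # Negative class: random background abstracts.
    positives = [query_abstract] + list(cited_ref_abstracts)
    negatives = list(background_abstracts)

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(positives + negatives)
    y = [1] * len(positives) + [0] * len(negatives)

    clf = MultinomialNB().fit(X, y)

    # Score candidates by their probability of belonging to the positive class.
    scores = clf.predict_proba(vectorizer.transform(candidate_abstracts))[:, 1]
    return sorted(zip(candidate_abstracts, scores), key=lambda p: -p[1])
```

Restricting cited_ref_abstracts to references taken from particular sections (e.g. Introduction and Discussion only) corresponds to the section-wise comparisons reported in the Results.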



Francisco M Ortuño (2), Ignacio Rojas (2), Miguel A Andrade-Navarro (1), Jean-Fred Fontaine (1)

(1) Computational Biology and Data Mining, Max Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, 13125 Berlin, Germany
(2) Computer Architecture and Computer Technology Department, University of Granada, C/ Periodista Daniel Saucedo Aranda S/N, 18071 Granada, Spain

Keywords: Information retrieval; Text categorization; Citations; Full-text documents; Biomedical literature; Query expansion; Document classification

Background

Retrieving information from the biomedical literature involves the identification and analysis of documents from the millions indexed in public databases such as PubMed [1]. The size of this widely used database has a negative impact on the relevance of users' query results; simple free-text queries would return many false positives. Additionally, when reading a document of interest, users can query for related documents. Query expansion or reformulation is used to improve the retrieval of documents relevant to a free-text query or related to a document of interest. Various query expansion or reformulation strategies have been proposed in the biomedical or genomics field [2-5]. A user's free-text query defining the need for some information can be enriched with common synonyms or morphological variants from existing or automatically generated thesauruses; terms can be weighted and spelling errors corrected. By default in PubMed, free-text queries are reformulated with Medical Subject Headings (MeSH) terms. The MeSH thesaurus is a biomedical controlled vocabulary used for manual indexing and searching of PubMed. Relevance feedback methods involve the user in selecting relevant documents from the results of an initial query in order to reformulate it, whereas pseudo relevance feedback (PRF) methods consider the top documents returned by the initial query as relevant in order to reformulate the query, avoiding additional user interaction [6].

Alternatively, content similarity algorithms are used to compare biomedical documents. When applied to the freely available abstracts in PubMed, such algorithms use words as well as other features available in indexed abstracts (e.g., author list, journal title, and MeSH terms) or features processed by specific algorithms (e.g., part of speech, semantic processing) [7-12]. However, when a single document is used as input (as for the PubMed Related Articles (PMRA) algorithm used to display a list of related documents in PubMed [13]), its abstract might not have enough content to allow proper retrieval. Using the full text offers one possibility for expanding the information related to one document, and is increasingly used as more full-text manuscripts become available from large resources such as the PubMed Central (PMC) database and its Open Access subset (PMC-OA) [4,14]. Another possibility is given by the references associated with the article by citation: either the documents it cites or the documents citing it. For a given scientific document, finding the cited references is straightforward since they are usually listed in a dedicated section. In contrast, finding the documents that cite it requires mining all existing scientific documents, which might be impractical.

Related references by citation have already been used for document classification. For example, it was shown that algorithms based on shared references or citations can outperform text-based algorithms in a digital library of computer science papers [15]. Papers were compared using three bibliometric similarity measures: co-citation (based on the number of citing documents in common) [16], bibliographic coupling (based on the nu (...truncated)
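The two bibliometric similarity measures named just above can be illustrated with a short sketch. The raw overlap counts shown here are only one possible formulation; the cited studies may apply further normalization (e.g. cosine or Jaccard similarity), and the identifiers used in the example are hypothetical placeholders.

```python
# Illustrative sketches of co-citation and bibliographic coupling as defined
# above. Raw overlap counts are used; real systems may normalize them further.

def co_citation(citing_a, citing_b):
    """Co-citation: how many documents cite both A and B."""
    return len(set(citing_a) & set(citing_b))


def bibliographic_coupling(refs_a, refs_b):
    """Bibliographic coupling: how many references A and B both cite."""
    return len(set(refs_a) & set(refs_b))


# Example: two articles sharing two cited references (hypothetical identifiers).
refs_a = {"pmid_001", "pmid_002", "pmid_003"}
refs_b = {"pmid_002", "pmid_003", "pmid_004"}
print(bibliographic_coupling(refs_a, refs_b))  # -> 2
```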


This is a preview of a remote PDF: http://www.biomedcentral.com/content/pdf/1471-2105-14-113.pdf

Francisco M Ortuño, Ignacio Rojas, Miguel A Andrade-Navarro, Jean-Fred Fontaine. Using cited references to improve the retrieval of related biomedical documents. BMC Bioinformatics, 2013, 14:113. DOI: 10.1186/1471-2105-14-113