Finding related sentence pairs in MEDLINE

Information Retrieval Journal, Jan 2010

We explore the feasibility of automatically identifying sentences in different MEDLINE abstracts that are related in meaning. We compared traditional vector space models with machine learning methods for detecting relatedness, and found that machine learning was superior. The Huber method, a variant of Support Vector Machines which minimizes the modified Huber loss function, achieves 73% precision when the score cutoff is set high enough to identify about one related sentence per abstract on average. We illustrate how an abstract viewed in PubMed might be modified to present the related sentences found in other abstracts by this automatic procedure.

A PDF file should load here. If you do not see its contents the file may be temporarily unavailable at the journal website or you do not have a PDF plug-in installed and enabled in your browser.

Alternatively, you can download the file locally and open with any standalone PDF reader:

https://link.springer.com/content/pdf/10.1007%2Fs10791-010-9126-8.pdf

Finding related sentence pairs in MEDLINE

Larry H. Smith 0 W. John Wilbur 0 0 L. H. Smith (&) W. J. Wilbur Computational Biology Branch, National Center for Biotechnology Information , Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA We explore the feasibility of automatically identifying sentences in different MEDLINE abstracts that are related in meaning. We compared traditional vector space models with machine learning methods for detecting relatedness, and found that machine learning was superior. The Huber method, a variant of Support Vector Machines which minimizes the modified Huber loss function, achieves 73% precision when the score cutoff is set high enough to identify about one related sentence per abstract on average. We illustrate how an abstract viewed in PubMed might be modified to present the related sentences found in other abstracts by this automatic procedure. Search engines respond to a user query by producing a list of documents from a given collection, ordering the list according to the user's supposed information need. However, even the most relevant documents will contain some portions of greater interest to the user, and other portions of little or no interest. This may explain why for example, when querying over full text collections, retrieval performance can be improved by segmenting documents into sections or paragraphs, and matching or retrieving passages rather than full documents (Hearst and Plaunt 1993; Lin 2009), although mixed results have been reported for matching queries at the sentence level (Ko et al. 2002; Lu et al. 2009; Salton and Buckley 1991). PubMed1 is the search engine for articles in MEDLINE maintained at the National Library of Medicine (Sayers et al. 2009; Wilbur 2005). When a user selects an article to view, a list of related articles may also appear alongside its other details. Related articles 1 http://www.ncbi.nlm.nih.gov/pubmed/. - are pre-computed using a topic-based model that measures the size of overlapping subject matter of two articles (Lin and Wilbur 2007). Although the related articles feature is popular (20% of user sessions involve viewing a related article) it assumes that users are primarily interested in articles that have maximal overlapping subject matter. The goal of this paper is to explore an alternative method of finding related content, to address the needs of users interested in a particular sentence in an article by finding other related sentences. We assume users would be interested in other occurrences of the same sentence, a restatement of the sentence, or any sentence that makes a closely related assertion. We used vector space models to estimate the relatedness of sentences. In addition to the tf-idf formula suggested by (Salton and Buckley 1991), we also adapted several well known retrieval functions, such as the Dice coefficient, cosine similarity, and bm25. But fixed formulas give only one possibility for term weights in a vector space model, and theoretically it should be possible to use machine learning to find optimal term weights. Machine learning has been applied in an analogous setting of information retrieval. The goal of learning to rank is to use machine learning to obtain retrieval scoring functions for optimal ranking of query results (Joachims et al. 2007). That research has been limited in the past by the availability of test data, and much of the effort has been focused on effective learning algorithms to meet the unique challenges. The focus may shift now that the LETOR corpus has emerged as a community benchmark dataset (Liu et al. 2007), and with methods for automated annotation derived from user clickthrough data (Joachims 2002). As with learning to rank, the biggest challenge to machine learning of related sentences is the availability of a usable corpus. To our knowledge, no datasets of related sentences have been discussed in the research literature. We claim that a productive corpus must be large enough to contain many examples of related sentences on a variety of different topics, and that manual annotation of such a large corpus of sentence pairs is not feasible. Fortunately, there is an ideal solution to this problem. Firstly, the MEDLINE database contains a large number of sentences (available in article abstracts) in many different subject areas. And secondly, sentences are likely to be related if they are adjacent sentences from the same MEDLINE abstract, and unrelated if they are from different randomly selected abstracts. Thus it is possible to automatically assemble a very large training corpus of sentence pairs from MEDLINE. Our ability to detect the relatedness of two sentences is dependent on their sharing words or parts of words. We do not use a thesaurus or dictionary. We have found words and portions of words to be the most useful features in our approach. Such features are used in our application of the standard information retrieval formulas that we test, as well as in our machine learning. In a minor depart (...truncated)


This is a preview of a remote PDF: https://link.springer.com/content/pdf/10.1007%2Fs10791-010-9126-8.pdf

Larry H. Smith, W. John Wilbur. Finding related sentence pairs in MEDLINE, Information Retrieval Journal, 2010, pp. 601-617, Volume 13, Issue 6, DOI: 10.1007/s10791-010-9126-8