Finding related sentence pairs in MEDLINE
Larry H. Smith · W. John Wilbur

L. H. Smith · W. J. Wilbur
Computational Biology Branch, National Center for Biotechnology Information, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
We explore the feasibility of automatically identifying sentences in different MEDLINE abstracts that are related in meaning. We compared traditional vector space models with machine learning methods for detecting relatedness, and found that machine learning was superior. The Huber method, a variant of Support Vector Machines that minimizes the modified Huber loss function, achieves 73% precision when the score cutoff is set high enough to identify about one related sentence per abstract on average. We illustrate how an abstract viewed in PubMed might be modified to present the related sentences found in other abstracts by this automatic procedure.

Search engines respond to a user query by producing a list of documents from a given collection, ordering the list according to the user's supposed information need. However, even the most relevant documents will contain some portions of greater interest to the user, and other portions of little or no interest. This may explain why, for example, when querying over full-text collections, retrieval performance can be improved by segmenting documents into sections or paragraphs, and matching or retrieving passages rather than full documents (Hearst and Plaunt 1993; Lin 2009), although mixed results have been reported for matching queries at the sentence level (Ko et al. 2002; Lu et al. 2009; Salton and Buckley 1991).

PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) is the search engine for articles in MEDLINE maintained at the National Library of Medicine (Sayers et al. 2009; Wilbur 2005). When a user selects an article to view, a list of related articles may also appear alongside its other details. Related articles
are pre-computed using a topic-based model that measures the size of overlapping subject
matter of two articles (Lin and Wilbur 2007). Although the related articles feature is
popular (20% of user sessions involve viewing a related article), it assumes that users are
primarily interested in articles that have maximal overlapping subject matter. The goal of
this paper is to explore an alternative method of finding related content, to address the
needs of users interested in a particular sentence in an article by finding other related
sentences. We assume users would be interested in other occurrences of the same sentence,
a restatement of the sentence, or any sentence that makes a closely related assertion.
We used vector space models to estimate the relatedness of sentences. In addition to
the tf-idf formula suggested by Salton and Buckley (1991), we also adapted several
well-known retrieval functions, such as the Dice coefficient, cosine similarity, and
BM25. But fixed formulas give only one possibility for term weights in a vector space
model, and theoretically it should be possible to use machine learning to find optimal
term weights.
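As an illustration of the kinds of fixed-formula scores involved, the sketch below computes a tf-idf weighted cosine similarity and the Dice coefficient for a pair of tokenized sentences. The function names, the simple tokenization, and the document-frequency table are our own assumptions for this example, not the paper's implementation.

```python
import math
from collections import Counter

def tf_idf_vector(sentence_tokens, doc_freq, n_docs):
    """Build a tf-idf weighted term vector for one sentence.

    `doc_freq` maps each term to its document frequency and `n_docs` is the
    collection size; both are assumed to come from a background collection
    such as MEDLINE (hypothetical inputs for this sketch).
    """
    tf = Counter(sentence_tokens)
    return {t: f * math.log(n_docs / doc_freq.get(t, 1)) for t, f in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (dicts)."""
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    norm = (math.sqrt(sum(x * x for x in u.values())) *
            math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def dice(u, v):
    """Dice coefficient on the sets of terms appearing in each sentence."""
    a, b = set(u), set(v)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0
```

A learned scoring function replaces the fixed idf-style weights above with per-term weights chosen to optimize a training objective.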
Machine learning has been applied in an analogous setting of information retrieval. The
goal of learning to rank is to use machine learning to obtain retrieval scoring functions for
optimal ranking of query results (Joachims et al. 2007). That research has been limited in
the past by the availability of test data, and much of the effort has been focused on
effective learning algorithms to meet the unique challenges. The focus may shift now that
the LETOR corpus has emerged as a community benchmark dataset (Liu et al. 2007), and
with methods for automated annotation derived from user clickthrough data (Joachims
2002).
As with learning to rank, the biggest challenge to machine learning of related sentences
is the availability of a usable corpus. To our knowledge, no datasets of related sentences
have been discussed in the research literature. We claim that a productive corpus must be
large enough to contain many examples of related sentences on a variety of different
topics, and that manual annotation of such a large corpus of sentence pairs is not feasible.
Fortunately, there is an ideal solution to this problem. First, the MEDLINE database
contains a large number of sentences (available in article abstracts) in many different
subject areas. Second, two sentences are likely to be related if they are adjacent
sentences from the same MEDLINE abstract, and unrelated if they are drawn from different,
randomly selected abstracts. Thus it is possible to automatically assemble a very large training
corpus of sentence pairs from MEDLINE.
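The corpus-assembly idea above can be sketched as follows. The input format (a list of abstracts, each given as a list of sentences), the function name, and the negative-sampling details are assumptions made for illustration, not the authors' exact procedure.

```python
import random

def make_training_pairs(abstracts, n_negatives, seed=0):
    """Assemble a corpus of labeled sentence pairs from abstracts.

    Adjacent sentences within the same abstract are labeled related (1);
    sentences drawn from two different, randomly chosen abstracts are
    labeled unrelated (0).
    """
    rng = random.Random(seed)
    pairs = []
    # Positive examples: every adjacent pair within each abstract.
    for sents in abstracts:
        for s1, s2 in zip(sents, sents[1:]):
            pairs.append((s1, s2, 1))
    # Negative examples: one sentence from each of two distinct abstracts.
    for _ in range(n_negatives):
        a, b = rng.sample(range(len(abstracts)), 2)
        pairs.append((rng.choice(abstracts[a]), rng.choice(abstracts[b]), 0))
    return pairs
```

Because both steps are fully automatic, the size of the resulting corpus is limited only by the number of abstracts processed.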
Our ability to detect the relatedness of two sentences depends on their sharing
words or parts of words. We do not use a thesaurus or dictionary. We have found words
and portions of words to be the most useful features in our approach. Such features are used
in our application of the standard information retrieval formulas that we test, as well as in
our machine learning. In a minor depart (...truncated)