Text categorization models for retrieval of high quality articles in internal medicine.
Y. Aphinyanaphongs, M.S., C.F. Aliferis, M.D., Ph.D.
Department of Biomedical Informatics, Vanderbilt University, Nashville, TN
Abstract
The discipline of Evidence Based Medicine
(EBM) studies formal and quasi-formal methods for
identifying high quality medical information and
abstracting it in useful forms so that patients receive
the best customized care possible [1]. Current
computer-based methods for finding high quality
information in PubMed and similar bibliographic
resources utilize search tools that employ
preconstructed Boolean queries. These clinical
queries are derived from a combined application of
(a) user interviews, (b) ad-hoc manual document
quality review, and (c) search over a constrained
space of disjunctive Boolean queries. The present
research explores the use of powerful text
categorization (machine learning) methods to identify
content-specific and high-quality PubMed articles.
Our results show that models built with the proposed
approach outperform the Boolean based PubMed
clinical query filters in discriminatory power.
Introduction
Evidence Based Medicine (EBM) is the clinical
application of high-quality medical information. The
application of EBM involves 3 distinct steps [2]: (1)
the identification of high-quality evidence that
pertains to a specific clinical question, (2) evaluation
and synthesis of this evidence, and (3) application of
the evidence to the problem. This paper addresses
the question of how to identify high quality evidence.
One pioneering method to identify high quality
articles is the use of the "clinical query filter" (CQF)
for PubMed article retrieval. Introduced by Haynes
et al., the method involves the creation of Boolean set
terms that are used to filter and identify high
quality articles in pre-specified content areas. These
filters have been shown to have good performance [3]
and are featured in the clinical queries link in
PubMed [4]. This method requires manual selection
of terms and relies on a brute-force learning approach
using a non-standard and fairly restrictive classifier
(term disjunctions of 4 to 5 terms).
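As a minimal sketch of what a disjunctive Boolean filter of this kind computes, the Python below retrieves an article when any one of a small set of terms appears in its text. The terms, the function name, and the two sample texts are illustrative assumptions; they are not the terms used by the PubMed clinical query filters.

```python
import re

# Hypothetical filter terms chosen for illustration only; these are NOT
# the terms used by the PubMed clinical query filters.
EXAMPLE_TERMS = [
    "randomized controlled trial",
    "placebo",
    "double-blind",
    "random allocation",
]

def disjunctive_filter(text, terms=EXAMPLE_TERMS):
    """Return True if at least one term occurs in the text (logical OR)."""
    normalized = re.sub(r"\s+", " ", text.lower())
    return any(term in normalized for term in terms)

# Made-up article texts keyed by made-up identifiers.
articles = {
    "a1": "A double-blind, placebo-controlled study of drug X in hypertension.",
    "a2": "A case report of an unusual presentation of pneumonia.",
}
retrieved = [pmid for pmid, text in articles.items() if disjunctive_filter(text)]
```

A filter of this form has only one degree of freedom, the choice of terms, which is why the text above describes it as a fairly restrictive classifier.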
The motivation of the present paper is to
contribute to the practice of EBM by exploring
methods to automatically construct quality and
content filters for article retrieval. We hypothesize
that using powerful text categorization techniques
and a suitable article collection for training, we can
construct filters superior to the existing ones. Toward
these goals, and as a first step, we explore computer
models to retrieve high-quality, treatment-related
articles in internal medicine.
Methods
1. Definitions
At the core of our efforts lies the selection of a
rigorous quality and content gold standard as well as
the creation of a document collection that captures
this gold standard. Ideally this gold standard should
be easy to obtain for large numbers of documents.
For these reasons, we chose to use the selections of
the editors and reviewers of the ACP journal club as
our gold standard [5].
The ACP journal club is a highly rated meta-publication.
It includes no original research articles.
Instead, every month experts review the best journals
in internal medicine and select the best articles
according to specific selection criteria in the article
class areas of: treatment, diagnosis, etiology,
prognosis, quality improvement, clinical prediction
guide, and economics. Selected articles are further
subdivided into articles that are summarized and
abstracted by the ACP because of their clinical
importance, and those that are only cited because
they meet all the selection criteria but may not pertain
to vitally important clinical areas. (In the present
paper, the abstracted or cited articles are denoted as
ACP+; all other articles not abstracted or cited as
ACP-.) Every article is subjected to rigorous review
for inclusion. For example, in the article class area of
treatment, the basic criteria are random allocation
of participants to comparison groups, 80% follow-up
of those entering the study, and an outcome of
known or probable clinical importance [5].
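The labeling convention above can be sketched in a few lines of Python. This is an illustration only: in the ACP journal club the criteria are applied by expert reviewers, not by code. The class name, field names, and the reading of "80% follow-up" as a minimum threshold are assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass
class ReviewedArticle:
    pmid: str                 # made-up identifier for the example
    abstracted: bool = False  # summarized and abstracted by the ACP journal club
    cited: bool = False       # cited only, without a summary

def acp_label(article):
    """ACP+ if the article was abstracted or cited; ACP- otherwise."""
    return "ACP+" if (article.abstracted or article.cited) else "ACP-"

def meets_basic_treatment_criteria(random_allocation, follow_up_fraction,
                                   clinically_important_outcome):
    # Assumption: "80% follow-up" is read here as a minimum of 0.80.
    return bool(random_allocation
                and follow_up_fraction >= 0.80
                and clinically_important_outcome)

labels = [acp_label(a) for a in (
    ReviewedArticle("p1", abstracted=True),
    ReviewedArticle("p2", cited=True),
    ReviewedArticle("p3"),
)]
```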
For our first experiments, we chose the treatment
class area. The ACP journal cites and abstracts this
area the most, and a larger proportion of clinical
questions are treatment-related [6]. The discussion
section considers extensions to all categories.
2. Corpus Construction
We downloaded from PubMed all original articles
with abstracts from the journals reviewed by the ACP
in the publication period of July 1998 through August
1999. Two conditions motivated this period of time.
First, one year provides a large sample for the
AMIA 2003 Symposium Proceedings − Page 31
treatment category. Second, selecting a period of
several years before the start of the present study
gave ample time for original articles to be reviewed
by the ACP. The ACP journal typically takes several
months to review and republish an article. Thus, to
ensure that no ACP+ articles are missed, the ACP
journal was reviewed from the beginning of the
publication period, July 1998 to nearly 1.5 years after
the end of the publication period, December 2000.
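The two time windows can be checked with simple month arithmetic. The sketch below uses only the months stated in the text; representing each date as a (year, month) pair and the helper names are assumptions made for illustration.

```python
# (year, month) pairs taken from the text.
PUB_START, PUB_END = (1998, 7), (1999, 8)        # publication period
REVIEW_START, REVIEW_END = (1998, 7), (2000, 12)  # ACP issues reviewed

def months_between(start, end):
    """Number of months from start to end (end exclusive)."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1])

def in_window(year_month, start, end):
    """True if year_month lies within [start, end], inclusive."""
    return start <= year_month <= end

# Months counted inclusively from July 1998 through August 1999.
publication_span_months = months_between(PUB_START, PUB_END) + 1

# Gap between the end of the publication period and the last ACP issue reviewed.
review_lag_months = months_between(PUB_END, REVIEW_END)
```

Under these definitions the review lag is 16 months, which corresponds to the "nearly 1.5 years" stated above.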
We identified 49 journals appearing in the review
lists of the table of contents of the first ACP journal
in July 1998 to the last ACP journal in December
2000. This set of journals thus is guaranteed to be
the complete set of journals reviewed by the ACP.
All original articles were automatically
downloaded with custom Python scripts using the
limit option of the PubMed search interface. Each
search was limited to the title of 1 of the 49 journals
and set to only retrieve articles during the publication
period. The “only items with abstracts” checkbox
was marked to ensure that letters and other content
were not included in the results. These articles were
downloaded in XML format. A custom built XML
parser extracted PubmedID, title, abstract, publication
type, and MeSH terms. All article information was
stored in a relational database (MySQL) [7].
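A hedged sketch of the extraction step follows. The original scripts are not reproduced in this paper, so everything here is an assumption: the element names follow the present-day PubMed XML layout (which may differ from the format available at the time of the study), the sample record is invented, and the standard-library `sqlite3` module stands in for MySQL so that the example is self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented sample record using present-day PubMed XML element names.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <ArticleTitle>Example trial of drug X.</ArticleTitle>
        <Abstract><AbstractText>Patients were randomly allocated.</AbstractText></Abstract>
        <PublicationTypeList>
          <PublicationType>Journal Article</PublicationType>
        </PublicationTypeList>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Hypertension</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

def parse_articles(xml_text):
    """Extract PubMed ID, title, abstract, publication types, and MeSH terms."""
    root = ET.fromstring(xml_text)
    records = []
    for art in root.iter("PubmedArticle"):
        records.append({
            "pmid": art.findtext(".//PMID", default=""),
            "title": art.findtext(".//ArticleTitle", default=""),
            # An abstract may be split into several AbstractText sections.
            "abstract": " ".join(
                "".join(node.itertext()) for node in art.iter("AbstractText")
            ),
            "pub_types": [pt.text for pt in art.iter("PublicationType")],
            "mesh_terms": [d.text for d in art.iter("DescriptorName")],
        })
    return records

def store_articles(records, conn):
    """Write records to a single table; list fields are joined with '; '."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT, "
        "pub_types TEXT, mesh_terms TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?, ?)",
        [(r["pmid"], r["title"], r["abstract"],
          "; ".join(r["pub_types"]), "; ".join(r["mesh_terms"]))
         for r in records],
    )
    conn.commit()

records = parse_articles(SAMPLE_XML)
conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL database
store_articles(records, conn)
```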
Reviewing the ACP between July 1998 and
December 2000 identified the high quality articles in
the publication period of July 1998 to August 1999.
Due to the unavailability of complete electronic
versions of the ACP for these periods, all tables of
contents and cited article lists were scanned on an HP
Scanjet C9850A and digitized using ABBYY
FineReader Pro optical character recognition (OCR)
software [8]. OCR errors were manually identified
and corrected. ACP articles were automatically
matched with the titles of articles in the MySQL
database and marked in the corpus. In addition, each
article was labeled with the article class to which it belongs.
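The paper does not specify how titles were matched. One plausible approach, sketched below purely as an assumption, is to normalize case, punctuation, and whitespace on both sides and then look for exact matches; this tolerates small formatting differences left over from OCR but not misspelled words. The function names and sample titles are hypothetical.

```python
import re

def normalize_title(title):
    """Lowercase, replace runs of non-alphanumeric characters with one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def match_titles(acp_titles, corpus):
    """Map each ACP title to the ID of the corpus article with the same
    normalized title, or to None when no article matches.

    corpus: dict of article ID -> title.
    """
    index = {normalize_title(t): pmid for pmid, t in corpus.items()}
    return {t: index.get(normalize_title(t)) for t in acp_titles}

# Invented titles and identifiers for illustration.
corpus = {
    "111": "Aspirin for Primary Prevention: A Randomized Trial.",
    "222": "Statins and stroke",
}
acp_titles = [
    "Aspirin for primary prevention - a randomized trial",
    "A title with no counterpart in the corpus",
]
matches = match_titles(acp_titles, corpus)
```

Titles that map to `None` would still need manual review, since exact matching on normalized text cannot recover from uncorrected OCR errors inside words.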
3. Corpus Preparation For Analysis
The corpus was divided into positive and negative
classes. The positive class was composed of 396 ACP+
articles in the treatment class. The negative class had
15407 total ACP- articles and ACP+ articles not in
the treatment class. 20% of (...truncated)