Text categorization models for retrieval of high quality articles in internal medicine.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1480096/pdf/

Y. Aphinyanaphongs, M.S., C.F. Aliferis, M.D., Ph.D.
Department of Biomedical Informatics, Vanderbilt University, Nashville, TN

Abstract

The discipline of Evidence Based Medicine (EBM) studies formal and quasi-formal methods for identifying high quality medical information and abstracting it in useful forms so that patients receive the best customized care possible [1]. Current computer-based methods for finding high quality information in PubMed and similar bibliographic resources utilize search tools that employ preconstructed Boolean queries. These clinical queries are derived from a combined application of (a) user interviews, (b) ad-hoc manual document quality review, and (c) search over a constrained space of disjunctive Boolean queries. The present research explores the use of powerful text categorization (machine learning) methods to identify content-specific and high-quality PubMed articles. Our results show that models built with the proposed approach outperform the Boolean-based PubMed clinical query filters in discriminatory power.

Introduction

Evidence Based Medicine (EBM) is the clinical application of high-quality medical information. The application of EBM involves 3 distinct steps [2]: (1) identification of high-quality evidence that pertains to a specific clinical question, (2) evaluation and synthesis of this evidence, and (3) application of the evidence to the problem. This paper addresses the question of how to identify high quality evidence.

One pioneering method to identify high quality articles is the use of the "clinical query filter" (CQF) for PubMed article retrieval. Introduced by Haynes et al., the method involves the creation of sets of Boolean terms which are used to filter and identify high quality articles in pre-specified content areas.
These filters have been shown to have good performance [3] and are featured in the clinical queries link in PubMed [4]. This method requires manual selection of terms and relies on a brute-force learning approach using a non-standard and fairly restrictive classifier (term disjunctions of 4 to 5 terms).

The motivation of the present paper is to contribute to the practice of EBM by exploring methods to automatically construct quality and content filters for article retrieval. We hypothesize that, using powerful text categorization techniques and a suitable article collection for training, we can construct filters superior to the existing ones. Toward these goals, and as a first step, we explore computer models to retrieve high-quality, treatment-related articles in internal medicine.

Methods

1. Definitions

At the core of our efforts lies the selection of a rigorous quality and content gold standard as well as the creation of a document collection that captures this gold standard. Ideally, this gold standard should be easy to obtain for large numbers of documents. For these reasons, we chose to use the selections of the editors and reviewers of the ACP Journal Club as our gold standard [5].

The ACP Journal Club is a highly rated metapublication. It includes no original research articles. Instead, every month experts review the best journals in internal medicine and select the best articles according to specific selection criteria in the article class areas of treatment, diagnosis, etiology, prognosis, quality improvement, clinical prediction guide, and economics. Selected articles are further subdivided into articles that are summarized and abstracted by the ACP because of their clinical importance, and those that are only cited because they meet all the selection criteria but may not pertain to vitally important clinical areas. (In the present paper, the abstracted or cited articles are denoted as ACP+; all other articles, neither abstracted nor cited, are denoted as ACP-.)
Every article is subjected to rigorous review for inclusion. For example, in the article class area of treatment, the basic criteria are random allocation of participants to comparison groups, 80% follow-up of those entering the study, and an outcome of known or probable clinical importance [5].

For our first experiments, we chose the treatment class area. The ACP journal cites and abstracts this area the most, and a larger proportion of clinical questions are treatment-related [6]. In the discussion section we discuss extensions to all categories.

2. Corpus Construction

We downloaded from PubMed all original articles with abstracts from the journals reviewed by the ACP in the publication period of July 1998 through August 1999. Two conditions motivated this period of time. First, one year provides a large sample for the treatment category. Second, selecting a period of several years before the start of the present study gave ample time for original articles to be reviewed by the ACP. The ACP journal typically takes several months to review and republish an article. Thus, to ensure that no ACP+ articles were missed, the ACP journal was reviewed from the beginning of the publication period, July 1998, to nearly 1.5 years after the end of the publication period, December 2000.

AMIA 2003 Symposium Proceedings − Page 31

We identified 49 journals appearing in the review lists of the tables of contents from the first ACP journal in July 1998 to the last ACP journal in December 2000. This set of journals is thus guaranteed to be the complete set of journals reviewed by the ACP.

All original articles were automatically downloaded with custom Python scripts using the limit option of the PubMed search interface. Each search was limited to the title of 1 of the 49 journals and set to retrieve only articles published during the publication period. The “only items with abstracts” checkbox was marked to ensure that letters and other content were not included in the results.
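
The per-journal search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original scripts: it builds request URLs for the NCBI E-utilities `esearch` endpoint as a programmatic stand-in for the PubMed web form limits, and the function names, the two example journal titles, and the `retmax` value are hypothetical. No network request is made.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities esearch endpoint stands in for the PubMed
# web search form that the original scripts drove.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_journal_query(journal, start="1998/07", end="1999/08"):
    """Build a search term limited to one journal, the publication
    period, and records that have an abstract."""
    return (
        f'"{journal}"[Journal] '
        f'AND ("{start}"[dp] : "{end}"[dp]) '
        f"AND hasabstract"
    )


def build_search_url(journal, retmax=10000):
    """Return the esearch URL for one journal (hypothetical helper)."""
    params = {
        "db": "pubmed",
        "term": build_journal_query(journal),
        "retmax": retmax,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical stand-ins for the 49 journal titles, which are not listed here.
journals = ["Annals of Internal Medicine", "JAMA"]
urls = [build_search_url(j) for j in journals]

for url in urls:
    print(url)
```

Issuing one query per journal, rather than a single combined query, mirrors the procedure in the text and keeps each result set traceable to its journal.
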
These articles were downloaded in XML format. A custom-built XML parser extracted PubmedID, title, abstract, publication type, and MeSH terms. All article information was stored in a relational database (MySQL) [7].

Reviewing the ACP between July 1998 and December 2000 identified the high quality articles in the publication period of July 1998 to August 1999. Due to the unavailability of complete electronic versions of the ACP for these periods, all tables of contents and cited article lists were scanned on an HP Scanjet C9850A and digitized using ABBYY FineReader Pro optical character recognition (OCR) software [8]. OCR errors were manually identified and corrected. ACP articles were automatically matched with the titles of articles in the MySQL database and marked in the corpus. In addition, each article was marked with the article class to which it belongs.

3. Corpus Preparation For Analysis

The corpus was divided into positive and negative classes. The positive class was composed of 396 ACP+ articles in the treatment class. The negative class had 15407 articles in total, comprising ACP- articles and ACP+ articles not in the treatment class. 20% of (...truncated)