Automatic section segmentation of medical reports. (pdf)


Automatic Section Segmentation of Medical Reports

Paul S. Cho, Ph.D.,1 Ricky K. Taira, Ph.D.,2 and Hooshang Kangarloo, M.D.2
1 Department of Radiation Oncology, University of Washington, Seattle, WA
2 Medical Informatics Group, University of California, Los Angeles, CA

Abstract

Automated segmentation of medical reports can significantly enhance the productivity of healthcare departments. While many algorithms have been developed for document summarization, passage retrieval, and story segmentation of news feeds, much less effort has been devoted to the parsing of medical documents. We present an algorithm specifically developed for medical applications. The algorithm consists of two components. First, a rule-based algorithm is used to detect the sections that contain labels. It utilizes a knowledge base of commonly employed heading labels and linguistic cues seen within training examples. The second part of the algorithm handles the detection of unlabeled sections. It uses a combination of lexical pattern recognition and a classifier based on an expectation model for a particular class of medical reports. The proposed method was evaluated on three test corpora containing a total of 129,303 report sections. The detection rates for the individual corpora ranged from 97.4% to 99.4% for labeled sections and from 96.5% to 99.0% for unlabeled sections. The rule-based approach is particularly effective for medical reports because of the inherently structured nature of these documents.

INTRODUCTION

Segmentation of medical reports into topically cohesive sections is an essential task in patient information gathering and dissemination. Medical document retrieval systems can improve their indexing by knowing which parts of the report are relevant for specific types of queries. Clinical workstations could provide more elegant means of visualizing long and/or numerous medical reports for a patient if section breaks are known.
Often, specific users are interested only in a subset of the text fields within a report. For example, a clinician may wish to see only the diagnostic conclusion section of a pathology or radiology report. An administrator may be interested only in the study description and reason for request sections. A coding system that uses natural language processing would benefit greatly from knowing which sections contain subjective (e.g., “Chief Complaint”) versus objective (e.g., “Findings”) patient descriptions.

An automated section extractor would also be useful in the rapid generation of medical reports. Static data such as personal identification, history of illness, familial information, etc. are usually repeated in serial reports. An intelligent reporting system would automatically create a template for an existing patient with the static data already in place. Such a system would allow physicians to dictate only the new information.

Medical reports generated today are rarely formatted such that structural boundaries are known to a computer program. One reason is that a requirement for manual tagging of section boundaries would reduce transcription throughput. It would also require that a consensus be developed to match the target sections implied by the dictating physician. The problem is the same for speech recognition systems, which would require the physician to dictate the specific section names and would require that these section types be known to the speech recognition system. These steps could again slow throughput and disturb the concentration of the dictating physician.

A plethora of algorithms has been proposed for computerized text segmentation. Skorochod’ko examined the degree of word overlap among sentences to determine lexical connectivity [1]. Likewise, Halliday and Hasan utilized vocabulary similarity measures [2].
Morris and Hirst advanced the theory of lexical coherence and developed a thesaurus-based method to form lexical chains from which texts were structured [3]. Kozima proposed a semantic network to compute lexical cohesiveness between words [4]. Reynar introduced a graphical technique called dotplotting that detected topic boundaries by observing word repetition [5]. Hearst developed the TextTiling algorithm, which utilizes patterns of lexical co-occurrence and distribution to detect changes in subtopics [6,7]. The use of cue words to detect section transitions has also been explored by some investigators [8,9]. For an updated bibliography of work from the past decade, see Pevzner and Hearst [10]. Text segmentation algorithms have been applied to passage retrieval [11], automated summarization [12], genre detection [13], and story segmentation of news feeds [14]. However, none of the previously published methods was developed specifically for

AMIA 2003 Symposium Proceedings − Page 155

Presently there is no universal standard or format for written medical reports in the U.S. While there are similarities, each institution, department, and individual physician has a unique policy and style of reporting. After examining a large number of reports from multiple institutions, it was decided that a supervised learning approach, with its ability to adapt to local features, would be most suitable for the task at hand.

During dictation it is customary for the physician to preface each section of the report with an appropriate heading such as “history”, “procedure”, “findings”, etc. These cue words are subsequently detected by the transcriptionist. The report structure intended by the author is then encoded into the written document by the insertion of section labels. Most commonly, the section labels are written in upper-case characters followed by a colon. The section headings, however, are occasionally omitted by the dictating physician or inadvertently missed by the transcriber.
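The labeling convention described above (an upper-case label followed by a colon) lends itself to a simple pattern match. The following is a minimal sketch of that idea only, not the authors' rule-based algorithm: the `KNOWN_LABELS` set stands in for a knowledge base of heading labels and is hypothetical, as are the function name and the simplifying assumption that every label begins a line.

```python
import re

# Hypothetical stand-in for a knowledge base of heading labels;
# the labels actually used in the paper are not reproduced here.
KNOWN_LABELS = {"HISTORY", "PROCEDURE", "FINDINGS", "IMPRESSION"}

# An upper-case label at the start of a line, followed by a colon.
# Assumption: labels begin a line, which fails for wrap-around typing.
HEADING_RE = re.compile(
    r"^[ \t]*([A-Z][A-Z /]*[A-Z]|[A-Z])[ \t]*:", re.MULTILINE
)


def find_labeled_sections(report):
    """Return (label, body) pairs for each known 'LABEL:' heading."""
    # Keep only the candidate headings whose label is in the knowledge base.
    matches = [m for m in HEADING_RE.finditer(report)
               if m.group(1) in KNOWN_LABELS]
    sections = []
    for i, m in enumerate(matches):
        # A section body runs up to the next known heading or the end of text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        sections.append((m.group(1), report[m.end():end].strip()))
    return sections


report = (
    "HISTORY: Cough for two weeks.\n"
    "FINDINGS: The lungs are clear.\n"
    "IMPRESSION: Normal chest."
)
for label, body in find_labeled_sections(report):
    print(label, "->", body)
```

In this sketch, an upper-case label that is not in `KNOWN_LABELS` is ignored, so its text is absorbed into the preceding section. Sections with no label at all are not detected, which is the case the second component of the algorithm is meant to handle.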
Another clue to a section boundary is provided by the transcriptionist, who may insert paragraph breaks between sections. However, some transcriptionists favor a faster wrap-around typing style with no hard carriage returns anywhere in the document.

The structure of the report may also be apparent from the document category, which is often included in the header. For example, a report may be an inpatient note, a discharge summary, an operation report, a procedure note, an outpatient consultation, or a letter. Depending on the category, there are expected sections such as “Interval Events”, “Hospital Course”, “Discharge Diagnosis”, “Anesthesia”, and “Requesting Physician”.

Individual idiosyncrasy is another clue. Some physicians like to open a certain section with a certain phrase.

Features that characterize section boundaries as described above are extracted from a set of training examples according to report type (defined at the level of department). The quality and quantity of the training samples are of utmost importance. Erroneous examples introduce noise into the data used for modeling. Obtaining sufficient training examples for the complete spectrum of patterns (feature space) seen for a parti (...truncated)