Automatic Section Segmentation of Medical Reports
Paul S. Cho, Ph.D.,1 Ricky K. Taira, Ph.D.,2 and Hooshang Kangarloo, M.D.2
1 Department of Radiation Oncology, University of Washington, Seattle, WA
2 Medical Informatics Group, University of California, Los Angeles, CA
Abstract
Automated segmentation of medical reports can significantly enhance the productivity of healthcare
departments. While many algorithms have been developed for document summarization, passage retrieval, and story segmentation of news feeds, much
less effort has been devoted to parsing of medical
documents. We present an algorithm specifically developed for medical applications. The algorithm consists of two components. First, a rule-based algorithm is used to detect the sections that contain labels. It utilizes a knowledge base of commonly employed heading labels and linguistic cues seen within
training examples. The second part of the algorithm
handles the detection of unlabeled sections. It uses a
combination of lexical pattern recognition and a
classifier based on an expectation model for a particular class of medical reports. The proposed
method was evaluated on three test corpora containing a total of 129,303 report sections. The detection
rates for labeled and unlabeled sections in the individual corpora ranged from 97.4% to 99.4% and from
96.5% to 99.0%, respectively. The rule-based approach is particularly effective for medical reports
due to the inherently structured nature of these documents.
INTRODUCTION
Segmentation of medical reports into topically cohesive sections is an essential task in patient information gathering and dissemination. Medical document
retrieval systems can improve their indexing by
knowing which parts of the report are relevant for
specific types of queries. Clinical workstations could
provide more elegant means of visualizing long
and/or numerous medical reports for a patient if section breaks are known. Often, specific users are interested only in a subset of the text fields within a report. For example, a clinician may wish to see only the
diagnostic conclusion section of a pathology or radiology report. An administrator may be interested only in the study description and reason for request
sections. A coding system that uses natural language
processing would benefit greatly by knowing which
sections contain subjective (e.g., “Chief Complaint”)
versus objective (e.g., “Findings”) patient descriptions. An automated section extractor would also be
useful in the rapid generation of medical reports. Static
data such as personal identification, history of illness,
familial information, etc. are usually repeated in serial reports. An intelligent reporting system would
automatically create a template for an existing patient
with the static data already in place. Such a system
would allow physicians to dictate only the new information.
Medical reports generated today are rarely formatted such that structural boundaries are known to a
computer program. One reason is that the requirement for manual tagging of section boundaries would
reduce transcription throughput. It would also require
that a consensus be developed to match the target
sections implied by the dictating physician. The problem is the same for speech recognition systems,
which would require the physician to dictate the specific section names and the system to
know these section types. These
steps could again slow throughput and disturb
the concentration of the dictating physician.
A plethora of algorithms has been proposed for
computerized text segmentation. Skorochod’ko examined the degree of word overlap among sentences to determine lexical connectivity [1]. Likewise, Halliday and Hasan utilized vocabulary similarity measures [2]. Morris and Hirst advanced the
theory of lexical coherence and developed a thesaurus-based method to form lexical chains from which
texts were structured [3]. Kozima proposed a semantic network to compute lexical cohesiveness between
words [4]. Reynar introduced a graphical technique
called dotplotting that detected topic boundaries by
observing word repetition [5]. Hearst developed the
TextTiling algorithm, which utilizes patterns of lexical
co-occurrence and distribution to detect changes in
subtopics [6,7]. The use of cue words to detect section transitions has also been explored by some investigators [8,9]. For an updated bibliography of work from the
past decade, see Pevzner and Hearst [10].
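The word-overlap idea shared by several of the methods above can be illustrated with a minimal sketch. It is not a reimplementation of any cited algorithm: it simply proposes a boundary wherever the bag-of-words cosine similarity between two adjacent sentences falls below a threshold. The function names, the tokenizer, and the threshold value of 0.1 are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lower-case a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bags of words (Counter objects)."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_boundaries(sentences, threshold=0.1):
    """Return each index i where a boundary is proposed between
    sentence i-1 and sentence i (hypothetical threshold, not from the paper)."""
    bags = [Counter(tokenize(s)) for s in sentences]
    return [
        i
        for i in range(1, len(bags))
        if cosine_similarity(bags[i - 1], bags[i]) < threshold
    ]


sentences = [
    "The chest radiograph shows clear lungs.",
    "The lungs are clear on the chest radiograph.",
    "Patient denies fever and cough.",
]
print(candidate_boundaries(sentences))
```

This sketch omits stop-word removal, stemming, and smoothing over multi-sentence blocks, all of which a practical lexical-cohesion method would likely need.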
Text segmentation algorithms have been applied
to passage retrieval [11], automated summarization
[12], genre detection [13], and story segmentation of
news feeds [14]. However, none of the previously
published methods was developed specifically for
AMIA 2003 Symposium Proceedings − Page 155
Presently there is no universal standard or format for
written medical reports in the U.S. While there are
similarities, each institution, department, and individual physician has a unique policy and style of reporting. After examining a large number of reports
from multiple institutions, it was decided that a supervised learning approach with its ability to adapt to
local features would be most suitable for the task at
hand.
During dictation it is customary for the physician
to preface each section of the report with an appropriate heading such as “history”, “procedure”, “findings”, etc. Subsequently, these cue words are detected by the transcriptionist. The report structure intended by the author is then encoded into the written
document by the insertion of section labels. Most commonly, the section labels are written in upper-case
characters followed by a colon. The section headings,
however, are occasionally omitted by the dictating
physician or inadvertently missed by the transcriber.
Another clue to section boundaries is provided by the
transcriptionist, who may insert paragraph breaks between sections. However, some favor a faster wrap-around
typing style without inserting hard carriage
returns throughout the document. The structure of the
report may also be apparent from the document category, which is often included in the header. For example, a report may be an inpatient note, a discharge
summary, an operation report, a procedure note, an
outpatient consultation, or a letter. Depending on the
category there are expected sections such as “Interval
Events”, “Hospital Course”, “Discharge Diagnosis”,
“Anesthesia”, and “Requesting Physician”. Individual idiosyncrasy is another clue: some physicians may
like to open a certain section with a particular phrase.
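The most common labeling convention described above, an upper-case label followed by a colon, lends itself to a simple pattern match. The following is a minimal sketch under that assumption alone; the regular expression, the function name, and the sample report are hypothetical and do not reproduce the knowledge base of heading labels used in this work.

```python
import re

# Hypothetical pattern: one or more upper-case words at the start of a line,
# optionally separated by spaces, "/", "&" or "-", followed by a colon.
HEADING_RE = re.compile(r"^\s*([A-Z][A-Z /&-]*[A-Z]|[A-Z])\s*:")


def find_labeled_sections(report_text):
    """Return (line_index, label) pairs for lines that begin with an
    upper-case label followed by a colon."""
    hits = []
    for i, line in enumerate(report_text.splitlines()):
        match = HEADING_RE.match(line)
        if match:
            hits.append((i, match.group(1)))
    return hits


report = (
    "HISTORY: Cough for two weeks.\n"
    "No fever reported.\n"
    "FINDINGS: The lungs are clear.\n"
    "Note: follow up in 6 months.\n"
    "HOSPITAL COURSE: Uneventful."
)
print(find_labeled_sections(report))
```

A pattern of this kind misses mixed-case labels and sections whose headings were omitted, which is why the remaining clues (paragraph breaks, document category, and author-specific phrasing) are needed.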
Features that characterize section boundaries as
described above are extracted from a set of training
examples according to report type (defined at the
level of department). The quality and quantity of training
samples are of utmost importance. If examples are
erroneous, they introduce noise into the data used for
modeling. Obtaining sufficient training examples for
the complete spectrum of patterns (feature space)
seen for a parti (...truncated)