Domain adaptation for semantic role labeling of clinical text
Zhang Y, et al. J Am Med Inform Assoc 2015;22:967–979. doi:10.1093/jamia/ocu048, Research and Applications
RECEIVED 15 April 2014
REVISED 5 December 2014
ACCEPTED 15 December 2014
PUBLISHED ONLINE FIRST 10 June 2015
Yaoyun Zhang1,*, Buzhou Tang1,2,*, Min Jiang1, Jingqi Wang1, Hua Xu1
ABSTRACT
Keywords: semantic role labeling, shallow semantic parsing, clinical natural language processing, domain adaptation, transfer learning
INTRODUCTION
BACKGROUND
Natural language processing (NLP) technologies are important for
unlocking information embedded in narrative reports in electronic
health record systems. Although various NLP systems have been developed to support a wide range of computerized medical applications, such as biosurveillance and clinical decision support,
extracting semantically meaningful information from clinical text remains a challenge. Semantic role labeling (SRL)1 (also known as
shallow semantic parsing),2 which extracts semantic relations between predicates and their arguments from different surface textual
forms, is an important method for the extraction of semantic information. State-of-the-art SRL systems have been developed and applied to information extraction in open domains and various
biomedical subdomains.3–12 However, very few SRL studies have
been conducted in the clinical domain,13,14 probably due to the lack
of large-scale annotated corpora. The creation of such clinical SRL
corpora would be both time-consuming and expensive.13
In this study, we approach SRL on clinical narratives as a domain
adaptation problem. The goal is to adapt the existing SRL corpora of
newswire text15,16 and biomedical literature17 to the clinical domain.
By transferring knowledge from existing corpora in other domains to
the clinical domain, we aim to improve the performance of clinical SRL
and reduce the cost of developing one de novo. We used three existing
SRL corpora outside the clinical domain and evaluated three state-of-the-art domain adaptation algorithms on the task of SRL for clinical
text. Our results showed that domain adaptation strategies were effective for improving the performance or reducing the annotation cost of
SRL on clinical text. To the best of our knowledge, this is the first work
that has introduced domain adaptation algorithms for clinical SRL.
The task of SRL is to label semantic relations in a sentence as predicate argument structures (PASs) to represent propositions.18 The definition of PAS originated from the predicate logic for proposition
representation in semantics theory.2 There is a large body of work on
extracting semantic relations in biomedical text.4–12,19–23 Many are
based on the sublanguage theory by Harris,24 which describes the
properties of language in closed domains. Typically, in a closed domain such as medicine, there are a limited number of primary semantic types and a set of constraints that can determine how different
semantic types of the arguments can be linked to form semantic predications.25 Linguistic String Project (LSP)21 and Medical Language
Extraction and Encoding System (MedLEE),22 which use sublanguage
grammar, are two early NLP systems for the extraction of semantic relations in the medical domain. SemRep is another biomedical semantic relation extraction system, which extracts semantic predications
defined in the Unified Medical Language System Semantic Network
from biomedical literature.19,20 Recently, Cohen et al.26 examined the
syntactic alternations in the argument structure of domain-specific
verbs and associated nominalizations in the PennBioIE corpus, and
found that even in a semantically restricted domain, syntactic variations are common and diverse. Currently, many sublanguage-based
clinical NLP systems recognize semantic relations24 through manually
extracted patterns using rule-based methods.22,27 SRL, however, focuses on unifying variations in the surface syntactic forms of semantic
relations based on annotated corpora. It is inspired by previous research into semantic frames28,29 and the link between semantic roles
and syntactic realization.30 Although current SRL approaches are primarily developed in open domains (thus, types of semantic roles or
Correspondence to Hua Xu, Ph.D., University of Texas School of Biomedical Informatics at Houston, 7000 Fannin St., Suite 870, Houston, TX 77030, USA; Tel: 713-500-3924
© The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For numbered affiliations see end of article.
Objective Semantic role labeling (SRL), which extracts a shallow semantic relation representation from different surface textual forms of free text
sentences, is important for understanding natural language. Few studies in SRL have been conducted in the medical domain, primarily due to the lack
of annotated clinical SRL corpora, which are time-consuming and costly to build. The goal of this study is to investigate domain adaptation techniques for clinical SRL leveraging resources built from newswire and biomedical literature to improve performance and save annotation costs.
Materials and Methods Multisource Integrated Platform for Answering Clinical Questions (MiPACQ), a manually annotated SRL clinical corpus, was
used as the target domain dataset. PropBank and NomBank from newswire and BioProp from biomedical literature were used as source domain
datasets. Three state-of-the-art domain adaptation algorithms were employed: instance pruning, transfer self-training, and feature augmentation.
The SRL performance using different domain adaptation algorithms was evaluated by using 10-fold cross-validation on the MiPACQ corpus.
Learning curves for the different methods were generated to assess the effect of sample size.
Results and Conclusion When all three source domain corpora were used, the feature augmentation algorithm achieved a statistically significantly
higher F-measure (83.18%) than the baseline with the MiPACQ dataset alone (F-measure, 81.53%), indicating that domain adaptation algorithms may improve SRL performance on clinical text. To achieve performance comparable to that of the baseline method that used 90% of MiPACQ
training samples, the feature augmentation algorithm required <50% of training samples in MiPACQ, demonstrating that annotation costs of clinical SRL can be reduced significantly by leveraging existing SRL resources from other domains.
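This excerpt names the feature augmentation algorithm but does not define it. One widely used formulation of feature augmentation for domain adaptation copies each feature into a shared version and a domain-specific version, so that a single classifier can learn both general and domain-specific weights. The following is a minimal sketch under that assumption, not necessarily the exact procedure used in this study; the function name `augment`, the feature names, and the domain labels are hypothetical and chosen only for illustration.

```python
from typing import Dict


def augment(features: Dict[str, float], domain: str) -> Dict[str, float]:
    """Return a shared copy and a domain-specific copy of every feature."""
    augmented: Dict[str, float] = {}
    for name, value in features.items():
        augmented["general:" + name] = value    # shared across all domains
        augmented[domain + ":" + name] = value  # visible only within `domain`
    return augmented


# Hypothetical feature dictionaries for one argument candidate per domain.
source_example = augment({"predicate=treat": 1.0, "pos=NN": 1.0}, "propbank")
target_example = augment({"predicate=treat": 1.0}, "mipacq")

# Only the shared copies overlap between the two domains.
shared = set(source_example) & set(target_example)
print(sorted(shared))  # ['general:predicate=treat']
```

In this sketch, a classifier trained on the pooled, augmented instances can place weight on the `general:` copy when a feature behaves the same way in both domains, and on the domain-prefixed copy when it does not.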
Figure 1. Syntacti (...truncated)