Domain adaptation for semantic role labeling of clinical text

Zhang Y, et al. J Am Med Inform Assoc 2015;22:967–979. doi:10.1093/jamia/ocu048, Research and Applications

RECEIVED 15 April 2014
REVISED 5 December 2014
ACCEPTED 15 December 2014
PUBLISHED ONLINE FIRST 10 June 2015

Yaoyun Zhang1,*, Buzhou Tang1,2,*, Min Jiang1, Jingqi Wang1, Hua Xu1

Keywords: semantic role labeling, shallow semantic parsing, clinical natural language processing, domain adaptation, transfer learning

INTRODUCTION

Natural language processing (NLP) technologies are important for unlocking information embedded in narrative reports in electronic health record systems. Although various NLP systems have been developed to support a wide range of computerized medical applications, such as biosurveillance and clinical decision support, extracting semantically meaningful information from clinical text remains a challenge. Semantic role labeling (SRL)1 (also known as shallow semantic parsing),2 which extracts semantic relations between predicates and their arguments from different surface textual forms, is an important method for the extraction of semantic information. State-of-the-art SRL systems have been developed and applied to information extraction in open domains and various biomedical subdomains.3–12 However, very few SRL studies have been conducted in the clinical domain,13,14 probably due to the lack of large-scale annotated corpora. The creation of such clinical SRL corpora would be both time-consuming and expensive.13

In this study, we approach SRL on clinical narratives as a domain adaptation problem. The goal is to adapt the existing SRL corpora of newswire text15,16 and biomedical literature17 to the clinical domain. By transferring knowledge from existing corpora in other domains to the clinical domain, we aim to improve the performance of clinical SRL and reduce the cost of developing one de novo. We used three existing SRL corpora outside the clinical domain and evaluated three state-of-the-art domain adaptation algorithms on the task of SRL for clinical text. Our results showed that domain adaptation strategies were effective for improving the performance or reducing the annotation cost of SRL on clinical text. To the best of our knowledge, this is the first work that has introduced domain adaptation algorithms for clinical SRL.

BACKGROUND

The task of SRL is to label semantic relations in a sentence as predicate argument structures (PASs) to represent propositions.18 The definition of PAS originated from the predicate logic for proposition representation in semantics theory.2

There is a large body of work on extracting semantic relations in biomedical text.4–12,19–23 Many are based on the sublanguage theory by Harris,24 which describes the properties of language in closed domains. Typically, in a closed domain such as medicine, there are a limited number of primary semantic types and a set of constraints that can determine how different semantic types of the arguments can be linked to form semantic predications.25 The Linguistic String Project (LSP)21 and the Medical Language Extraction and Encoding System (MedLEE),22 which use sublanguage grammar, are two early NLP systems for the extraction of semantic relations in the medical domain.
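
To make the notion of a predicate argument structure (PAS) concrete, the following is a minimal sketch and not the representation used in this study. It assumes PropBank-style role labels (ARG0 for the agent, ARG1 for the thing acted on); the class names, the `make_pas` helper, and the two example sentences with their hand-written annotations are hypothetical and chosen only for illustration. It shows how an active and a passive sentence, which differ in surface form, can map to the same PAS.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Argument:
    role: str  # PropBank-style label such as "ARG0" or "ARG1" (assumed scheme)
    text: str  # surface text of the argument span, lowercased


@dataclass(frozen=True)
class PredicateArgumentStructure:
    predicate: str                  # predicate lemma
    arguments: FrozenSet[Argument]  # unordered, so word order does not matter


def make_pas(predicate: str, roles: Dict[str, str]) -> PredicateArgumentStructure:
    """Build a PAS from a lemma and a role -> text mapping (hypothetical helper)."""
    args = frozenset(Argument(role, text.lower()) for role, text in roles.items())
    return PredicateArgumentStructure(predicate.lower(), args)


# Hand-written annotations, assumed for illustration (not produced by a labeler).
# Active voice: "The surgeon removed the tumor."
active = make_pas("remove", {"ARG0": "The surgeon", "ARG1": "the tumor"})
# Passive voice: "The tumor was removed by the surgeon."
passive = make_pas("remove", {"ARG1": "The tumor", "ARG0": "the surgeon"})

# Both surface forms express the same proposition, so their PASs compare equal.
same_proposition = active == passive
print(same_proposition)  # prints True
```

Storing the arguments in a frozen set keeps the comparison independent of the order in which the arguments appear in the sentence, which is the property that lets one structure stand for several syntactic realizations.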
SemRep is another biomedical semantic relation extraction system, which extracts semantic predications defined in the Unified Medical Language System Semantic Network from biomedical literature.19,20 Recently, Cohen et al.26 examined the syntactic alternations in the argument structure of domain-specific verbs and associated nominalizations in the PennBioIE corpus, and found that even in a semantically restricted domain, syntactic variations are common and diverse. Currently, many sublanguage-based clinical NLP systems recognize semantic relations24 with rule-based methods that rely on manually extracted patterns.22,27 SRL, however, focuses on unifying variations in the surface syntactic forms of semantic relations based on annotated corpora. It is inspired by previous research into semantic frames28,29 and the link between semantic roles and syntactic realization.30 Although current SRL approaches are primarily developed in open domains (thus, types of semantic roles or

Correspondence to Hua Xu, Ph.D., University of Texas School of Biomedical Informatics at Houston, 7000 Fannin St., Suite 870, Houston, TX 77030, USA; Tel: 713-500-3924
© The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved.
For numbered affiliations see end of article.

ABSTRACT

Objective: Semantic role labeling (SRL), which extracts a shallow semantic relation representation from different surface textual forms of free text sentences, is important for understanding natural language. Few studies in SRL have been conducted in the medical domain, primarily due to the lack of annotated clinical SRL corpora, which are time-consuming and costly to build. The goal of this study is to investigate domain adaptation techniques for clinical SRL that leverage resources built from newswire and biomedical literature to improve performance and save annotation costs.
Materials and Methods: Multisource Integrated Platform for Answering Clinical Questions (MiPACQ), a manually annotated clinical SRL corpus, was used as the target domain dataset. PropBank and NomBank from newswire and BioProp from biomedical literature were used as source domain datasets. Three state-of-the-art domain adaptation algorithms were employed: instance pruning, transfer self-training, and feature augmentation. The SRL performance of the different domain adaptation algorithms was evaluated using 10-fold cross-validation on the MiPACQ corpus. Learning curves for the different methods were generated to assess the effect of sample size.

Results and Conclusion: When all three source domain corpora were used, the feature augmentation algorithm achieved a statistically significantly higher F-measure (83.18%) than the baseline trained on the MiPACQ dataset alone (F-measure, 81.53%), indicating that domain adaptation algorithms may improve SRL performance on clinical text. To achieve performance comparable to the baseline method that used 90% of the MiPACQ training samples, the feature augmentation algorithm required <50% of the training samples in MiPACQ, demonstrating that annotation costs of clinical SRL can be reduced significantly by leveraging existing SRL resources from other domains.

Figure 1. Syntacti (...truncated)