Nominalization and Alternations in Biomedical Language

PLOS ONE, Sep 2008

Background: This paper presents data on alternations in the argument structure of common domain-specific verbs and their associated verbal nominalizations in the PennBioIE corpus. Alternation is the term in theoretical linguistics for variations in the surface syntactic form of verbs, e.g. the different forms of stimulate in "FSH stimulates follicular development" and "follicular development is stimulated by FSH". The data is used to assess the implications of alternations for biomedical text mining systems and to test the fit of the sublanguage model to biomedical texts.

Methodology/Principal Findings: We examined 1,872 tokens of the ten most common domain-specific verbs or their zero-related nouns in the PennBioIE corpus and labelled them for the presence or absence of three alternations. We then annotated the arguments of 746 tokens of the nominalizations related to these verbs and counted alternations related to the presence or absence of arguments and to the syntactic position of non-absent arguments. We found that alternations are quite common both for verbs and for nominalizations. We also found a previously undescribed alternation involving an adjectival present participle.

Conclusions/Significance: We found that even in this semantically restricted domain, alternations are quite common, and alternations involving nominalizations are exceptionally diverse. Nonetheless, the sublanguage model applies to biomedical language. We also report on a previously undescribed alternation involving an adjectival present participle.

K. Bretonnel Cohen, Martha Palmer, Lawrence Hunter

1 Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, United States of America
2 Department of Linguistics, University of Colorado at Boulder, Boulder, Colorado, United States of America

Editor: Robert P. Futrelle, Northeastern University, United States of America

Funding: K. Bretonnel Cohen's and Lawrence Hunter's work was supported by grants G08LM009639, R01LM009254, and R01LM008111. Martha Palmer's work was supported by NSF grant CISE-CRI 0551615. J. Gregory Caporaso is supported by training grant fellowship T15LM009451. No sponsors or funders were involved in the design or conduct of the study; in the collection, analysis, or interpretation of the data; or in the preparation, review, or approval of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

This work is a step toward understanding the syntactic and semantic aspects of verb meaning in the biomedical domain. The goal is to lay the groundwork for a set of representations of domain-specific verbs that is broad enough in its coverage to scale up to realistic problems in information extraction, and deep enough in its representation to support accurate extraction of information in the face of syntactic variability and to allow for the resolution of coreferential and related (e.g. elliptical) references in text.

In an initial step, we sought to answer a very basic question: do alternations occur in biomedical texts? (Alternation is the term in theoretical linguistics for variations in the surface syntactic form of verbs.) We approached the problem by determining what the most frequent verbs are in biomedical text, then analyzing those verbs and their nominalizations in terms of the alternations that they participate in. Of the many classes of alternations that verbs participate in, we looked specifically at the passive alternation (Levin classes 5.1 Verbal Passive, 5.3 Adjectival Passive, and 5.4 Adjectival Perfect Participle) and at alternations related to transitivity (Levin class 1 Transitivity alternations and its descendants). We also report a previously undescribed alternation, Adjectival Present Participle.
For the nouns, we examined alternations in the presence or absence of arguments and in the syntactic position of non-absent arguments.

One characteristic of alternations is that they preserve the underlying semantics of an assertion even in the face of syntactic variability. For example, one commonly known alternation is the passive alternation. One claim of an alternations-based approach to explaining syntactic/semantic relations is that in "FSH stimulates follicular development" (PMID 12021046) and "follicular development is stimulated by FSH" (PMID 6615964) the underlying semantics of the sentences, i.e. that FSH is the stimulator and follicular development is the thing that is stimulated, is the same, even though in the first sentence FSH is the grammatical subject and follicular development is the grammatical object, while in the second sentence follicular development becomes the grammatical subject and there is no grammatical object, per se.

Alternations have been a topic of interest in the theoretical linguistics literature because they are thought to shed light on what is known in linguistics as the mapping problem: how it is that underlying semantics are realized in the syntax of sentences. One assumption of the model is that verbs with shared semantics will participate in the same alternations.

Alternations are of relevance to language processing and text mining because of the contribution that they might make to the development of broad-coverage rule- and pattern-based systems for relation extraction: if verbs with similar semantics do participate in the same alternations, then it might be possible to take advantage of this by inheriting or otherwise reusing abstract rules in broad classes of verbs.
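The idea of reusing abstract rules across a class of verbs can be illustrated with a minimal sketch. It is not the authors' system: the toy lexicon `VERBS`, the function `extract`, and the regular-expression patterns are all hypothetical, and a realistic extractor would operate over parsed text rather than raw strings. The sketch only shows one active rule and one passive rule being shared by every verb in the lexicon, so that both surface forms yield the same underlying roles.

```python
import re

# Hypothetical toy lexicon: base form -> (3rd-person present, past participle).
VERBS = {
    "stimulate": ("stimulates", "stimulated"),
    "increase": ("increases", "increased"),
}


def extract(sentence, verbs=VERBS):
    """Return (verb, agent, theme) using one shared active rule and one
    shared passive rule, or None if neither rule matches."""
    sentence = sentence.strip().rstrip(".")
    for base, (present, participle) in verbs.items():
        # Shared passive rule: THEME is/are/was/were PARTICIPLE by AGENT.
        # Checked first so that passive sentences are not misread as active.
        passive = re.fullmatch(
            rf"(?P<theme>.+?) (?:is|are|was|were) {participle} by (?P<agent>.+)",
            sentence,
        )
        if passive:
            return base, passive["agent"], passive["theme"]
        # Shared active rule: AGENT VERB THEME.
        active = re.fullmatch(
            rf"(?P<agent>.+?) {present} (?P<theme>.+)",
            sentence,
        )
        if active:
            return base, active["agent"], active["theme"]
    return None


active_result = extract("FSH stimulates follicular development")
passive_result = extract("follicular development is stimulated by FSH")
print(active_result)
print(passive_result)
```

Under these assumptions, both example sentences map to the same triple, with FSH as the stimulator and follicular development as the thing stimulated. Adding a verb means adding one lexicon entry rather than writing two new rules.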
For example, if it turns out to be the case that transitive verbs share the trait of being able to occur in the passive alternation, then system developers might be able to write just two rules for extracting relations from active and passive sentences and share those between all transitive verbs, rather than writing a separate active rule and a separate passive rule for each transitive verb in the lexicon.

Levin (1993) [1] identified fifty major classes of alternations. That work also identified 49 major semantic classes of verbs, grouped according to the alternations in which they do and do not participate. (There are also subclasses of the fifty major classes of alternations and of the 49 major classes of verbs.) To illustrate the relationship between the semantics of related verbs and their shared syntactic behaviors, consider what Levin termed calibratable change-of-state verbs. These verbs, such as increase, share the semantic characteristics of a state-change in the logical object of the verb, and the syntactic behavior that when they are intransitive, the grammatical subject of the verb is the undergoer of the change (i.e., is the logical object). Thus, in "the addition of hCG alone significantly increased lyase activity in these cells" (PMID 2788776) the verb increase is transitive and lyase activity is both the grammatical (...truncated)


Article home page: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0003158

K. Bretonnel Cohen, Martha Palmer, Lawrence Hunter. Nominalization and Alternations in Biomedical Language, PLOS ONE, 2008, Volume 3, Issue 9, DOI: 10.1371/journal.pone.0003158