Understanding disciplinary vocabularies using a full-text enabled domain-independent term extraction approach

PLOS ONE, Nov 2017

Publication metadata help deliver rich analyses of scholarly communication. However, research concepts and ideas are more effectively expressed through unstructured fields such as full texts. Thus, the goals of this paper are to employ a full-text enabled method to extract terms relevant to disciplinary vocabularies and, through them, to understand the relationships between disciplines. This paper uses an efficient, domain-independent term extraction method to extract disciplinary vocabularies from a large multidisciplinary corpus of PLOS ONE publications. It finds a power-law pattern in the frequency distributions of terms present in each discipline, indicating a semantic richness potentially sufficient for further study and advanced analysis. The salient relationships among these vocabularies become apparent when a principal component analysis is applied. For example, Mathematics and Computer and Information Sciences were found to have similar vocabulary use patterns along with Engineering and Physics, while Chemistry and the Social Sciences were found to exhibit contrasting vocabulary use patterns along with the Earth Sciences and Chemistry. These results have implications for studies of scholarly communication as scholars attempt to identify the epistemological cultures of disciplines, and a full text-based methodology could lead to machine learning applications in the automated classification of scholarly work according to disciplinary vocabularies.
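
To make the power-law claim concrete, the following is a minimal sketch of one common diagnostic: sort term frequencies by rank and estimate the slope of log(frequency) against log(rank). It is not the authors' procedure, which this preview does not describe; the toy vocabulary, the term names, and the helper functions `rank_frequency` and `loglog_slope` are all assumptions made for illustration. A roughly straight line on log-log axes is only suggestive of a power law, not a rigorous test.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return term frequencies sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)


def loglog_slope(freqs):
    """Ordinary least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical toy "discipline vocabulary": term k occurs about 1000 / k times,
# which is an idealized Zipf-like distribution, not data from the paper.
toy_tokens = []
for k in range(1, 51):
    toy_tokens.extend([f"term{k}"] * round(1000 / k))

freqs = rank_frequency(toy_tokens)
slope = loglog_slope(freqs)
print(f"estimated log-log slope: {slope:.2f}")
```

On real text, the tokens would come from the extracted terms of one discipline. A slope that stays roughly constant over several orders of magnitude of rank is the pattern the abstract refers to.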



Erjia Yan, Jake Williams, Zheng Chen
College of Computing and Informatics, Drexel University, Philadelphia, Pennsylvania, United States of America
Editor: Wolfgang Glänzel, KU Leuven, Belgium

Introduction

The bibliometric community has used scientific publications as an effective instrument to study scholarly communication.
Traditionally, bibliometric indicators were employed to assess research impacts [1-3]. Recent advances in bibliometrics have benefited from the use of network and statistical approaches to map science [4-6] and identify author communities [7-10]. Publication metadata, such as authors, journals, and references, were primarily used as the unit of analysis in these prior endeavors. The use of a more content-rich component, full texts, was largely absent. Consequently, great effort has gone into examining research metadata but not research content.

The composition of the research landscape is evolving: data, particularly scientific data, are increasingly becoming open and accessible. The increased access to data not only provides more efficient means of analysis, but also entails a paradigmatic shift in modes of inquiry, as scientists can now form diverse teams surrounded by data and conduct data-intensive research. The success of this transformation requires new methods to extract more granular and content-rich information from large publication data. This need falls within the realm of information extraction, since computational linguists have developed methods to identify terms in texts that describe domain-specific concepts. While modern natural language processing techniques have yielded satisfying results on recall and precision, they were primarily employed with the objective of retrieval, as opposed to understanding. Accordingly, systematic approaches are lacking on how to use these methods to understand the latent meanings of the texts of scientific publications and how to use them to address questions on scholarly communication.

Thus, the objectives of this paper are two-fold. First, it aims to develop a term weighting-based method to extract content-rich terms from full texts.
These terms can be broadly perceived as expressions in texts that convey information about the research-relevant aspects of publications, such as methods, theories, and concepts. Second, it uses the extracted terms to compare and contrast disciplines' vocabularies. These vocabularies are important signifiers of disciplinary discourse patterns and can be used to reveal the epistemological differences in disciplinary cultures; as Hyland [11] argued, "writing. . .[o]n the contrary, it helps to create those disciplines". The newly developed term extraction method allows us to examine these epistemological differences to a heretofore unattained extent, which complements scholarship on the language aspect of disciplinarity that was largely confined to analyzing samples of articles [12], dissertations [13], textbooks [14], and book reviews [15]. The paper provides insights into disciplinary vocabulary patterns and reveals scholarly communication at a new contextualized level.

Conducting content-rich disciplinarity studies has the readily apparent advantage of gaining concrete and fine-grained perceptions of how different scientific concepts are embedded and re (...truncated)


This is a preview of a remote PDF: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0187762&type=printable

Erjia Yan, Jake Williams, Zheng Chen. Understanding disciplinary vocabularies using a full-text enabled domain-independent term extraction approach, PLOS ONE, 2017, Volume 12, Issue 11, DOI: 10.1371/journal.pone.0187762