A Systemic Functional Approach To Automated Authorship Analysis

Journal of Law and Policy, Volume 21, Issue 2, SYMPOSIUM: Authorship Attribution Workshop, Article 3 (2013)

Recommended Citation: Shlomo Argamon, Ph.D & Moshe Koppel, Ph.D., A Systemic Functional Approach To Automated Authorship Analysis, 21 J. L. & Pol'y (2013). Available at: https://brooklynworks.brooklaw.edu/jlp/vol21/iss2/3

Shlomo Argamon* and Moshe Koppel**

INTRODUCTION

Attribution of anonymous texts, if not based on factors external to the text (such as paper and ink type or document provenance, as used in forensic document examination), is largely, if not entirely, based on considerations of language style. We will consider here the question of how to best deconstruct a text into quantitative features for purposes of stylistic discrimination. Two key considerations inform our analysis. First, such features should support accurate classification by automated methods. Second, and no less importantly, such features should enable a clear explanation of the stylistic difference between stylistic categories (read: authors) and why a disputed text appears more likely to fall into one or another category. The latter consideration is particularly important when a nonexpert, such as a judge or jury, must evaluate the results and reliability of the analysis.

We start from the intuitive notion that style is indicated in a text by those features of the text that indicate the author’s choice of one mode of expression from among a set of equivalent modes for a given content.
There are many ways in which such choices manifest themselves in a text. Specific words and phrases may be chosen more frequently by certain authors than others, such as the phrase “cool-headed logician” favored by the Unabomber. Some authors may habitually use certain syntactic constructions more frequently, as in Hemingway’s preference for short, simple clauses. Differences between authors will also arise at the level of the organization of the text as a whole, as some people may prefer to make reasoned arguments from evidence to conclusions, and others may prefer emotional appeals organized differently.

* Linguistic Cognition Laboratory, Department of Computer Science, Illinois Institute of Technology.
** Department of Computer Science, Bar-Ilan University, Ramat-Gan, Israel.

However, all of these “surface” linguistic phenomena have multiple potential underlying causes, not only authorship. They include the genre, register, and purpose of the text as well as the educational background, social status, and personality of the author and audience.1 What all these dimensions of variation have in common, though, is independence, to a greater or lesser extent, of the “topic” of the text. Hence the traditional focus in computational authorship attribution on features such as function word usage; vocabulary richness and complexity measures; and frequencies of different syntactic structures; which are essentially nonreferential. Early statistical attribution techniques relied on relatively small numbers of such features, while developments in machine learning and computational linguistics over the last fifteen to twenty years have enabled larger numbers of features to be generated for stylistic analysis. However, in almost no case is there strong theoretical motivation behind the input feature sets, such that the features have clear interpretations in stylistic terms.
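To make the traditional feature types mentioned above concrete, the following is a minimal Python sketch of how function-word frequencies and one simple vocabulary-richness measure (the type-token ratio) might be computed from a text. It is an illustration only, not the feature set developed in this Article. The ten-word list, the function names `tokenize` and `style_features`, and the `fw:` key prefix are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration); published studies typically use several hundred words.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "but", "not")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def style_features(text):
    """Return relative function-word frequencies plus a type-token ratio."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)
    # Relative frequency of each function word: occurrences / total tokens.
    features = {f"fw:{w}": counts[w] / len(tokens) for w in FUNCTION_WORDS}
    # Vocabulary richness: distinct word types divided by total tokens.
    features["type_token_ratio"] = len(counts) / len(tokens)
    return features
```

Features of this kind are largely independent of topic, but, as the text notes, a number such as the relative frequency of "the" carries no obvious stylistic interpretation on its own.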
We argue, however, that without a firm basis in a linguistic theory of meaning (not just of syntax), we are unlikely to gain any true insight into the nature of any stylistic distinction being studied. Such understanding is key to both establishing and explaining evidence for a proposed attribution. Otherwise, an attribution method is merely a black box that may appear to work for extrinsic or accidental reasons but not actually give reliable results in a given case. Furthermore, an attribution method that produces insight into the relevant language variation is more likely to be useful and accepted in a forensic context, all else being equal, as the judge and jury will be better able to understand the results.

1 DOUGLAS BIBER & SUSAN CONRAD, REGISTER, GENRE, AND STYLE (P. Austin et al. eds., 2009).

We therefore sketch here a computationally tractable formulation of linguistically and stylistically well-motivated features we have developed that permits text classification based on specific variation in choice of nonreferential meanings. The system produces meaningful information about the stylistic distinctions being analyzed, which can be used for interpretative and forensic purposes. We will explain our methodology and then use it as a case study for what any such methodology should provide.

Before we begin, it is worth briefly surveying the variety of problems that fall under the umbrella of “authorship analysis.” The simplest form of the problem is where an anonymous document is potentially attributable to one of a relatively small number (two to fifty, or so) of suspects. The question is then simply which of the suspects has a writing style most like that of the anonymous document.
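The simplest, closed-set form of the problem can be illustrated with a short Python sketch that picks the suspect whose known writing is most similar to the questioned document. The sketch assumes cosine similarity over relative function-word frequencies, which is one common choice and not necessarily the comparison used by the authors. The word list and the names `profile`, `cosine`, and `most_similar_suspect` are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical function-word list, an assumption for illustration only.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "but", "not")


def profile(text):
    """Represent a text as a vector of relative function-word frequencies."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_suspect(questioned, known_texts):
    """Return the best-matching suspect name and the score for every suspect.

    known_texts maps each suspect's name to a sample of their known writing.
    """
    q = profile(questioned)
    scores = {name: cosine(q, profile(text)) for name, text in known_texts.items()}
    return max(scores, key=scores.get), scores
```

Note that this function always names some suspect, however poor the best match is, which is exactly the limitation that makes the scenarios discussed next more difficult.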
More difficult (and much more likely in the real world) is the case where the document might not be authored by any of the suspects at all; in this case we must be able to determine that the document is not enough like any of the suspects to attribute authorship. The hardest version of this scenario is authorship verification, where the question is whether a single suspect did or did not author the anonymous document. All such authorship attribution scenarios assume a known set of suspects who are being evaluated for authorship of the questioned document. We require some quantity of texts written by each of the suspects to determine authorship.

On the other hand, if, as is often the case in police investigations, specific suspects are not known, we must consider the task of authorship profiling, determining as much about the author as possible, based upon clues in the document. As we will discuss below, a number of personal characteristics of an author can be reliably estimated from stylistic cues in a document. But first we will consider generally how we can quantitatively characterize the style of a text for computational analysis.

I. FUNCTIONAL LEXICAL FEATURES

Our methodology is based on Halliday’s Systemic Functional Grammar2 (“SFG”), which we find to be particularly well-suited to the sort of computational anal (...truncated)