Authorship Attribution: What's Easy and What's Hard?

Journal of Law and Policy, Volume 21, Issue 2, SYMPOSIUM: Authorship Attribution Workshop, Article 4, 2013

Recommended Citation: Moshe Koppel, Ph.D., Jonathan Schler, Ph.D. & Shlomo Argamon, Ph.D, Authorship Attribution: What's Easy and What's Hard?, 21 J. L. & Pol'y (2013). Available at: https://brooklynworks.brooklaw.edu/jlp/vol21/iss2/4

Moshe Koppel,* Jonathan Schler,† and Shlomo Argamon**

INTRODUCTION

The simplest kind of authorship attribution problem, and the one that has received the most attention, is the one in which we are given a small, closed set of candidate authors and are asked to attribute an anonymous text to one of them. Usually, it is assumed that we have copious quantities of text by each candidate author and that the anonymous text is reasonably long. A number of recent survey papers[1] amply cover the variety of methods used for solving this problem.

Unfortunately, the kinds of authorship attribution problems we typically encounter in forensic contexts are more difficult than this simple version in a number of ways. First, the number of suspected writers might be very large, possibly numbering in the many thousands. Second, there is often no guarantee that the true author of an anonymous text is among the known suspects. Finally, the amount of writing we have by each candidate might be very limited and the anonymous text itself might be short.

* Department of Computer Science, Bar-Ilan University, Ramat-Gan, Israel (Corresponding Author).
† Department of Computer Science, Bar-Ilan University, Ramat-Gan, Israel.
** Department of Computer Science, Illinois Institute of Technology.
[1] Patrick Juola, Authorship Attribution, 1 FOUND. & TRENDS IN INFO. RETRIEVAL 233, 238–39 (2006); Moshe Koppel et al., Computational Methods in Authorship Attribution, 60 J. AM. SOC’Y FOR INFO. SCI. & TECH. 9, 9 (2009); Efstathios Stamatatos, A Survey of Modern Authorship Attribution Methods, 60 J. AM. SOC’Y FOR INFO. SCI. & TECH. 538, 539 (2009).

This paper considers four versions of the attribution problem that are typically encountered in the forensic context and offers algorithmic solutions for each. Part I addresses the simple authorship attribution problem described above. Part II considers the long-text verification problem, in which we are asked if two long texts are by the same author. Part III discusses the many-candidates problem, in which we are asked which among thousands of candidate authors is the author of a given text. Finally, Part IV considers the fundamental problem of authorship attribution, in which we are asked if two short texts are by the same author. Although other researchers have considered these problems, here we offer our own solutions to each problem and indicate the degree of accuracy that can be expected in each case under specified conditions.

I. SIMPLE AUTHORSHIP ATTRIBUTION

The simplest problems arise when, as mentioned above, we have a closed set of candidate authors as well as an abundance of training text[2] for each author. Our objective is to assign an anonymous text to one of the candidate authors. For this purpose, we wish to design automated techniques that use the available training text to assign a text to the most likely candidate author.
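The closed-set setup can be made concrete with a minimal Python sketch. It is not the authors' method. It assumes a small, hypothetical list of function words as features and cosine similarity between frequency profiles as the decision rule, purely to show the shape of the task: known text for each candidate goes in, and exactly one candidate comes out. The names `FUNCTION_WORDS`, `profile`, `cosine`, and `attribute` are illustrative and do not come from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical feature set (an assumption, not from the paper):
# a handful of common English function words.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that",
                  "is", "it", "for", "with", "but", "not")


def profile(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute(anonymous_text, training_texts):
    """Assign `anonymous_text` to one author from a closed candidate set.

    `training_texts` maps each candidate author to a list of known writing
    samples; each author's samples are pooled into a single profile.
    Returns the best-scoring author and the full score table.
    """
    target = profile(anonymous_text)
    scores = {
        author: cosine(target, profile(" ".join(samples)))
        for author, samples in training_texts.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Because the candidate set is closed, `attribute` always returns some author, even when every score is low. That is exactly the assumption that fails in the forensic settings described in the Introduction, where the true author may not be among the known suspects.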
As a rule, such automated techniques can be divided into two main types: similarity-based methods and machine-learning methods.[3] In similarity-based methods, a metric is used to computationally measure the similarity between two documents, and the anonymous document is attributed to that author whose known writing (considered collectively as a single document) is most similar. Research in the similarity-based paradigm has focused on the choice of features for document representation (such as the frequency of particular words or other lexical or syntactic features in the document) and on the choice of distance metric.[4]

[2] Training text is simply a collection of writing samples by a given author that can be used to characterize the author’s writing style for purposes of attribution.
[3] Stamatatos, supra note 1, at 551.

In machine-learning methods, the known writings of each candidate author (considered as a set of distinct training documents) are used to construct a classifier that can then be used to categorize anonymous documents. The idea is to formally represent each of a set of training documents as a numerical vector and then use a learning algorithm to find a formal rule, known as a classifier, that assigns each such training vector to its known author. This same classifier can then be used to assign anonymous documents to (what one hopes is) the right author. Research in the machine-learning paradigm has focused on the choice of features for document representation and on the choice of learning algorithm.[5]

This section of the paper focuses on machine-learning methods. Here we consider and compare a variety of learning algorithms and feature sets for three authorship attribution problems that are representative of the range of classical attribution problems. The three problems are as follows:

1. A large set of emails between two correspondents (M. Koppel and J. Schler, co-authors of this paper), covering the year 2005.
The set consisted of 246 emails from Koppel and 242 emails from Schler, each stripped of headers, named greetings,

[4] See generally Ahmed Abbasi & Hsinchun Chen, Writeprints: A Stylometric Approach to Identity-Level Identification and Similarity Detection in Cyberspace, 26 ACM TRANSACTIONS ON INFO. SYS. 7:1 (2008); Shlomo Argamon, Interpreting Burrows’s Delta: Geometric and Probabilistic Foundations, 23 LITERARY & LINGUISTIC COMPUTING 131 (2007); John Burrows, ‘Delta’: A Measure of Stylistic Difference and a Guide to Likely Authorship, 17 LITERARY & LINGUISTIC COMPUTING 267 (2002); Carole E. Chaski, Empirical Evaluations of Language-Based Author Identification Techniques, 8 INT’L J. SPEECH LANGUAGE & L. 1 (2001); David L. Hoover, Multivariate Analysis and the Study of Style Variation, 18 LITERARY & LINGUISTIC COMPUTING 341 (2003).
[5] Abbasi & Chen, supra note 4, at 7:10; Koppel et al., supra note 1, at 11–12; Ying Zhao & Justin Zobel, Effective and Scalable Authorship Attribution Using Function Words, 3689 INFO. RETRIEVAL TECH. 174, 176 (2005); Rong Zheng et al., A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques, 57 J. AM. SOC’Y FOR INFO. SCI. & TECH. 378, 380 (2006).

(...truncated)