Authorship Attribution: What's Easy and What's Hard?
Journal of Law and Policy
Volume 21
Issue 2
SYMPOSIUM:
Authorship Attribution Workshop
Article 4
2013
Moshe Koppel, Ph.D.
Jonathan Schler, Ph.D.
Shlomo Argamon, Ph.D.
Recommended Citation
Moshe Koppel, Ph.D., Jonathan Schler, Ph.D. & Shlomo Argamon, Ph.D., Authorship Attribution: What's Easy and What's Hard?, 21 J. L.
& Pol'y (2013).
Available at: https://brooklynworks.brooklaw.edu/jlp/vol21/iss2/4
AUTHORSHIP ATTRIBUTION: WHAT’S
EASY AND WHAT’S HARD?
Moshe Koppel,* Jonathan Schler,† and Shlomo Argamon**
INTRODUCTION
The simplest kind of authorship attribution problem—and the
one that has received the most attention—is the one in which we
are given a small, closed set of candidate authors and are asked
to attribute an anonymous text to one of them. Usually, it is
assumed that we have copious quantities of text by each
candidate author and that the anonymous text is reasonably long.
A number of recent survey papers1 amply cover the variety of
methods used for solving this problem.
Unfortunately, the kinds of authorship attribution problems
we typically encounter in forensic contexts are more difficult
than this simple version in a number of ways. First, the number
of suspected writers might be very large, possibly numbering in
the many thousands. Second, there is often no guarantee that the
true author of an anonymous text is among the known suspects.
Finally, the amount of writing we have by each candidate might
be very limited and the anonymous text itself might be short.
* Department of Computer Science, Bar-Ilan University, Ramat-Gan, Israel
(Corresponding Author).
† Department of Computer Science, Bar-Ilan University, Ramat-Gan, Israel.
** Department of Computer Science, Illinois Institute of Technology.
1
Patrick Juola, Authorship Attribution, 1 FOUND. & TRENDS IN INFO.
RETRIEVAL 233, 238–39 (2006); Moshe Koppel et al., Computational
Methods in Authorship Attribution, 60 J. AM. SOC’Y FOR INFO. SCI. & TECH.
9, 9 (2009); Efstathios Stamatatos, A Survey of Modern Authorship
Attribution Methods, 60 J. AM. SOC’Y FOR INFO. SCI. & TECH. 538, 539
(2009).
This paper considers four versions of the attribution problem
that are typically encountered in the forensic context and offers
algorithmic solutions for each. Part I addresses the simple
authorship attribution problem outlined above. Part II
considers the long-text verification problem, in which we are
asked if two long texts are by the same author. Part III discusses
the many-candidates problem, in which we are asked which
among thousands of candidate authors is the author of a given
text. Finally, Part IV considers the fundamental problem of
authorship attribution, in which we are asked if two short texts
are by the same author. Although other researchers have
considered these problems, here we offer our own solutions to
each problem and indicate the degree of accuracy that can be
expected in each case under specified conditions.
I. SIMPLE AUTHORSHIP ATTRIBUTION
The simplest problems arise when, as mentioned above, we
have a closed set of candidate authors as well as an abundance
of training text2 for each author. Our objective is to assign an
anonymous text to one of the candidate authors. For this
purpose, we wish to design automated techniques that use the
available training text to assign a text to the most likely
candidate author. As a rule, such automated techniques can be
divided into two main types: similarity-based methods and
machine-learning methods.3
In similarity-based methods, a distance metric is used to
measure the similarity between two documents, and the
anonymous document is attributed to the author whose
known writing (considered collectively as a single document) is
most similar to it. Research in the similarity-based paradigm has
focused on the choice of features for document representation—
such as the frequency of particular words or other lexical or
2
Training text is simply a collection of writing samples by a given
author that can be used to characterize the author’s writing style for purposes
of attribution.
3
Stamatatos, supra note 1, at 551.
syntactic features in the document—and on the choice of distance
metric.4
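As a rough illustration of the similarity-based paradigm just described, the sketch below pools each candidate's known writing into a single document, represents every document by the relative frequencies of a few function words, and attributes the anonymous text to the candidate at the smallest cosine distance. The ten-word feature list, the choice of cosine distance, and the function names profile, cosine_distance, and attribute are assumptions made only for this example; they are not the features or metric of any particular study cited here.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny feature list (an assumption for this sketch).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]


def profile(text):
    """Return the relative frequency of each feature word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def attribute(anonymous_text, known_writings):
    """Attribute the text to the candidate whose pooled writing is closest.

    known_writings maps each candidate author to a list of known texts,
    which are joined and treated collectively as a single document.
    """
    target = profile(anonymous_text)
    distances = {
        author: cosine_distance(target, profile(" ".join(texts)))
        for author, texts in known_writings.items()
    }
    return min(distances, key=distances.get), distances


known = {
    "A": ["the cat of the house and the dog of the yard", "the end of the road"],
    "B": ["it is late but it is fine", "but it is that it is"],
}
author, distances = attribute("the top of the hill and the foot of the valley", known)
print(author, distances)
```

Swapping in a different feature set or distance metric changes only profile and cosine_distance, which mirrors the two design choices on which research in this paradigm has concentrated.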
In machine-learning methods, the known writings of each
candidate author (considered as a set of distinct training
documents) are used to construct a classifier that can then be
used to categorize anonymous documents. The idea is to
formally represent each of a set of training documents as a
numerical vector and then use a learning algorithm to find a
formal rule, known as a classifier, that assigns each such
training vector to its known author. This same classifier can then
be used to assign anonymous documents to (what one hopes is)
the right author. Research in the machine-learning paradigm has
focused on the choice of features for document representation
and on the choice of learning algorithm.5
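The machine-learning paradigm can be sketched in the same spirit. In the example below, each training document is converted to a numerical vector and a multiclass perceptron searches for a rule that assigns every training vector to its known author; the learned rule is then applied to an unseen text. The perceptron is used here only because it is short enough to show in full, and it, the ten-word feature list, and the function names vectorize, train_perceptron, and classify are assumptions of this sketch rather than the learning algorithms or feature sets compared in this paper.

```python
import re
from collections import Counter

# Hypothetical feature list, chosen only to keep the example small.
FEATURE_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]


def vectorize(text):
    """Represent a document as relative word frequencies plus a bias term."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in FEATURE_WORDS] + [1.0]


def score(weights, x):
    """Dot product of one author's weight vector with a document vector."""
    return sum(w * xi for w, xi in zip(weights, x))


def train_perceptron(documents, authors, epochs=50):
    """Learn one weight vector per author from labelled training documents."""
    vectors = [vectorize(d) for d in documents]
    weights = {a: [0.0] * len(vectors[0]) for a in sorted(set(authors))}
    for _ in range(epochs):
        mistakes = 0
        for x, true_author in zip(vectors, authors):
            guess = max(weights, key=lambda a: score(weights[a], x))
            if guess != true_author:
                mistakes += 1
                # Move the true author's weights toward x, the wrong guess away.
                for i, xi in enumerate(x):
                    weights[true_author][i] += xi
                    weights[guess][i] -= xi
        if mistakes == 0:  # every training vector is assigned correctly
            break
    return weights


def classify(weights, text):
    """Assign a text to the author whose weight vector scores it highest."""
    x = vectorize(text)
    return max(weights, key=lambda a: score(weights[a], x))


training_docs = [
    "the cat of the house",
    "the end of the road and the sea",
    "it is late but it is fine",
    "but that is it",
]
training_authors = ["A", "A", "B", "B"]
model = train_perceptron(training_docs, training_authors)
print(classify(model, "the foot of the hill"))
```

Unlike the similarity-based sketch, each known text is kept as a distinct training document here, so the learner sees variation within an author's writing as well as differences between authors.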
This section of the paper focuses on machine-learning
methods. Here we consider and compare a variety of learning
algorithms and feature sets for three authorship attribution
problems that are representative of the range of classical
attribution problems. The three problems are as follows:
1. A large set of emails between two correspondents (M.
Koppel and J. Schler, co-authors of this paper), covering the
year 2005. The set consisted of 246 emails from Koppel and 242
emails from Schler, each stripped of headers, named greetings,
4
See generally Ahmed Abbasi & Hsinchun Chen, Writeprints: A
Stylometric Approach to Identity-Level Identification and Similarity Detection
in Cyberspace, 26 ACM TRANSACTIONS ON INFO. SYS. 7:1 (2008); Shlomo
Argamon, Interpreting Burrows’s Delta: Geometric and Probabilistic
Foundations, 23 LITERARY & LINGUISTIC COMPUTING 131 (2007); John
Burrows, ‘Delta’: A Measure of Stylistic Difference and a Guide to Likely
Authorship, 17 LITERARY & LINGUISTIC COMPUTING 267 (2002); Carole E.
Chaski, Empirical Evaluations of Language-Based Author Identification
Techniques, 8 INT’L J. SPEECH LANGUAGE & L. 1 (2001); David L. Hoover,
Multivariate Analysis and the Study of Style Variation, 18 LITERARY &
LINGUISTIC COMPUTING 341 (2003).
5
Abbasi & Chen, supra note 4, at 7:10; Koppel et al., supra note 1, at
11–12; Ying Zhao & Justin Zobel, Effective and Scalable Authorship
Attribution Using Function Words, 3689 INFO. RETRIEVAL TECH. 174, 176
(2005); Rong Zheng et al., A Framework for Authorship Identification of
Online Messages: Writing-Style Features and Classification Techniques, 57 J.
AM. SOC’Y FOR INFO. SCI. & TECH. 378, 380 (2006).
JOURNAL OF LAW AND POL (...truncated)